inukshuk / anystyle-cli

AnyStyle Command Line Interface
BSD 2-Clause "Simplified" License

error: undefined method `split' for nil:NilClass #11

Closed myshevchuk closed 4 years ago

myshevchuk commented 4 years ago

Hi! Thank you very much for your invaluable work!

In an attempt to produce a custom finder model, I have run into the following issue:

  1. Running anystyle -f ttx find pdf/ ttx on a bunch of pdf files produces the corresponding set of ttx files.
  2. Then running anystyle train ttx my-model.mod produces the error:

    error: undefined method `split' for nil:NilClass

I'm using the Ruby shipped with macOS Catalina 10.15.4:

    # ruby -v
    ruby 2.6.3p62 (2019-04-16 revision 67580) [universal.x86_64-darwin19]

And the latest version of anystyle:

    # anystyle --version
    anystyle version 1.3.11 (cli 1.3.0, data 1.2.0)

I wonder whether it may have something to do with the fact that the ttx files contain non-printable characters like ^C, ^D or ^E.

Thank you in advance for your help!

myshevchuk commented 4 years ago

Additional information that may be relevant for locating the source of the error:

Running anystyle train file.xml model.mod where the file contains a <sequence> containing an empty element such as

  <sequence>
    <citation-number>(18)</citation-number>
    <author>Sumino, S.; Uno, M.; Fukuyama, T.; Ryu, I.; Matsuura, M.; Yamamoto, A.; Kishikawa, Y.</author>
    <journal>J. Org. Chem.</journal>
    <title></title>
    <date>2017,</date>
    <volume>82,</volume>
    <pages>5469.</pages>
  </sequence>

produces the same error:

error: undefined method `strip' for nil:NilClass

N.B. Empty sequences <sequence></sequence> also fail but produce a different error.

If train ttx involves some xml internally, that may explain the error.
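To locate such elements before training, a check along the following lines may help. It is only a sketch: the function name find_empty_elements is made up for illustration, and it assumes nothing about the file beyond well-formed XML with <sequence> elements like the one quoted above (the <dataset> root in the usage example is likewise an assumption).

```python
import xml.etree.ElementTree as ET


def find_empty_elements(xml_text):
    """Return (sequence_number, tag) pairs for empty sequences and empty child elements.

    Hypothetical helper, not part of AnyStyle. Sequences are numbered from 1
    in document order.
    """
    root = ET.fromstring(xml_text)
    problems = []
    for index, sequence in enumerate(root.iter("sequence"), start=1):
        children = list(sequence)
        if not children:
            # A sequence with no tagged segments at all.
            problems.append((index, "sequence"))
        for child in children:
            # An element such as <title></title> has text None or only whitespace.
            if not (child.text or "").strip():
                problems.append((index, child.tag))
    return problems


if __name__ == "__main__":
    sample = """<dataset>
      <sequence>
        <journal>J. Org. Chem.</journal>
        <title></title>
        <date>2017,</date>
      </sequence>
      <sequence></sequence>
    </dataset>"""
    for number, tag in find_empty_elements(sample):
        print(f"sequence {number}: empty <{tag}>")
```

Reporting the problems rather than deleting them leaves the decision (remove the element, or fill in the missing text) to whoever reviews the training data.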

inukshuk commented 4 years ago

Could you share a ttx file with me that causes the issue?

myshevchuk commented 4 years ago

The file is copyright protected, so I created a private repo for it.

inukshuk commented 4 years ago

Thanks!

The CLI tool was still based on old code so the files were never fed to the finder module at all. I hope this is fixed with 1.3.1?

However, with your test file, this just leads us to the next error, which actually is related to the empty token one. For some reason the final line of your ttx has an empty token. Since the file was generated by the Finder module it's definitely something we should fix, though I'm not completely sure why it happened. I would assume it's related to whether or not pdftotext generated an empty new line at the end of input. In any case, this is something you should easily be able to work around with sed or some similar tool until we fix it. Basically, if you take a look at the last line of the ttx there is nothing after the tag -- if you remove that line it should work (or add a newline or similar character there).
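For anyone who prefers a script to sed, the workaround might be sketched as follows. This assumes each ttx line has the form "label | token" with a "|" separator; the function name drop_trailing_empty_token and the sample labels are made up for illustration and are not part of AnyStyle.

```python
def drop_trailing_empty_token(ttx_text, separator="|"):
    """Remove the final line of a ttx document if nothing follows its separator.

    Hypothetical helper: only the last line is inspected, since that is the
    line described as causing the error. Every other line is left untouched.
    """
    lines = ttx_text.splitlines()
    if lines:
        _label, found, token = lines[-1].partition(separator)
        # Drop the line only if it has a separator and an empty token after it.
        if found and not token.strip():
            lines.pop()
    return ("\n".join(lines) + "\n") if lines else ""


if __name__ == "__main__":
    sample = (
        "text          | Some body text\n"
        "ref           | (18) Sumino, S. J. Org. Chem. 2017, 82, 5469.\n"
        "              |"
    )
    print(drop_trailing_empty_token(sample), end="")
```

The function works on a string, so reading the file and writing the result back is left to the caller; keeping a copy of the original file is advisable.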

myshevchuk commented 4 years ago

Great! Thank you very much!

I can confirm this is fixed for 1.3.1.

If I understand it correctly, the bulk training data - ttx and xml - can be conveniently generated with the help of Finder or Parser modules, respectively, but to get really good training sets the files should anyway be edited or at least inspected manually/in a semi-automated way? So in this respect removing these eof newlines (thanks for the tip!) is just a part of the process.

By the way, I'm currently working on an Emacs interface to Anystyle-CLI intended as an everyday tool for researchers, which would provide an interactive interface for the reference retrieval workflow (in development) as well as assist in the creation of custom training sets (planned). You can check it here: https://github.com/org-roam/org-roam-bibtex/pull/44. It's still quite basic right now, though.

inukshuk commented 4 years ago

Yes, if an existing model is already good enough, the typical workflow is to have the model create the training data (i.e., you parse, and any results that are not good enough are edited and become the new training data). If you need to start from scratch it's more work -- this is sort of how I came up with the ttx format which, admittedly, is quite brittle: but I needed something that was easy to review, edit and diff in a text editor. With a model already in hand, it might make sense to switch to something more structured like XML, although this would likely require better tooling (it would be crazy to try to review a full text tagged in XML).

When working with the finder model, the idea was to reduce the number of tags to an absolute minimum. It would be interesting to include more specific tags, e.g., to detect abstract, keywords, main title, main authors, table of contents, sections etc. In practice however, this means that reviewing and tagging documents becomes even more time consuming than it already is; simply not worth it if you want to only extract references. Unfortunately header, footer, section titles, page numbers and blank lines, which are all very important parts of reference sections, occur on pretty much every page so reviewing usually involves checking all the pages.

The more data you have, the easier it gets to detect badly trained input: if you're curious you can check the Rake tasks in the main anystyle repo for details, but basically, what I usually did was add a new document to the training set, train the model, then parse all the documents of the training set and save the results as ttx. Then you can use a normal diffing tool in your editor to compare the result with the training document: this way you'll quickly find lines which are potentially problematic (because they'll often get parsed wrong).
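The comparison step can also be scripted. The sketch below is a simplified stand-in for a diffing tool, not the Rake tasks mentioned above: it assumes both ttx documents have the same number of lines, that each line has the form "label | token", and that an empty label repeats the label of the previous line. The names resolved_labels and mismatched_labels are made up for illustration.

```python
def resolved_labels(ttx_text, separator="|"):
    """Return one label per line, carrying the last label over blank-label lines.

    Assumption: an empty label means "same label as the line above".
    """
    labels, current = [], ""
    for line in ttx_text.splitlines():
        label = line.partition(separator)[0].strip()
        if label:
            current = label
        labels.append(current)
    return labels


def mismatched_labels(training_ttx, parsed_ttx, separator="|"):
    """List (line_number, training_label, parsed_label) where the two documents disagree."""
    pairs = zip(
        resolved_labels(training_ttx, separator),
        resolved_labels(parsed_ttx, separator),
    )
    return [
        (lineno, expected, actual)
        for lineno, (expected, actual) in enumerate(pairs, start=1)
        if expected != actual
    ]


if __name__ == "__main__":
    training = "text | Intro\nref  | (1) Smith, J.\n     | continued\ntext | Footer"
    parsed = "text | Intro\nref  | (1) Smith, J.\ntext | continued\ntext | Footer"
    for lineno, expected, actual in mismatched_labels(training, parsed):
        print(f"line {lineno}: tagged {expected!r} in training, parsed as {actual!r}")
```

Comparing resolved labels rather than raw lines avoids reporting a difference when one file spells out a label that the other leaves implicit.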

One word of caution: the default finder model is trained on many more documents than are available in the repository because of copyright issues (they were all publications from Stanford University). Unfortunately you will lose these if you create a new model based on the public training set and your own documents.

myshevchuk commented 4 years ago

Thanks for such a detailed explanation! I can't really complain about the default finder model. So far (maybe a dozen documents) it has always correctly detected reference sections, and the plain text references as output with the -f ref flag needed only minor post-processing with a simple script to get each one onto a new line. It happens that many journals, at least in my discipline, strive to utilize as much printing space as possible and often group references into blocks (1) (a) ref1 (b) ref2 (c) ref3 \newline (2) (a) ... This is something Anystyle isn't used to handling by default yet, but is easily fixable with regular expressions.
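A post-processing step of this kind might look roughly like the sketch below. It is not the script actually used: the function name split_grouped_references and both patterns are assumptions, namely that each block starts with a number in parentheses and that sub-references are introduced by a single lowercase letter in parentheses.

```python
import re

# Leading block number such as "(1) " at the start of a line.
BLOCK_NUMBER = re.compile(r"^\s*\(\d+\)\s*")
# Sub-reference marker such as "(a) " or "(b) ".
SUB_LABEL = re.compile(r"\s*\([a-z]\)\s+")


def split_grouped_references(text):
    """Return one reference per list item from blocks like "(1) (a) ref1 (b) ref2"."""
    references = []
    for line in text.splitlines():
        body = BLOCK_NUMBER.sub("", line, count=1)
        for part in SUB_LABEL.split(body):
            part = part.strip()
            if part:
                references.append(part)
    return references


if __name__ == "__main__":
    grouped = (
        "(1) (a) Smith, J. J. Org. Chem. 2017, 82, 5469. "
        "(b) Doe, A. Org. Lett. 2016, 18, 100.\n"
        "(2) Roe, B. Science 2015, 1, 2."
    )
    print("\n".join(split_grouped_references(grouped)))
```

A marker like "(a)" that occurs inside a reference would be split as well, so the output still needs a quick look before it is parsed.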

Once each reference was on a new line, I ran -f bib parse to get bibtex records that looked decent at first glance but on second glance were at most 60% correct. The main problem was that journal names were too often recognized incorrectly and parts of the strings were erroneously tagged as title or author fields. I then started to experiment with train xml on those plain text references, manually reviewing each xml output, then appending it to the core.xml file, and eventually retraining the Parser model on the core+.xml set. And I must say it is simply amazing! It took me exactly 3 such iterations to get to a 90-95% success rate. A couple more, and the model now parses all new documents correctly.
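The appending step can be sketched as follows, assuming both files are well-formed XML whose <sequence> elements sit under a single root element. The function name append_sequences is made up for illustration; retraining itself is still done with anystyle train.

```python
import xml.etree.ElementTree as ET


def append_sequences(core_xml, reviewed_xml):
    """Return core_xml with every <sequence> from reviewed_xml appended to its root.

    Hypothetical helper: works on strings, so reading core.xml and writing
    the combined set is left to the caller.
    """
    core = ET.fromstring(core_xml)
    reviewed = ET.fromstring(reviewed_xml)
    for sequence in reviewed.iter("sequence"):
        core.append(sequence)
    return ET.tostring(core, encoding="unicode")


if __name__ == "__main__":
    core = "<dataset><sequence><author>Smith, J.</author></sequence></dataset>"
    reviewed = "<dataset><sequence><author>Doe, A.</author></sequence></dataset>"
    print(append_sequences(core, reviewed))
```

Writing the result to a new file rather than over core.xml keeps the original training set available for comparison.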

So yeah, thanks a lot once again!