inukshuk / anystyle

Fast citation reference parsing
https://anystyle.io

Footnotes #129

Open edsu opened 4 years ago

edsu commented 4 years ago

I've noticed that anystyle works beautifully on endnotes (citations at the end of a document) but seems to fail to pick up footnotes (citations at the end of each page). Attached is an example.

article.pdf

I was wondering if anyone out there is training anystyle to recognize footnotes? It would be great to work on creating a training dataset, if this seems like something that ought to work in theory, given good training data.

a-fent commented 4 years ago

I did a short project to investigate this problem last year and looked at various open-source software tools, including Anystyle. I was interested because footnotes and even sidenotes are common in many of the disciplines (esp. humanities) and document types (e.g. reports) that I work with.

My conclusion was pretty much that recognising footnotes is a hard problem which none of them (including grobid, Anystyle, cermine, Excite) has really solved to anywhere near the reliability with which single-line citations can be parsed and bibliographies can be recognised in document structure. I think this is partly because the visual cues by which humans recognise footnotes (e.g., as in your sample document, separation from the main text, smaller type, a different typeface) are lost when a PDF is converted to plain text, which all packages do, although some do try to retain layout information.

The content of footnotes is also tricky. They contain bibliographic information, but often in an abbreviated form (e.g. German history texts often use only *author*, *first word from title* in repeated citations). They also often contain extraneous non-bibliographic information, such as remarks on sources and terms.

That said (sorry to be pessimistic), if you have a tightly defined format (e.g. from a specific journal where footnotes are predominantly used for bibliographies), adding training data would certainly help.

philgooch commented 4 years ago

I'd echo @a-fent 's comments. This is as much of a PDF parsing problem as it is a reference parsing problem. The way I've approached this in the past was to convert the PDF to text, then build a binary line classifier that learns to classify each line of text in the PDF as being or containing a reference or not. Getting clean text, one paragraph per line, to form the input to the classifier was the biggest challenge there.

Once you have extracted your reference lines/footnotes/endnotes, then you can parse them. Separating any references included within a footnote narrative is a separate problem though!
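
To make that concrete, here's a hypothetical sketch of the kind of line-level features such a binary classifier could be trained on (illustrative only; feature names and patterns are my assumptions, not any actual implementation):

# Hypothetical feature extractor for a binary "reference line?" classifier.
# Feature names and patterns are illustrative assumptions.
def line_features(line)
  {
    has_year:        line.match?(/\b(?:19|20)\d{2}\b/),           # publication years
    has_page_range:  line.match?(/\bpp?\.\s*\d+(?:\s*-\s*\d+)?/), # "p. 4-33"
    starts_numbered: line.match?(/\A\s*\[?\d+[.\]]/),             # "12." or "[12]"
    comma_density:   line.count(',').fdiv([line.length, 1].max),
    length:          line.length
  }
end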

Having said that, the Scholarcy reference extraction API (which uses an older version of the Anystyle gem and different training data) doesn't do a bad job of your example file (YMMV). It doesn't handle the _Ibid_s, but it does get the complete references in the footnotes. (Full disclosure: I am the developer of Scholarcy)

https://ref.scholarcy.com/api/references/download?url=https%3A%2F%2Fgithub.com%2Finukshuk%2Fanystyle%2Ffiles%2F3764680%2Farticle.pdf&document_type=full_paper&reference_style=ensemble&reference_format=bibtex&engine=v1

inukshuk commented 4 years ago

Echoing @a-fent's observations above, in my experience the biggest challenge was the format of the footnotes themselves. Extracting the footnotes from the page is also hard, mainly because it's harder to normalize outlier lines there: for whole bibliography sections we can normalize outliers fairly reliably, say, when you have lots of contiguous lines identified as references and one or two other lines in between. But I found that it is really hard to then differentiate between regular footnotes and reference footnotes, especially because they are often mixed: a reference plus some extra commentary, for example.

If you have a well-defined reference format, I think it might be worth a try to add a few articles for training.

When parsing PDFs, one thing to experiment with is the --layout / --no-layout option. For footnotes, though, I think layout mode (which is the default, if I remember correctly) should already be the best option.
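
For example, to compare both modes (flag placement may vary between CLI versions):

anystyle find article.pdf
anystyle --no-layout find article.pdf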

edsu commented 4 years ago

Thanks for all these details. I know this is a loaded question, but if I wanted to extract footnotes from a single journal with a consistent format, how much training data do you think I would need in order to start generating decent results?

Also, please feel free to close this issue as it's more of a conversation than anything :-)

a-fent commented 4 years ago

If I may chip in again - my experience is that a little training goes a long way. If you have a distinctive pattern or set of characteristics that is not otherwise known, then the model only needs a small number of examples in order to classify further such examples correctly.

edsu commented 4 years ago

@a-fent thanks, that definitely is encouraging me to give it a try. By a small number do you mean less than 10, 100, 1000, or more?

a-fent commented 4 years ago

Training the citation parser, I have determined that the very scientific amount of "half a dozen or so" is often enough, if the format/pattern to be recognised is distinctive. i.e. I would start small and see how it goes.

inukshuk commented 4 years ago

Marking up full-text documents is a lot of work, because you need to check each line. The best way to start is to use an existing model to create an initial .ttx document, and then just go through and correct any offending lines.
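
For example, something like this should produce an initial .ttx to correct by hand (assuming .ttx is among the find command's output formats):

anystyle -f ttx find article.pdf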

Since the goal of the finder module is to extract reference sections, we kept the number of available labels very low (marking up the documents would be much harder if you tried to classify lists, illustrations, etc. as well). Making sure titles, page headers and footers are labeled correctly is still a lot of work. But in my experience, training on 1-3 documents should already get you far (and if you need to train on more, the initial versions produced by parsing with your interim model should already be better).

There are some rake tasks in the repository that can help you create a new model. rake train[finder] will re-build the finder model using the .ttx files in res/finder. You can run rake check[finder] to parse all the files in res/finder (this helps identify inconsistent/ambiguous lines).

Most importantly, rake delta[path/to/file/or/folder], pointed at a .ttx file (or a folder containing .ttx files), will create deltas for you. Basically, it parses those files (and saves the results in the root directory) and prints a summary of the differences. You can then diff the new and the original file to see all lines which were tagged differently.
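
For example (the file name is made up; as noted above, the newly parsed copy ends up in the root directory):

rake delta[res/finder/myjournal.ttx]
diff myjournal.ttx res/finder/myjournal.ttx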

The 'goal' of the finder module is good 'ref' output (all the references, one reference per line). This is then fed to the parser module if you parse all the way through.

If you know what your articles look like, you can also normalize those intermediate results. For example, we currently use the fact that reference sections at the end of articles/chapters are contiguous to include or drop outlier lines. If you know that there are footnotes and that your footnotes contain references, you could do something similar for all lines towards the bottom of a page once a reference has already been detected above, or something along those lines.
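
As a purely hypothetical sketch of such a step (this is not part of AnyStyle; the [label, line] page layout and the :ref/:text labels are assumptions for illustration):

# Promote lines near the bottom of a page to :ref once a reference
# has already been detected above them on the same page.
# page is an array of [label, line] pairs; window is how many trailing
# lines count as "towards the bottom" of the page.
def promote_footnote_refs(page, window: 8)
  cut = [page.length - window, 0].max
  seen_ref = false
  page.each_with_index.map do |(label, line), i|
    seen_ref ||= label == :ref
    if i >= cut && seen_ref && label == :text
      [:ref, line]   # outlier line below a detected reference: keep it
    else
      [label, line]
    end
  end
end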

amkelly commented 3 years ago

Hi there, I'm working on a data set of student papers and think I'd benefit from training a model on a handful of those papers, perhaps in addition to the standard model.

I see you've suggested marking up some full documents as part of the answer to the original question and re-building the finder model with the .ttx files in res/finder. I am new to the machine learning and document tagging end of the software, so I wonder if you can suggest a good way to mark up my documents in the .ttx format you're using, even if that just ends up being a useful shortcut in a text editor.

Thanks so much for all the work you all have done on Anystyle. It's done a remarkably good job on my data set so far, and it's been a great piece of software to work with.

I hope my comment here is still on-topic enough for this discussion; it didn't seem to quite warrant a new issue.

Thanks again.

inukshuk commented 3 years ago

@amkelly I'll try to come back to this when I have more time. Just briefly: the .ttx format is clearly the result of my personal workflow. I was working in VIM, and this made it easy to create a new .ttx by hand: simply add the 'prefix/inset' to each line, then quickly go through the document adding tags to the lines. If a line has no tag, it keeps using the previous tag, which means you can create a new .ttx very quickly by tagging only a handful of lines.
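
For illustration, a hand-tagged .ttx might look like this (the tag names beyond ref and blank, which both appear later in this thread, are my assumption):

title         | A Study of Meaningless Nonsense
text          | This is ordinary body text; a line with no tag of
              | its own keeps the tag of the line above it.
ref           | Walter, Journal of meaningless nonsense, 1944, p. 4-33.
blank         | 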

Once you have an OK model and you're trying to make it more consistent or to detect new features, my typical workflow is to add a new document to the model and then parse this document again (along with the other source documents) and check for inconsistencies. Ideally, all your source documents would be tagged perfectly and consistently; in practice they will contain mistakes.

As you add more training data, the model will either get better or worse: that means you will begin to see more errors/differences when you parse your source documents again. Each such difference is either an error in the way the source document was tagged (that is, your model improved and you should fix the source document) or an error in the way the document was tagged by the new model (that is, your model got worse at tagging the line in question, perhaps because the newly added document still contains errors). In either case, you'll have to compare two versions of the same .ttx document and fix one or the other. In VIM's diff mode this is really easy: you can view just the lines which differ side by side, make your changes, and watch the lines 'disappear'.

amkelly commented 3 years ago

Thanks so much for the quick reply. This definitely looks like enough to get me up and running!

Thanks again!

cboulanger commented 2 years ago

Hi, I am trying to train a finder model with self-annotated material that I converted from the EXCite/Cermine format (.csv) to .ttx. Unfortunately, I am getting an error:

anystyle train train/finder train/finder.mod
error: undefined method `gsub' for nil:NilClass

I am using the CLI from gem install anystyle-cli. Any tips on how to debug the code (i.e. set up the development version)? I don't speak Ruby, unfortunately, but I can make sense of the code and have the RubyMine IDE to run a debugger.

cboulanger commented 2 years ago

I am posting this here because I have the exact same use case: trying to parse references from footnotes. My training material consists of 40 documents with annotated references in footnotes.

inukshuk commented 2 years ago

My first guess is that there's a syntax error in the .ttx somewhere. If you can share a file where this happens I can take a quick look.

cboulanger commented 2 years ago

@inukshuk Because the material is copyrighted, I cannot share the documents publicly, but do you use Gitter by any chance? I could send you the link there.

cboulanger commented 2 years ago

It might be good to catch the exception and output the name of the document and (if possible) the line number at which the error was encountered.

cboulanger commented 2 years ago

BTW, I am happy to add .ttx as an export format to the online annotation tool https://cboulanger.github.io/excite-docker/

inukshuk commented 2 years ago

Can you try this, e.g., in irb:

require 'anystyle'
doc = AnyStyle::Document.open './path/to/file.ttx'

And then just to see if the file was opened without problems, e.g.:

doc.pages
doc.title
doc.references

If there are any issues opening the ttx files I'd expect this to fail.

cboulanger commented 2 years ago

Ok, thanks, that was useful. Turns out that a trailing line

blank         | 

will crash the document parser for some reason. I'll make sure that the converter script removes those, although the document parser should probably be made more robust in that regard.

cboulanger commented 2 years ago

I am training a finder model using 38 .ttx files containing references in footnotes, converted from a different format (happy to send them offline; they are from copyrighted material):

$ anystyle --overwrite train path/to/ttx test/finder.mod
$ anystyle -F test/finder.mod -f csl --verbose find test/10.1515_zfrs-1980-0104.pdf
Analyzing 10.1515_zfrs-1980-0104.pdf ...
no references found.

The document is one that was used to train the model, so it should definitely find at least the references it was trained on.

The .ttx format has the drawback that it cannot deal with cases where there are several references in the same line, such as in this made-up example:

53 See Walter, Journal of meaningless nonsense, 1944, p. 4-33; also Miller, Introduction, in: Collected papers, London, Macmillan 1964, p. 5-15.

In the EXcite project, they use a pseudo-XML markup with <ref> / </ref> tags that makes it possible to separate these. Is this something you could support?
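
Applied to the made-up example above, that would look something like:

53 See <ref>Walter, Journal of meaningless nonsense, 1944, p. 4-33</ref>; also <ref>Miller, Introduction, in: Collected papers, London, Macmillan 1964, p. 5-15</ref>.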

cboulanger commented 2 years ago

After looking closer at the source code, I can see that AnyStyle currently trains the finder model by labelling whole lines, which I fear defeats the goal of parsing footnote-based references, which do not cleanly start and end on separate lines. Would an alternative approach be possible, in which whole documents are split not into lines but into sentences, and these sentences are then labelled? This would make it possible to deal with a large body of literature in the humanities that puts references into footnotes rather than into a separate bibliography.

inukshuk commented 2 years ago

I doubt that a finer-than-line granularity in the finder model would be practical. References in a paper or book make up only a small part of the text, yet for the model to be consistent you have to mark up the full input. This means that, for a model based on word tokens, you'd have to mark up every word in the text. Obviously, most of the labeling would be done by previous versions of your model, but still: each word in your texts is now a potential source of inconsistencies that has to be accounted for.

Instead, I'd address footnotes the same way we handle reference sections: with separate parsing steps of increasing granularity. The finder model is line-based; its goal is just to identify lines containing references. An intermediate step then joins lines correctly, so that we end up with one line per reference to feed to the parser model. However, you could adapt the parser model to handle multiple references (or to deal with line breaks) instead, and then also to label non-pertinent text (of the kind you'd be likely to find in footnotes).

cboulanger commented 2 years ago

@inukshuk You are right: it makes more sense to offload the task of finding multiple references in one line to the parser, which does reference segmentation. So the first step would be to identify the reference lines in the footnotes. Unfortunately, when I train a model using my converted annotations and use the model to analyze a document with anystyle -F test/finder.mod -f csl --verbose find xxx.pdf, I get "no references found". Any idea how I can debug this? Happy to post the model source and some training doc samples if that's useful.

inukshuk commented 2 years ago

To get started, I'd save the .ttx output and inspect it. If no references were found, it means that either no reference lines were detected at all or that they were sparse and therefore dropped. Generally, if you don't have contiguous reference sections, you'll probably want to keep all reference lines by using the --solo option.

cboulanger commented 2 years ago

@inukshuk Yay! The --solo option was the key!

$ anystyle -F test/finder.mod -f csl --verbose find --solo test/10.1515_zfrs-1980-0104.pdf test
Analyzing 10.1515_zfrs-1980-0104.pdf ...
47 references found.

Cool. The results are OK given all the noise in the footnotes that confuses the parser. Interestingly, if only a last name has been provided, it is translated into the CSL author[].given property.
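
In CSL-JSON terms, a bare "Miller" comes out roughly as (abridged):

{ "author": [{ "given": "Miller" }] }

rather than the arguably more useful { "author": [{ "family": "Miller" }] }.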

A question on the parser model that ships with the executable: has it been trained with all of the annotations in anystyle/res/parser or just the "gold" one? And is there a way to train a parser model using several files, or do all annotations have to be put into one XML file?

cboulanger commented 2 years ago

To further improve the results, I need to set up an evaluation workflow. I assume the "check" command can be used for this? Would I split my gold data into training and testing material and then run check to see the error rate?

If I run check on the material that I trained the model with, I get the following results

$ anystyle -F test/finder.mod check path/to/ttx-test/

Checking 10.1515_zfrs-1980-0104.tt   1 seq 100.00%     20 tok  2.20%  0s
Checking 10.1515_zfrs-1980-0105.tt   1 seq 100.00%      8 tok  0.44%  0s
Checking 10.1515_zfrs-1980-0202.tt   1 seq 100.00%      6 tok  0.31%  0s
<snip>
Checking 10.1515_zfrs-1985-0102.tt   ✓                                0s

I wonder about the 100% error rate though ...

inukshuk commented 2 years ago

The default model is trained using the 'core' dataset. The 'gold' set is used mainly to ensure quality when we update the model, but in general all references in core and gold have been checked by humans and there shouldn't be many errors. The other datasets are either unchecked parse results or datasets marked 'for training' on anystyle.io, and their quality is pretty much unknown.

Using the CLI tool you need to supply a single file, but datasets are pretty easy to combine. Instead of merging the files manually, I'd suggest using the Wapiti::Dataset class in an IRB session or a short script. Datasets implement Ruby's Comparable/Enumerable interfaces, so you can combine them using regular set operations; this way you won't have to worry about duplicates, for example. In fact, that's just how the check function above works: it parses a tagged dataset and then computes the delta using the - operator. To combine two sets and drop duplicates you can use |.
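
For example, a minimal IRB sketch (the file names are made up, and the exact load/save method names are my assumption; the | union is as described above):

require 'anystyle'

a = Wapiti::Dataset.open 'core.xml'
b = Wapiti::Dataset.open 'extra.xml'

combined = a | b             # set union; duplicates are dropped
combined.save 'combined.xml' # a single file you can pass to the trainer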

In general, when you parse the dataset used to create your model you should expect close to 100% matching results. A high error rate typically means that there are inconsistencies either in your training data or in the feature extraction logic.

inukshuk commented 2 years ago

Our name parser interprets a single word as a given name. For references in Western languages, family is arguably the better choice, but it's not an easy call to make without context. We could add an option to the name normalizer to control what to pick if there's only a single name part.

cboulanger commented 2 years ago

> In general, when you parse the dataset used to create your model you should expect close to 100% matching results. A high error rate typically means that there are inconsistencies either in your training data or in the feature extraction logic.

However, how does check work with the finder model and .ttx annotations? Since the finder doesn't deal with sequences or tokens, just with lines, how should I interpret the output of $ anystyle -F test/finder.mod check path/to/ttx-test/ in that case?

inukshuk commented 2 years ago

The finder model also deals with sequences and tokens; it's just that a line represents a token and a document a sequence (and a dataset is a set of sequences). Obviously the seq error rate is not very interesting for a single ttx file, since a single mislabeled line already counts the whole document-sequence as wrong (hence the 100.00% above), but the token error rate tells you how many lines were labeled unexpectedly.