inukshuk / anystyle

Fast citation reference parsing
https://anystyle.io

CRF / precision-recall? #164

Open silviaegt opened 3 years ago

silviaegt commented 3 years ago

I'm currently trying to find the best way to extract citations from my corpus of 3,000 electronic dissertations. I'm considering "anystyle" but haven't been able to fully understand your CRF approach (probably because I'm not well versed in Ruby), so if you could point me to a blog post I should read, I would be very grateful!

inukshuk commented 3 years ago

AnyStyle was inspired by [this ParsCit paper]. The idea is to segment each reference into tokens (basically words plus punctuation) and extract a number of 'features' for each token. Feature extraction is fully deterministic and for the most part trivial (e.g., 'the token is numeric', 'the token has trailing punctuation', etc.). These tokens and features are fed into a CRF model, which predicts an appropriate label for each token. Afterwards, successive tokens with the same label are joined together, and the results are optionally normalized (e.g., punctuation is removed, author strings are parsed into individual names using a dedicated name parser, and so on).
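Since Ruby is the sticking point, here is a rough Python sketch of the steps just described. It is not AnyStyle's code: the feature names, the made-up reference, and the hand-written labels (standing in for what a trained CRF would predict) are all assumptions chosen for illustration.

```python
import re
from itertools import groupby


def tokenize(reference):
    # Whitespace split: each token is a word plus any attached punctuation.
    return reference.split()


def features(token):
    # Deterministic per-token features. The names are illustrative only,
    # not the feature set AnyStyle actually uses.
    core = token.strip(".,;:()[]\"'")
    return {
        "is_numeric": core.isdigit(),
        "has_trailing_punctuation": bool(re.search(r"[.,;:]$", token)),
        "is_capitalized": core[:1].isupper(),
        "looks_like_year": bool(re.fullmatch(r"(1[5-9]|20)\d{2}", core)),
    }


def join_segments(tokens, labels):
    # Merge runs of consecutive tokens that share the same label.
    pairs = zip(labels, tokens)
    return [
        (label, " ".join(tok for _, tok in group))
        for label, group in groupby(pairs, key=lambda pair: pair[0])
    ]


# A made-up reference, used only to exercise the functions.
reference = "Doe, J. 2001. A made-up title. Example Press."
tokens = tokenize(reference)

# Hand-written labels standing in for the CRF's predictions.
labels = ["author", "author", "date", "title", "title", "title",
          "publisher", "publisher"]

segments = join_segments(tokens, labels)
```

In the real pipeline the labels come from the trained model, which sees the features of neighbouring tokens as well, so the interesting work is in the training data rather than in the feature functions.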

The finder model (used for extracting references from full texts) works similarly, but here the tokens are full lines instead of individual words. Since references are typically grouped in sections at the end of chapters or of the entire document, the finder features are mostly concerned with finding such sections and ignoring meta information (such as page numbers, headers, and footers). Once reference sections are found, some deterministic, hacked-together code tries to normalize the references so that there is one reference per line; these can then be passed to the parser model.
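The line-level idea can be sketched the same way. Again, this is only an illustration under stated assumptions: the feature names, the heading pattern, and the rule "a line starting with a marker such as `[1]` opens a new reference" are hypothetical and much cruder than what the finder does (in AnyStyle the decision about which lines are references comes from the CRF model, not from a hard-coded rule).

```python
import re

# Hypothetical heading pattern; real documents use many more variants.
HEADING = re.compile(r"^\s*(references|bibliography|works cited)\s*$",
                     re.IGNORECASE)


def line_features(line):
    # Deterministic per-line features (illustrative names only).
    stripped = line.strip()
    return {
        "is_blank": stripped == "",
        "is_page_number": bool(re.fullmatch(r"\d{1,4}", stripped)),
        "is_section_heading": bool(HEADING.match(line)),
        "contains_year": bool(re.search(r"\b(1[5-9]|20)\d{2}\b", stripped)),
        "starts_with_marker": bool(re.match(r"^\s*(\[\d+\]|\d+\.)\s", line)),
    }


def one_reference_per_line(lines):
    # Crude normalization: skip blank lines and bare page numbers, start a
    # new reference at each marker, and treat anything else as a continuation
    # of the previous reference.
    refs = []
    for line in lines:
        feats = line_features(line)
        if feats["is_blank"] or feats["is_page_number"]:
            continue
        if feats["starts_with_marker"] or not refs:
            refs.append(line.strip())
        else:
            refs[-1] += " " + line.strip()
    return refs


# Made-up lines from a reference section, including a stray page number.
section = [
    "[1] Doe, J. 2001. A made-up",
    "title. Example Press.",
    "42",
    "[2] Roe, R. 1999. Another made-up title.",
]
references = one_reference_per_line(section)
```

Each string in `references` is then in the shape the parser model expects: one complete reference per line.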

Hope that helps!