CRF / precision-recall?

AnyStyle was inspired by [this ParsCit paper]. The idea is to segment each reference into tokens (basically words plus punctuation) and extract a number of 'features' for each token. Feature extraction is fully deterministic and for the most part trivial (e.g., 'the token is numeric', 'the token has trailing punctuation', etc.). The tokens and their features are fed into a CRF model, which predicts an appropriate label for each token. Afterwards, successive tokens with the same label are joined together, and the results are optionally normalized (e.g., punctuation is removed, author strings are parsed into individual names using a dedicated name parser, and so on).
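To make the pipeline above a bit more concrete, here is a minimal Python sketch of the two deterministic steps around the model: feature extraction and joining successive tokens that share a label. This is not AnyStyle's code (AnyStyle is written in Ruby and uses many more features); the function names `tokenize`, `features`, and `join_labels` are made up for illustration, and the labels are written by hand where a trained CRF model would normally predict them.

```python
from itertools import groupby


def tokenize(reference):
    # Split on whitespace so punctuation stays attached to its word.
    return reference.split()


def features(token):
    # A few trivial, deterministic features (illustrative, not AnyStyle's set).
    core = token.strip(".,;:()[]\"'")
    return {
        "is_numeric": core.isdigit(),
        "has_trailing_punct": token[-1] in ".,;:",
        "is_capitalized": core[:1].isupper(),
        "length": len(token),
    }


def join_labels(tokens, labels):
    # Merge runs of successive tokens that carry the same label.
    segments = []
    for label, group in groupby(zip(labels, tokens), key=lambda pair: pair[0]):
        segments.append((label, " ".join(token for _, token in group)))
    return segments


reference = "Knuth, D. E. 1984. The TeXbook. Addison-Wesley."
tokens = tokenize(reference)
feature_rows = [features(token) for token in tokens]

# Assumption: these labels are hand-written stand-ins for the CRF's output.
labels = ["author", "author", "author", "date", "title", "title", "publisher"]
segments = join_labels(tokens, labels)

for label, text in segments:
    print(f"{label}: {text}")
```

In a real setup, `feature_rows` would be what the CRF sees, and normalization (stripping the trailing punctuation, parsing the author string into names) would run on `segments` afterwards.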

The finder model (used for extracting references from full texts) works similarly, but here the tokens are full lines instead of individual words. Since references are typically grouped in sections at the end of chapters or of the entire document, the finder features are mostly concerned with finding such sections and ignoring meta information (such as page numbers, headers, and footers). Once the reference sections are found, some deterministic, hacked-together code tries to normalize the references so that there is one reference per line; these can then be passed to the parser model.
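As a rough illustration of the line-level idea, the Python sketch below computes a few features per line and then merges wrapped lines so that each reference ends up on one line. Everything in it is an assumption made for the example: the feature names, the `REF_START` and `PAGE_NUMBER` patterns, and the rule that a new reference starts with a marker such as `[1]` or `1.`. In the actual finder a CRF model labels the lines; here simple rules stand in for it.

```python
import re

# Assumption: references start with a marker such as "[1]" or "1.".
REF_START = re.compile(r"^\s*(\[\d+\]|\d+\.)\s+")
# Assumption: a line holding only digits is a stray page number.
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def line_features(line):
    # Deterministic per-line features (illustrative only).
    return {
        "is_blank": not line.strip(),
        "is_page_number": bool(PAGE_NUMBER.match(line)),
        "is_heading": line.strip().lower() in {"references", "bibliography"},
        "starts_reference": bool(REF_START.match(line)),
    }


def one_reference_per_line(lines):
    # Drop meta lines and join continuation lines onto the previous reference.
    refs = []
    for line in lines:
        f = line_features(line)
        if f["is_blank"] or f["is_page_number"] or f["is_heading"]:
            continue
        if f["starts_reference"] or not refs:
            refs.append(line.strip())
        else:
            refs[-1] += " " + line.strip()
    return refs


section = [
    "References",
    "[1] Knuth, D. E. 1984. The TeXbook.",
    "    Addison-Wesley.",
    "42",
    "[2] Lamport, L. 1994. LaTeX: A Document",
    "    Preparation System. Addison-Wesley.",
]
refs = one_reference_per_line(section)

for ref in refs:
    print(ref)
```

Each entry in `refs` is then a single-line reference of the kind the parser model expects as input.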

Hope that helps!
