Open alexander-winkler opened 3 years ago
References in footnotes are indeed a sore spot of the finder model. The default model includes a few publications with footnotes (due to copyright issues, the ttx files in this repository are not complete), and I believe it could do sufficiently well for styles that have full references in the footnotes. If you have a given citation style, you could tweak the feature set of the finder model (that is, the characteristics that are extracted from each line) to give more weight to, for example, the 'See' at the beginning of a reference or the quotation marks. When using the CLI, it is also important to use the `--solo` switch.
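To make the idea of line-level features more concrete, here is a minimal sketch in Python (not AnyStyle's own Ruby code). The function `line_features`, the feature names, and the regular expressions are all hypothetical illustrations of the kind of cues mentioned above; they are not part of the finder model's actual feature set.

```python
import re

# Hypothetical patterns for illustration only; none of these come from AnyStyle.
FOOTNOTE_MARK = re.compile(r"^\s*\d{1,3}[.)]?\s+")        # leading footnote number
CUE = re.compile(r"^(see|cf\.)\s", re.IGNORECASE)         # 'See ...' / 'Cf. ...'
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")           # plausible publication year
PAGES = re.compile(r"\bpp?\.\s*\d+")                      # 'p. 45' or 'pp. 45'
QUOTE_CHARS = "\"'\u201c\u201d\u2018\u2019\u00ab\u00bb"   # straight, curly, guillemets


def line_features(line):
    """Return a dict of boolean cues for a single line of text."""
    # Drop a leading footnote number so that 'See' is found at the start.
    text = FOOTNOTE_MARK.sub("", line).strip()
    return {
        "starts_with_cue": bool(CUE.match(text)),
        # Require two quote characters so a lone apostrophe does not count.
        "has_quotes": sum(ch in QUOTE_CHARS for ch in text) >= 2,
        "has_year": bool(YEAR.search(text)),
        "has_pages": bool(PAGES.search(text)),
    }


if __name__ == "__main__":
    print(line_features("12 See Ruba Salih, 'Title', 2003, p. 45."))
```

Cues like these only help if they are reasonably specific to the journal's citation style, so it is worth checking them against a handful of real footnotes before relying on them.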
It's more problematic if, as is often the case, the footnotes contain more free-form text. For instance, something like: 'See Ruba Salih, [...], who also points out this fact. For an opposing view, see [...].' These are difficult because the finder model operates on full-line tokens, and while references can span multiple lines, it assumes that a given line is either mostly a reference or mostly other text, not both. I'm not sure if that's what you mean by 'in-text' references; the examples above look like you have a single reference per footnote, which should be more practical given the current model.
I'm afraid that's not very concrete advice. What I would suggest is to train on 5-10 articles in order to evaluate whether the results are only a few tweaks away from success or whether another model or approach is needed. This is pretty time-consuming, but you can start by parsing and saving as ttx first, then making changes to the ttx and creating a new model with it.
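Since recall is what matters for this use case, one simple way to judge whether a retrained model is 'a few tweaks away' is to compare its line labels against hand-corrected ones. The sketch below is plain Python and independent of AnyStyle; the function name `line_recall` and the label strings `ref` and `text` are assumptions made for illustration, and extracting the per-line labels from the ttx files is left out.

```python
def line_recall(gold, predicted, label="ref"):
    """Fraction of gold reference lines that the model also labelled as references.

    `gold` and `predicted` are equal-length lists of per-line labels.
    Returns None when the gold data contains no reference lines.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same lines")
    relevant = [i for i, g in enumerate(gold) if g == label]
    if not relevant:
        return None
    hits = sum(1 for i in relevant if predicted[i] == label)
    return hits / len(relevant)


if __name__ == "__main__":
    # Toy labels: three reference lines in the gold data, two of them found.
    gold = ["text", "ref", "ref", "text", "ref"]
    predicted = ["text", "ref", "text", "text", "ref"]
    print(line_recall(gold, predicted))
```

Tracking this number per article before and after each change to the training data shows fairly quickly whether the additional annotation is paying off.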
I'm linking this to a previous discussion at #129 that might be of interest.
Hello! I would like to analyse bibliographic references in a specific journal (so presumably a rather homogeneous citation style) over several years. There is no separate bibliography section; instead, references are in the footnotes and often embedded in some context. For my use case it is important to detect as many bibliographic entries as possible, whereas I am less concerned about parsing precision. The performance of the default model is uneven:
The CLI misses, e.g., this one:
Corresponding `pdftotext` output:
But detects (and parses) these:
Corresponding `pdftotext` output:
Is there any way to increase recall? If this can be done by training on annotated material, could anybody tell me what the annotation of the ttx should look like? The examples in `res/finder` are, if I haven't missed anything, mostly not in-text, but rather one reference per line.
Thanks a lot!