inukshuk / anystyle

Fast citation reference parsing
https://anystyle.io

Improve performance in finding in-text references #166

Open alexander-winkler opened 3 years ago

alexander-winkler commented 3 years ago

Hello! I would like to analyse bibliographic references in a specific journal (so presumably a rather homogeneous citation style) over several years. There is no separate bibliography section; the references are in the footnotes and often embedded in some context. For my use case it is important to detect as many bibliographic entries as possible, whereas I am less concerned about parsing precision. The performance of the default model is uneven:

The CLI misses, e.g. this one: github_issue_01

Corresponding pdftotext output:

10
Maya Mikdashi, “Sex and Sectarianism: The Legal Architecture of Lebanese Citizenship,” Comparative
Studies of South Asia, Africa, and the Middle East 34 (2014): 279–93.

But detects (and parses) these: github_issue_02

Corresponding pdftotext output:

32
See Ruba Salih, “Bodies That Walk, Bodies That Talk, Bodies That Love,” Antipode: A Radical Journal of
Geography 49 (2016): 742–60; Dina Georgis, The Better Story: Queer Affects from the Middle East (New York:
SUNY Press, 2013); Hanadi Al-Samman and Tarik El-Ariss, “Queer Affects: Introduction,” International
Journal of Middle East Studies 45 (2013): 205–9.
33
Morrison et al, Critical Geographies, 507.
34

Is there any way to increase recall? If this can be done by training on annotated material, could anybody tell me what the annotation of the ttx files should look like? The examples in res/finder are, if I haven't missed anything, mostly not in-text, but rather one reference per line.

Thanks a lot!

inukshuk commented 3 years ago

References in footnotes are indeed a sore spot of the finder model. The default model contains a few publications with footnotes (due to copyright issues the ttx files in the repository here are not complete), but I believe it could do sufficiently well for styles that have full references in the footnotes. If you have a given citation style, you could tweak the feature set of the finder model (that is, the characteristics that are extracted from each line) to give more weight to, for example, the 'See' at the beginning of a reference or the quotation marks. When using the CLI, it would also be important to use the --solo switch.
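
To make the idea of line-level features a bit more tangible, here is a minimal sketch of the kind of cues mentioned above (a leading 'See', a quoted title), plus two further cues (a year, a page range) that are my own additions. It is only an illustration: AnyStyle itself is written in Ruby, and the function name `line_features`, the feature names, and the regular expressions are all hypothetical and do not reflect the finder model's actual feature set.

```python
import re

# Hypothetical cue patterns for footnote references.
# These are illustrative only and are NOT AnyStyle's actual features.
LEADING_CUE = re.compile(r'^\s*(?:\d+\s+)?(?:see(?:\s+also)?|cf|ibid)\b', re.IGNORECASE)
QUOTED_TITLE = re.compile(r'[“"][^”"]+[”"]')
YEAR = re.compile(r'\b(?:1[5-9]|20)\d{2}\b')
PAGE_RANGE = re.compile(r'\b\d+\s*[–-]\s*\d+\b')


def line_features(line: str) -> dict:
    """Return boolean cues for a single line of pdftotext output."""
    return {
        "leading_cue": bool(LEADING_CUE.search(line)),    # 'See', 'Cf', 'Ibid' at line start
        "quoted_title": bool(QUOTED_TITLE.search(line)),  # article title in quotation marks
        "year": bool(YEAR.search(line)),                  # four-digit year, 1500-2099
        "page_range": bool(PAGE_RANGE.search(line)),      # e.g. 279–93
    }


if __name__ == "__main__":
    # Lines taken from the footnotes quoted in this issue.
    samples = [
        "Maya Mikdashi, “Sex and Sectarianism: The Legal Architecture of Lebanese Citizenship,” Comparative",
        "Studies of South Asia, Africa, and the Middle East 34 (2014): 279–93.",
        "Morrison et al, Critical Geographies, 507.",
    ]
    for sample in samples:
        print(line_features(sample), sample[:40])
```

Note that a short citation such as footnote 33 triggers none of these cues, so a style that relies heavily on short forms would probably need additional features or more training data.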

It's more problematic if, as is often the case, the footnotes contain more free-form text. For instance, something like: 'See Ruba Salih, [...], who also points out this fact. For an opposing view, see [...].' These are difficult because the finder model operates on full-line tokens: while references can span multiple lines, it assumes that a given line is either mostly a reference or mostly other text, not both. I'm not sure if that's what you mean by 'in-text' references; the examples above look like you have a single reference per footnote, which should be more practical given the current model.

I'm afraid that's not very concrete advice. What I would suggest is to train on 5-10 articles in order to evaluate whether the results are only a few tweaks away from success or whether another model or approach is needed. This is pretty time-consuming, but you can start by parsing and saving as ttx first, then correcting the ttx and creating a new model from it.

I'm linking this to a previous discussion at #129 that might be of interest.