inspirehep / refextract

Extract bibliographic references from (High-Energy Physics) articles.
GNU General Public License v2.0

Refextract fails to extract from two-columned layout pdf #85

Closed. Apurv3377 closed this issue 3 years ago

Apurv3377 commented 3 years ago

When the input PDF has a two-column layout, refextract returns an empty list of references.

from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1710.11035.pdf')
print(references[0])

When the input PDF has a single-column layout, refextract works fine.

from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1509.03588.pdf')
print(references[0])

How can I get refextract to parse both types of layout?

Thank you.

michamos commented 3 years ago

I don't think the issue is related to the layout; two-column layouts usually work just fine. Refextract is not meant to be a general-purpose reference extraction tool: it has been tuned to work well for High-Energy Physics and related fields, so if the citation styles are very different, it will run into trouble. In this case, I believe the cause is that the heading is called "Bibliographic references", which is not one of the expected headings: https://github.com/inspirehep/refextract/blob/24418cd2e31eae8e0d622f7afd2df9d7c34bfda3/refextract/references/regexs.py#L696-L710. Additionally, there are no markers at the beginning of each reference, so it might struggle to separate them. If you're looking for a general-purpose tool, I would look into https://github.com/kermitt2/grobid instead.
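To illustrate the two failure modes described above, here is a minimal, self-contained sketch. It is not refextract's code: `HEADING_RE`, `MARKER_RE`, `find_reference_heading` and `split_on_markers` are hypothetical names, and the patterns are deliberately simplified stand-ins for the much longer list in `regexs.py`. It only shows why an unexpected heading and missing markers can each leave an extractor with nothing usable.

```python
import re

# Hypothetical, simplified heading pattern. This is NOT the regex that
# refextract uses; it only accepts a line consisting of an expected title.
HEADING_RE = re.compile(r"^\s*(references|bibliography)\s*:?\s*$", re.IGNORECASE)

# Hypothetical marker pattern for numbered entries such as "[1]" or "1.".
MARKER_RE = re.compile(r"^\s*(?:\[\d+\]|\d+\.)\s+")


def find_reference_heading(lines):
    """Return the index of the first line that looks like a heading, or None."""
    for index, line in enumerate(lines):
        if HEADING_RE.match(line):
            return index
    return None


def split_on_markers(lines):
    """Group lines into references, starting a new one at each marker line."""
    references = []
    for line in lines:
        if MARKER_RE.match(line) or not references:
            references.append(line.strip())
        else:
            # No marker: treat the line as a continuation of the previous entry.
            references[-1] += " " + line.strip()
    return references


if __name__ == "__main__":
    # An expected heading is found; an unexpected one is not.
    print(find_reference_heading(["Intro", "References", "[1] A. Author"]))
    print(find_reference_heading(["Intro", "Bibliographic references", "A. Author"]))

    # With markers, entries separate cleanly; without them, they merge.
    print(split_on_markers(["[1] A. Author,", "Title.", "[2] B. Author, Title."]))
    print(split_on_markers(["A. Author, Title.", "B. Author, Title."]))
```

In this sketch, the section under "Bibliographic references" is never located, and unmarked entries collapse into a single string. Under these assumptions, that is consistent with the two-column paper yielding no references even though the layout itself is not the problem.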

Apurv3377 commented 3 years ago

Thanks for the prompt and detailed response. It also answers the other question I had. :)