Closed: Apurv3377 closed this issue 3 years ago
I don't think the issue is related to the layout; a two-column layout should usually work just fine. Refextract is not meant to be a general-purpose reference extraction tool. It has been tuned to work well for High-Energy Physics and related fields, so if citation styles are very different, it will get into trouble. In this case, I believe the problem is that the heading is called "Bibliographic references", which is not expected: https://github.com/inspirehep/refextract/blob/24418cd2e31eae8e0d622f7afd2df9d7c34bfda3/refextract/references/regexs.py#L696-L710. Additionally, there are no markers at the beginning of each reference, so it might struggle to separate them. If you're looking for a general-purpose tool, I would look into https://github.com/kermitt2/grobid instead.
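To illustrate why an unexpected heading can lead to an empty result, here is a minimal sketch of title-based heading detection. It is not refextract's actual code: the title list, the regex, and the `find_reference_heading` helper are all made up for this example, and the real patterns are the ones in the `regexs.py` file linked above.

```python
import re

# Hypothetical list of accepted section titles (an assumption for this
# example, NOT the list refextract actually uses).
EXPECTED_TITLES = ("references", "bibliography", "literature cited")

# Optional section numbering such as "7." followed by one known title,
# which must be the only text on the line.
HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?(?:"
    + "|".join(re.escape(title) for title in EXPECTED_TITLES)
    + r")\s*:?\s*$",
    re.IGNORECASE,
)


def find_reference_heading(lines):
    """Return the index of the first line that looks like a reference heading, or None."""
    for index, line in enumerate(lines):
        if HEADING_RE.match(line):
            return index
    return None


standard = ["Conclusion", "References", "[1] A. Author, Some Title (2020)"]
unusual = ["Conclusion", "Bibliographic references", "A. Author, Some Title (2020)"]

print(find_reference_heading(standard))  # heading found at index 1
print(find_reference_heading(unusual))   # None: the heading is not in the list
```

With a detector like this, a document whose heading is not in the expected list never yields a reference section, so nothing downstream gets parsed, regardless of the column layout.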
Thanks for the prompt and detailed response. It also answers the other question I had. :)
When the input PDF has a two-column layout, refextract outputs an empty array of references.
When the input PDF has a one-column layout, refextract works fine.
How can I get refextract to parse both types of layout?
Thank you.