Open ksachs opened 5 years ago
It's a bug still present in refextract:
In [1]: from refextract import extract_references_from_string
In [2]: extract_references_from_string('''F.Graner and B.Dubrulle, Titius-Bode laws in the solar system : I.scale in-
...: variance explains everything, Astronomy and Astrophysics 282,262-268 (1994);
...: II.Build your own law from disk models, Astronomy and Astrophysics 282,269-
...: 276 (1994).''')
* could not find references section
* references separator ^[^\s]
* tags u'<cds.AUTHstnd>F.Graner and B.Dubrulle</cds.AUTHstnd>, Titius-Bode laws in the solar system : I.scale invariance explains everything, <cds.JOURNAL>Astron. Astrophys.</cds.JOURNAL> <cds.VOL>282</cds.VOL> <cds.YR>(1994)</cds.YR> <cds.PG>262-268</cds.PG>;'
* splitted_citations
* line marker
* elements
* AUTH {'auth_type': 'stnd', 'auth_txt': u'F.Graner and B.Dubrulle', 'type': 'AUTH', 'misc_txt': u''}
* JOURNAL {'volume': u'282', 'is_ibid': False, 'title': u'Astron. Astrophys.', 'extra_ibids': [], 'year': u'1994', 'type': 'JOURNAL', 'misc_txt': u', Titius-Bode laws in the solar system : I.scale invariance explains everything, ', 'page': u'262-268'}
* YEAR {'type': 'YEAR', 'misc_txt': '', 'year': u'1994'}
* tags u'II.Build your own law from disk models, <cds.JOURNAL>Astron. Astrophys.</cds.JOURNAL> <cds.VOL>282</cds.VOL> <cds.YR>(1994)</cds.YR> <cds.PG>269276</cds.PG>.'
* splitted_citations
* line marker
* elements
* JOURNAL {'volume': u'282', 'is_ibid': False, 'title': u'Astron. Astrophys.', 'extra_ibids': [], 'year': u'1994', 'type': 'JOURNAL', 'misc_txt': u'II.Build your own law from disk models, ', 'page': u'269276'}
* YEAR {'type': 'YEAR', 'misc_txt': '', 'year': u'1994'}
Out[2]:
[{'author': [u'F.Graner and B.Dubrulle'],
'journal_page': [u'262-268'],
'journal_reference': ['Astron. Astrophys. 282 (1994) 262-268'],
'journal_title': [u'Astron. Astrophys.'],
'journal_volume': [u'282'],
'journal_year': [u'1994'],
'misc': [u'Titius-Bode laws in the solar system : I.scale invariance explains everything'],
'raw_ref': ['F.Graner and B.Dubrulle, Titius-Bode laws in the solar system : I.scale invariance explains everything, Astronomy and Astrophysics 282,262-268 (1994);'],
'year': [u'1994']},
{'journal_page': [u'269276'],
'journal_reference': ['Astron. Astrophys. 282 (1994) 269276'],
'journal_title': [u'Astron. Astrophys.'],
'journal_volume': [u'282'],
'journal_year': [u'1994'],
'misc': [u'II.Build your own law from disk models'],
'raw_ref': ['II.Build your own law from disk models, Astronomy and Astrophysics 282,269276 (1994).'],
'year': [u'1994']}]
As you can see, the missing hyphen is already present in the raw_ref, so the bug must be somewhere in the text pre-processing in https://github.com/inspirehep/refextract/blob/master/refextract/references/text.py. I don't think we have the resources here to fix refextract bugs (it's very complex, nobody here knows how it works, and in the future we will probably switch, at least partially, to something like GROBID), but you're free to have a look if you want to.
It should be possible to prevent that bug by deleting line breaks in page ranges
before refextract handles page-breaks:
fulltext = re.sub(r'(\d)-\n(\d)', r'\1-\2', fulltext)
Just for the record: I don't know if this is an issue for labs.
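The pre-processing workaround suggested in this comment can be sketched as a small standalone helper. This is only an illustration under assumptions: the name `rejoin_page_ranges` is hypothetical and not part of refextract, and the sample string is taken from the second half of the reference in the transcript. It shows only what the regex does to the text before extraction; it has not been checked against refextract itself.

```python
import re

# Regex from the comment above: a digit, a hyphen, a line break, a digit.
_PAGE_RANGE_BREAK = re.compile(r'(\d)-\n(\d)')


def rejoin_page_ranges(fulltext):
    """Drop a line break that directly follows a hyphen between two digits.

    Hypothetical helper (not a refextract function). Hyphenated words such
    as "in-\\nvariance" are left alone because no digits are involved.
    """
    return _PAGE_RANGE_BREAK.sub(r'\1-\2', fulltext)


sample = (
    "II.Build your own law from disk models, "
    "Astronomy and Astrophysics 282,269-\n276 (1994)."
)
cleaned = rejoin_page_ranges(sample)
print(cleaned)  # the page range now reads "269-276" on a single line
```

The cleaned text would then be handed to `extract_references_from_string` as in the transcript. The regex also joins any other digit-hyphen-linebreak-digit sequence, which is presumably acceptable inside a reference list but may be too broad for arbitrary full text.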
Expected Behavior
If there is a line-break in references between first page and last page (/\d+\-\n\d+) keep the hyphen, don't delete it due to the line-break.
Current Behavior on legacy
If there is a line-break in the page-range, legacy swallows the hyphen and concatenates first- and last-page. E.g.
-> Astron.Astrophys., 282,269276