inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

RefExtract: no hyphen between first and last page #3555

Open ksachs opened 5 years ago

ksachs commented 5 years ago

Just for the record. I don't know if this is an issue for labs.

Expected Behavior

If a reference has a line-break between the first page and the last page of a page-range (/\d+\-\n\d+/), keep the hyphen; don't delete it because of the line-break.

Current Behavior on legacy

If there is a line-break in the page-range, legacy swallows the hyphen and concatenates first- and last-page. E.g.

F.Graner and B.Dubrulle, Titius-Bode laws in the solar system : I.scale in-
variance explains everything, Astronomy and Astrophysics 282,262-268 (1994);
II.Build your own law from disk models, Astronomy and Astrophysics 282,269-
276 (1994).

-> Astron.Astrophys., 282,269276

michamos commented 5 years ago

It's a bug still present in refextract:

In [1]: from refextract import extract_references_from_string

In [2]: extract_references_from_string('''F.Graner and B.Dubrulle, Titius-Bode laws in the solar system : I.scale in-
   ...: variance explains everything, Astronomy and Astrophysics 282,262-268 (1994);
   ...: II.Build your own law from disk models, Astronomy and Astrophysics 282,269-
   ...: 276 (1994).''')
* could not find references section
* references separator ^[^\s]
* tags u'<cds.AUTHstnd>F.Graner and B.Dubrulle</cds.AUTHstnd>, Titius-Bode laws in the solar system : I.scale invariance explains everything, <cds.JOURNAL>Astron. Astrophys.</cds.JOURNAL> <cds.VOL>282</cds.VOL> <cds.YR>(1994)</cds.YR> <cds.PG>262-268</cds.PG>;'
* splitted_citations
  * line marker  
  * elements
    * AUTH {'auth_type': 'stnd', 'auth_txt': u'F.Graner and B.Dubrulle', 'type': 'AUTH', 'misc_txt': u''}
    * JOURNAL {'volume': u'282', 'is_ibid': False, 'title': u'Astron. Astrophys.', 'extra_ibids': [], 'year': u'1994', 'type': 'JOURNAL', 'misc_txt': u', Titius-Bode laws in the solar system : I.scale invariance explains everything, ', 'page': u'262-268'}
    * YEAR {'type': 'YEAR', 'misc_txt': '', 'year': u'1994'}
* tags u'II.Build your own law from disk models, <cds.JOURNAL>Astron. Astrophys.</cds.JOURNAL> <cds.VOL>282</cds.VOL> <cds.YR>(1994)</cds.YR> <cds.PG>269276</cds.PG>.'
* splitted_citations
  * line marker  
  * elements
    * JOURNAL {'volume': u'282', 'is_ibid': False, 'title': u'Astron. Astrophys.', 'extra_ibids': [], 'year': u'1994', 'type': 'JOURNAL', 'misc_txt': u'II.Build your own law from disk models, ', 'page': u'269276'}
    * YEAR {'type': 'YEAR', 'misc_txt': '', 'year': u'1994'}
Out[2]: 
[{'author': [u'F.Graner and B.Dubrulle'],
  'journal_page': [u'262-268'],
  'journal_reference': ['Astron. Astrophys. 282 (1994) 262-268'],
  'journal_title': [u'Astron. Astrophys.'],
  'journal_volume': [u'282'],
  'journal_year': [u'1994'],
  'misc': [u'Titius-Bode laws in the solar system : I.scale invariance explains everything'],
  'raw_ref': ['F.Graner and B.Dubrulle, Titius-Bode laws in the solar system : I.scale invariance explains everything, Astronomy and Astrophysics 282,262-268 (1994);'],
  'year': [u'1994']},
 {'journal_page': [u'269276'],
  'journal_reference': ['Astron. Astrophys. 282 (1994) 269276'],
  'journal_title': [u'Astron. Astrophys.'],
  'journal_volume': [u'282'],
  'journal_year': [u'1994'],
  'misc': [u'II.Build your own law from disk models'],
  'raw_ref': ['II.Build your own law from disk models, Astronomy and Astrophysics 282,269276 (1994).'],
  'year': [u'1994']}]

As you can see, it's already present in the raw_ref, so the bug must be somewhere in the text pre-processing in https://github.com/inspirehep/refextract/blob/master/refextract/references/text.py. I don't think we have the resources here to fix refextract bugs (it's very complex, nobody here knows how it works, and in the future we will probably switch, at least partially, to something like GROBID), but you're free to have a look if you want to.

ksachs commented 5 years ago

It should be possible to prevent that bug by deleting line-breaks in page-ranges before refextract handles page-breaks: fulltext = re.sub(r'(\d)-\n(\d)', r'\1-\2', fulltext)
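
For anyone who wants to try the substitution on its own, here is a minimal sketch applied to the example reference from this issue. The helper name `protect_page_ranges` and the constant `PAGE_RANGE_BREAK` are made up for illustration and are not part of refextract. Where such a step would hook into refextract's pre-processing is not settled in this thread, so the sketch only operates on a plain string.

```python
import re

# Hypothetical names, for illustration only; nothing here is refextract API.
# Matches a digit, a hyphen, a line break, and another digit, i.e. a page
# range that was wrapped right after its hyphen.
PAGE_RANGE_BREAK = re.compile(r'(\d)-\n(\d)')


def protect_page_ranges(fulltext):
    """Drop the line break inside a wrapped page range but keep the hyphen."""
    return PAGE_RANGE_BREAK.sub(r'\1-\2', fulltext)


sample = (
    "F.Graner and B.Dubrulle, Titius-Bode laws in the solar system : I.scale in-\n"
    "variance explains everything, Astronomy and Astrophysics 282,262-268 (1994);\n"
    "II.Build your own law from disk models, Astronomy and Astrophysics 282,269-\n"
    "276 (1994)."
)

cleaned = protect_page_ranges(sample)
print(cleaned)
```

Only the `269-` / `276` break is joined, giving `282,269-276`. The hyphenated word `in-` / `variance` is left untouched because there is no digit on either side of the break, so ordinary de-hyphenation can still handle it later. Note that the pattern would also join any other digit-hyphen-newline-digit sequence, not just page ranges.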