inspirehep / refextract

Extract bibliographic references from (High-Energy Physics) articles.
GNU General Public License v2.0

authors: catastrophic backtracking in regex #26

Open jacquerie opened 7 years ago

jacquerie commented 7 years ago

How to reproduce:

>>> from refextract import extract_references_from_string
>>> extract_references_from_string('G. W. and L. B. and M. M. G. and T. A. and E. L. I. and E. P. and X. M. and B. Urbaszek, Magneto-optics in transition metal diselenide monolayers. 2D Mater. 2, 34002 (2015).')

This hangs refextract for days, at least.

The reason appears to be catastrophic backtracking in this regex: https://github.com/inspirehep/refextract/blob/27588da5611f34266fd54fdbf8784814fffa0e7b/refextract/authors/regexs.py#L491-L494.
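To illustrate the failure mode, here is a minimal sketch of catastrophic backtracking on a run of author initials. The patterns `AMBIGUOUS` and `LINEAR` and the helpers `failing_input` and `time_match` are hypothetical and written only for this example; they are not the regex linked above. The toy pattern nests one repetition inside another, so when the overall match fails the engine retries every way of splitting the initials into groups, and the work roughly doubles with each extra initial.

```python
import re
import time

# Toy pattern (an assumption for illustration, NOT the refextract regex):
# a repeated group inside another repeated group. A run of n initials can be
# split into groups in about 2**(n - 1) ways, and a backtracking engine tries
# all of them before it reports that the match failed.
AMBIGUOUS = re.compile(r'^(?:(?:[A-Z]\. ?)+)+$')

# Same language without the nested repetition: only one way to match.
LINEAR = re.compile(r'^(?:[A-Z]\. ?)+$')


def failing_input(n):
    """Return n initials followed by a character that forces a failed match."""
    return 'G. ' * n + '!'


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    # Keep n small: each extra initial roughly doubles the ambiguous timing.
    for n in (10, 14, 18):
        text = failing_input(n)
        _, slow = time_match(AMBIGUOUS, text)
        _, fast = time_match(LINEAR, text)
        print(f'n={n:2d}  ambiguous={slow:.4f}s  linear={fast:.6f}s')
```

Both patterns accept and reject the same strings; only the cost of rejecting differs. Removing the nested or overlapping repetition is one common way to avoid this kind of blow-up, though whether it applies directly to the linked regex is not something this sketch establishes.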

david-caro commented 7 years ago

This is the article that causes the issue; it should be reharvested once this is fixed: arXiv:1704.00841

kaplun commented 7 years ago

@tsgit are you by chance going to work on this issue in the near future? For the time being we have a workaround, but the approach you outlined in chat sounded way better than a workaround.

tsgit commented 7 years ago

@kaplun Yes, it's very high on my todo list. Unfortunately it got pushed back by AAHEP, vacation, surgery and some other business -- by next week!