inspirehep / refextract

Extract bibliographic references from (High-Energy Physics) articles.
GNU General Public License v2.0

mmap: resizing not available #86

Closed StolkArjen closed 2 years ago

StolkArjen commented 2 years ago

Hi all,

Has anyone experienced the following issue in Python 3.8 with version 1.1.2 (on macOS), or know what's causing it?

from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1503.07589.pdf')

  File "/Users/xxx/anaconda3/lib/python3.8/site-packages/refextract/references/engine.py", line 1412, in clean_pdf_file
    mmfile.resize(end + offset - start)
SystemError: mmap: resizing not available--no mremap()

What's surprising is that on line 1412 of engine.py, the mmfile object does appear to have a resize() method. Unlike other methods such as flush(), however, calling it fails with the error above. I cannot find an mremap function anywhere either.

Thanks

StolkArjen commented 2 years ago

A related workaround: https://github.com/pyqtgraph/pyqtgraph/commit/a8d3aad97a4895b61b6cddd733f0cea4f82f38b1#
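
A minimal sketch of one possible fallback, not necessarily what the linked commit does: try mmap's resize() and, if the platform reports that resizing is unavailable, close the mapping, truncate the underlying file, and map it again. The helper name shrink_mapped_file and its arguments are hypothetical and not part of refextract.

```python
import mmap


def shrink_mapped_file(handle, mmfile, new_size):
    """Shrink a file-backed mapping to new_size bytes (hypothetical helper).

    `handle` is the file object opened in "r+b" mode that `mmfile` maps.
    Returns a mapping of the shrunken file; this may be a new mmap object.
    """
    try:
        # Works where the OS supports resizing a mapping (e.g. Linux mremap).
        mmfile.resize(new_size)
        return mmfile
    except (SystemError, OSError):
        # Assumed fallback for platforms without mremap(): write changes
        # back, drop the mapping, truncate the file, and map it again.
        mmfile.flush()
        mmfile.close()
        handle.truncate(new_size)
        return mmap.mmap(handle.fileno(), 0)
```

Because the fallback path returns a fresh mapping, callers should keep using the returned object rather than the one they passed in.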

tsgit commented 2 years ago

As far as I am concerned, the Python mmap module should handle OS dependencies, so strictly speaking this isn't a refextract issue.

Anyhow, in refextract mmap is simply used for memory- and space-efficient stripping of leading and/or trailing junk around a PDF file:

https://github.com/inspirehep/refextract/blob/master/refextract/references/engine.py#L1393-L1413

It isn't strictly necessary for the core functionality of extracting references from a clean PDF file, so you could just bypass it.
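
If the stripping is still wanted, a rough sketch of doing it without mmap is shown here. It assumes that "junk" means any bytes before the first %PDF- marker or after the last %%EOF marker; the function name strip_pdf_junk is hypothetical and this is not refextract's own code.

```python
PDF_HEADER = b"%PDF-"
PDF_TRAILER = b"%%EOF"


def strip_pdf_junk(path):
    """Trim bytes before the first %PDF- and after the last %%EOF, in place.

    Hypothetical helper. Returns True if the file was rewritten and
    False if it was left untouched.
    """
    with open(path, "rb") as handle:
        data = handle.read()

    start = data.find(PDF_HEADER)
    end = data.rfind(PDF_TRAILER)
    if start == -1 or end == -1 or end < start:
        return False  # markers not found; leave the file alone

    stop = end + len(PDF_TRAILER)
    if start == 0 and stop == len(data):
        return False  # nothing to strip

    with open(path, "wb") as handle:
        handle.write(data[start:stop])
    return True
```

This reads the whole file into memory, so it gives up the memory efficiency that mmap provides, in exchange for not depending on resize() being available.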

StolkArjen commented 2 years ago

I think you're right, this issue appears to be due to mmap. I've been going down the route you suggested, i.e. commenting out the resize operations, and so far so good. Thanks

StolkArjen commented 2 years ago

"bypassed"