inspirehep / refextract

Extract bibliographic references from (High-Energy Physics) articles.
GNU General Public License v2.0

clean_pdf_file throws SystemError on MacOS with mmap: resizing not available #103

Open Hu1buerger opened 1 year ago

Hu1buerger commented 1 year ago

When running extract_references_from_file(path) on this file, Kotti et al. - 2023 - Machine Learning for Software Engineering A Terti.pdf, on macOS Ventura 13.3.1, the following exception is thrown.


    def clean_pdf_file(filename):
        """
        strip leading and/or trailing junk from a PDF file
        """
        with open(filename, 'r+b') as file, mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_WRITE) as mmfile:
            start = mmfile.find(b'%PDF-')
            if start == -1:
                # no PDF marker found
                LOGGER.debug('not a PDF file')
                return
            end = mmfile.rfind(b'%%EOF')
            offset = len(b'%%EOF')
            if start > 0:
                LOGGER.debug('moving and truncating')
                mmfile.move(0, start, end + offset - start)
                #mmfile.resize(end + offset - start)
                mmfile.flush()
            elif end > 0 and end + offset != mmfile.size():
                LOGGER.debug('truncating only')
>               mmfile.resize(end + offset - start)
E               SystemError: mmap: resizing not available--no mremap()

../venv/lib/python3.10/site-packages/refextract/references/engine.py:1412: SystemError

StevenZhang0116 commented 11 months ago

Same problem here. Do you have any idea how to solve this?
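
A possible workaround, offered as a minimal sketch rather than the maintainers' fix: the traceback shows `mmap.resize()` failing because the platform has no `mremap()`, so the truncation can instead be done on the underlying file object with `file.truncate()` once the mapping is closed. The function name `clean_pdf_file_portable`, the empty-file guard, and the handling of a missing `%%EOF` marker are assumptions made for this illustration and are not part of refextract.

```python
import logging
import mmap
import os

LOGGER = logging.getLogger(__name__)
EOF_MARKER = b'%%EOF'


def clean_pdf_file_portable(filename):
    """Strip leading and/or trailing junk from a PDF file.

    Hypothetical variant of clean_pdf_file that never calls mmap.resize(),
    which is unavailable on platforms without mremap().
    """
    if os.path.getsize(filename) == 0:
        # Assumption: an empty file is left alone (mmap cannot map it).
        return
    new_size = None
    with open(filename, 'r+b') as file:
        with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_WRITE) as mmfile:
            start = mmfile.find(b'%PDF-')
            if start == -1:
                # no PDF marker found
                LOGGER.debug('not a PDF file')
                return
            end = mmfile.rfind(EOF_MARKER)
            if end == -1:
                # Assumption: without an EOF marker, keep everything
                # from the PDF header to the end of the file.
                length = mmfile.size() - start
            else:
                length = end + len(EOF_MARKER) - start
            if start > 0:
                LOGGER.debug('moving leading junk out of the way')
                mmfile.move(0, start, length)
                mmfile.flush()
            if length != mmfile.size():
                new_size = length
        # The mapping is closed here, so the file can be shortened
        # with an ordinary truncate instead of mmap.resize().
        if new_size is not None:
            LOGGER.debug('truncating to %d bytes', new_size)
            file.truncate(new_size)
```

Until something like this lands upstream, one way to try it is to run the function on a copy of the PDF and check that the copy starts with `%PDF-` and ends with `%%EOF`.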