jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Segmentation Fault in running tests #869

Closed. petermr closed this issue 1 year ago

petermr commented 1 year ago

I cloned the latest pdfplumber and ran the tests (macOS). I get a segfault:

git clone https://github.com/jsvine/pdfplumber.git
Cloning into 'pdfplumber'...
remote: Enumerating objects: 2756, done.
remote: Counting objects: 100% (754/754), done.
remote: Compressing objects: 100% (232/232), done.
remote: Total 2756 (delta 565), reused 593 (delta 520), pack-reused 2002
Receiving objects: 100% (2756/2756), 16.46 MiB | 7.16 MiB/s, done.
Resolving deltas: 100% (1785/1785), done.
(base) pm286macbook-2:pdfplumber1 pm286$ cd pdfplumber/

(base) pm286macbook-2:pdfplumber pm286$ python -m unittest discover tests
E.EEE...EEEEE.EEEEEEEE.EEEEEEEEEEE.EE/opt/anaconda3/lib/python3.8/site-packages/pdfminer/psparser.py:590: ResourceWarning: unclosed file <_io.BufferedReader name='/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/pdfs/pdffill-demo.pdf'>
  objs = [obj for (_, obj) in self.curstack]
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.EEE/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/test_display.py:62: ResourceWarning: unclosed file <_io.BufferedReader name='/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/pdfs/nics-background-checks-2015-11.pdf'>
  page = pdfplumber.PDF(io.BytesIO(open(path, "rb").read())).pages[0]
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Segmentation fault: 11

(I am using earlier versions of pdfplumber, so I may have incompatible libraries installed.) Many thanks.

petermr commented 1 year ago

Update: I have pip-installed the latest pdfplumber and get similar errors, but no segfault:


(fails 62/118 tests)

(base) pm286macbook-2:pdfplumber pm286$ python -m unittest discover tests
E.EEE...EEEEE.EEEEEEEE.EEEEEEEEEEE.EE/opt/anaconda3/lib/python3.8/site-packages/pdfminer/psparser.py:592: ResourceWarning: unclosed file <_io.BufferedReader name='/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/pdfs/pdffill-demo.pdf'>
  objs = [obj for (_, obj) in self.curstack]
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.EEE/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/test_display.py:62: ResourceWarning: unclosed file <_io.BufferedReader name='/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/pdfs/nics-background-checks-2015-11.pdf'>
  page = pdfplumber.PDF(io.BytesIO(open(path, "rb").read())).pages[0]
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.EE/Users/pm286/workspace/pdfplumber1/pdfplumber/pdfplumber/utils/pdfinternals.py:74: ResourceWarning: unclosed file <_io.BufferedReader name='/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/pdfs/issue-71-duplicate-chars-2.pdf'>
  return type(x)(resolve_all(v) for v in x)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.EEEEE..../opt/anaconda3/lib/python3.8/site-packages/pdfminer/converter.py:218: ResourceWarning: unclosed file <_io.BufferedReader name='/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/../examples/pdfs/ag-energy-round-up-2017-02-24.pdf'>
  item = LTChar(
ResourceWarning: Enable tracemalloc to get the object allocation traceback
................E.EE.EEEEE..E....E......E./Users/pm286/workspace/pdfplumber1/pdfplumber/tests/test_utils.py:203: ResourceWarning: unclosed file <_io.TextIOWrapper name='/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/comparisons/scotus-transcript-p1.txt' mode='r' encoding='UTF-8'>
  open(os.path.join(HERE, "comparisons/scotus-transcript-p1.txt"))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
E/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/test_utils.py:217: ResourceWarning: unclosed file <_io.TextIOWrapper name='/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/comparisons/scotus-transcript-p1-cropped.txt' mode='r' encoding='UTF-8'>
  open(os.path.join(HERE, "comparisons/scotus-transcript-p1-cropped.txt"))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
EEEE....E....EEEEE...
======================================================================
ERROR: test_annots (test_basics.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/test_basics.py", line 52, in test_annots
    pdf = self.pdf_2
AttributeError: 'Test' object has no attribute 'pdf_2'

======================================================================
ERROR: test_colors (test_basics.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/test_basics.py", line 152, in test_colors
    rect = self.pdf.pages[0].rects[0]
AttributeError: 'Test' object has no attribute 'pdf'

======================================================================
ERROR: test_crop_and_filter (test_basics.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pm286/workspace/pdfplumber1/pdfplumber/tests/test_basics.py", line 67, in test_crop_and_filter
    original = self.pdf.pages[0]
AttributeError: 'Test' object has no attribute 'pdf'
...

petermr commented 1 year ago

I ran the tests in PyCharm and all but one pass, so I think it's a Python library problem on my side and not worth spending time on.

jsvine commented 1 year ago

Hi, and thanks for your interest in this library, especially to the point of running tests. pdfplumber, however, uses pytest rather than unittest. You can run the tests via python -m pytest or make tests. Do you still get a segfault when you run that?

petermr commented 1 year ago

Thanks for the ultra-speedy response! I'll try pytest.

petermr commented 1 year ago

All except one pass (same as in PyCharm). The failure looks like one of those fragile numbers that depend on "hidden variables" and vary between setups:

(base) pm286macbook-2:pdfplumber pm286$ python -m pytest
================================================= test session starts ==================================================
platform darwin -- Python 3.8.3, pytest-7.1.2, pluggy-0.13.1
rootdir: /Users/pm286/workspace/pdfplumber1/pdfplumber, configfile: setup.cfg
plugins: cov-3.0.0
collected 118 items                                                                                                    

tests/test_basics.py .................                                                                           [ 14%]
tests/test_ca_warn_report.py .....                                                                               [ 18%]
tests/test_convert.py ............                                                                               [ 28%]
tests/test_ctm.py .                                                                                              [ 29%]
tests/test_dedupe_chars.py ....                                                                                  [ 33%]
tests/test_display.py F..........                                                                                [ 42%]
tests/test_issues.py ....................                                                                        [ 59%]
tests/test_laparams.py ....                                                                                      [ 62%]
tests/test_list_metadata.py .                                                                                    [ 63%]
tests/test_nics_report.py .....                                                                                  [ 67%]
tests/test_table.py ...........                                                                                  [ 77%]
tests/test_utils.py ...........................                                                                  [100%]

======================================================= FAILURES =======================================================
_________________________________________________ Test.test__repr_png_ _________________________________________________

self = <test_display.Test testMethod=test__repr_png_>

    def test__repr_png_(self):
        png = self.im._repr_png_()
        assert isinstance(png, bytes)
>       assert len(png) in (
            71939,
            61247,
        )  # PNG encoder seems to work differently on different setups
E       AssertionError: assert 71983 in (71939, 61247)
E        +  where 71983 = len(b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x03\xf0\x00\x00\x02d\x08\x02\x00\x00\x009\xbd]\xb8\x00\x01\x00\x00IDATx\...0\x00\x00\x00\x00@\x83AB\x0f\x00\x00\x00\x00\x00@\x83\xf9\xff\xc3\xf3\xbd\\\xff\x1e8\x11\x00\x00\x00\x00IEND\xaeB`\x82')

tests/test_display.py:93: AssertionError
=================================================== warnings summary ===================================================
../../../../../opt/anaconda3/lib/python3.8/site-packages/numexpr/expressions.py:21
../../../../../opt/anaconda3/lib/python3.8/site-packages/numexpr/expressions.py:21
  /opt/anaconda3/lib/python3.8/site-packages/numexpr/expressions.py:21: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    _np_version_forbids_neg_powint = LooseVersion(numpy.__version__) >= LooseVersion('1.12.0b1')

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

---------- coverage: platform darwin, python 3.8.3-final-0 -----------
Name                               Stmts   Miss  Cover
------------------------------------------------------
pdfplumber/__init__.py                 7      0   100%
pdfplumber/_typing.py                  8      0   100%
pdfplumber/_version.py                 2      0   100%
pdfplumber/cli.py                     34      0   100%
pdfplumber/container.py              112      0   100%
pdfplumber/convert.py                 56      0   100%
pdfplumber/ctm.py                     27      0   100%
pdfplumber/display.py                164      0   100%
pdfplumber/page.py                   255      0   100%
pdfplumber/pdf.py                     88      0   100%
pdfplumber/table.py                  321      0   100%
pdfplumber/utils/__init__.py           5      0   100%
pdfplumber/utils/clustering.py        36      0   100%
pdfplumber/utils/generic.py           11      0   100%
pdfplumber/utils/geometry.py         128      0   100%
pdfplumber/utils/pdfinternals.py      48      0   100%
pdfplumber/utils/text.py             230      0   100%
------------------------------------------------------
TOTAL                               1532      0   100%
Coverage XML written to file coverage.xml

=============================================== short test summary info ================================================
FAILED tests/test_display.py::Test::test__repr_png_ - AssertionError: assert 71983 in (71939, 61247)
====================================== 1 failed, 117 passed, 2 warnings in 22.68s ======================================
(base) pm286macbook-2:pdfplumber pm286$

jsvine commented 1 year ago

Ah, very interesting, and thanks for sharing this. It seems related to how PNGs are encoded on different platforms. I just pushed a fix. It should hopefully pass if you pull again and rerun.

petermr commented 1 year ago

FWIW I do quite a lot with images and find that things like byte counts can vary between runs. I often write things like:

assert 71950 > len(png) > 71930
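Another option, sketched here as an assumption and not as the fix that was pushed, is to check properties that do not depend on the encoder at all, such as the PNG signature and the image dimensions stored in the IHDR chunk. The helper name `png_dimensions` is hypothetical; the sample bytes are the start of the PNG shown in the failure output above.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Return (width, height) read from the IHDR chunk of a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG stream")
    # The first chunk follows the signature: 4-byte length, 4-byte type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("missing or malformed IHDR chunk")
    # IHDR data starts with big-endian width and height, 4 bytes each.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# First bytes of the PNG from the pytest failure output in this thread.
header = (
    b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"
    b"\x00\x00\x03\xf0\x00\x00\x02d\x08\x02\x00\x00\x00"
)
print(png_dimensions(header))  # (1008, 612)
```

A test could then assert on the returned dimensions, which should stay stable across platforms even when compression settings change the byte count.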

BTW I am excited by how much has been added since 0.7.4 and I'm about to get familiar with it. We have a small team converting the UN IPCC reports (ca. 10,000 pages of PDF) and the new features will be really useful. I'm impressed with the amount of table extraction. It's all open source, and volunteers are very welcome!

petermr commented 1 year ago

Now works!

=================================================== warnings summary ===================================================
../../../../../opt/anaconda3/lib/python3.8/site-packages/numexpr/expressions.py:21
../../../../../opt/anaconda3/lib/python3.8/site-packages/numexpr/expressions.py:21
  /opt/anaconda3/lib/python3.8/site-packages/numexpr/expressions.py:21: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    _np_version_forbids_neg_powint = LooseVersion(numpy.__version__) >= LooseVersion('1.12.0b1')

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

---------- coverage: platform darwin, python 3.8.3-final-0 -----------
Name                               Stmts   Miss  Cover
------------------------------------------------------
pdfplumber/__init__.py                 7      0   100%
pdfplumber/_typing.py                  8      0   100%
pdfplumber/_version.py                 2      0   100%
pdfplumber/cli.py                     34      0   100%
pdfplumber/container.py              112      0   100%
pdfplumber/convert.py                 56      0   100%
pdfplumber/ctm.py                     27      0   100%
pdfplumber/display.py                164      0   100%
pdfplumber/page.py                   255      0   100%
pdfplumber/pdf.py                     88      0   100%
pdfplumber/table.py                  321      0   100%
pdfplumber/utils/__init__.py           5      0   100%
pdfplumber/utils/clustering.py        36      0   100%
pdfplumber/utils/generic.py           11      0   100%
pdfplumber/utils/geometry.py         128      0   100%
pdfplumber/utils/pdfinternals.py      48      0   100%
pdfplumber/utils/text.py             230      0   100%
------------------------------------------------------
TOTAL                               1532      0   100%
Coverage XML written to file coverage.xml

=========================================== 118 passed, 2 warnings in 22.03s ===========================================
(base) pm286macbook-2:pdfplumber pm286$ 

BTW congratulations not only on what you have done but also on building a responsible user community.

jsvine commented 1 year ago

Great, thanks! And thanks for the kind words. Eager to learn more about your parsing project. Feel free to reach out via email. (Address in my bio.)