py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

`UnboundLocalError` error when extracting text #2933

Closed (vodkar closed this issue 10 hours ago)

vodkar commented 10 hours ago

I was trying to parse a PDF paper. When I extracted text from it, I got this error: UnboundLocalError: cannot access local variable 'v' where it is not associated with a value

Environment

ARM macOS 15.1, Python 3.12.2, pypdf==5.1.0

$ python -m platform
macOS-15.1-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

import pypdf

pdf = pypdf.PdfReader("2305.09315.pdf")
for page in pdf.pages:
    print(page.page_number)
    print(page.extract_text())

2305.09315.pdf
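
Until the underlying bug is fixed, one possible workaround is to extract text page by page and record failures instead of letting the first exception abort the whole run. The sketch below is only an illustration: `extract_text_per_page` is a hypothetical helper, not part of pypdf, and it assumes nothing about the pages beyond an `extract_text()` method such as the one used in the example above.

```python
from typing import Any, Iterable, List, Optional, Tuple


def extract_text_per_page(
    pages: Iterable[Any],
) -> List[Tuple[int, Optional[str], Optional[Exception]]]:
    """Hypothetical helper: call extract_text() on each page and collect
    (index, text, error) so one failing page does not stop the loop."""
    results: List[Tuple[int, Optional[str], Optional[Exception]]] = []
    for index, page in enumerate(pages):
        try:
            results.append((index, page.extract_text(), None))
        except Exception as exc:  # broad on purpose: we only want to log and continue
            results.append((index, None, exc))
    return results


# Intended usage with the reproduction above (requires pypdf and the PDF file):
#   reader = pypdf.PdfReader("2305.09315.pdf")
#   for index, text, error in extract_text_per_page(reader.pages):
#       print(index, repr(error) if error else text)
```

This only hides the symptom: pages that hit the error yield no text at all, so it is useful for finding which pages are affected rather than as a fix.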

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/Users/somen/.pyenv/versions/3.12.2/lib/python3.12/runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/somen/.pyenv/versions/3.12.2/lib/python3.12/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "/Users/somen/.vscode/extensions/ms-python.debugpy-2024.12.0-darwin-arm64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
    cli.main()
  File "/Users/somen/.vscode/extensions/ms-python.debugpy-2024.12.0-darwin-arm64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
    run()
  File "/Users/somen/.vscode/extensions/ms-python.debugpy-2024.12.0-darwin-arm64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/Users/somen/.vscode/extensions/ms-python.debugpy-2024.12.0-darwin-arm64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/somen/.vscode/extensions/ms-python.debugpy-2024.12.0-darwin-arm64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/Users/somen/.vscode/extensions/ms-python.debugpy-2024.12.0-darwin-arm64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "/Users/somen/Zavodi/unik/llm4codesec_literature_review/test.py", line 6, in <module>
    print(page.extract_text())
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/somen/Zavodi/unik/llm4codesec_literature_review/.venv/lib/python3.12/site-packages/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/somen/Zavodi/unik/llm4codesec_literature_review/.venv/lib/python3.12/site-packages/pypdf/_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/somen/Zavodi/unik/llm4codesec_literature_review/.venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 34, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/somen/Zavodi/unik/llm4codesec_literature_review/.venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 57, in build_char_map_from_dict
    encoding, map_dict = get_encoding(ft)
                         ^^^^^^^^^^^^^^^^
  File "/Users/somen/Zavodi/unik/llm4codesec_literature_review/.venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 130, in get_encoding
    map_dict, int_entry = _parse_to_unicode(ft)
                          ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/somen/Zavodi/unik/llm4codesec_literature_review/.venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 213, in _parse_to_unicode
    return _type1_alternative(ft, map_dict, int_entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/somen/Zavodi/unik/llm4codesec_literature_review/.venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 531, in _type1_alternative
    map_dict[chr(i)] = v
                       ^
UnboundLocalError: cannot access local variable 'v' where it is not associated with a value
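
For readers unfamiliar with this error class: the traceback only shows that `map_dict[chr(i)] = v` ran before `v` had been assigned. The sketch below reproduces that general pattern in isolation. It is not pypdf's actual code; the function names and the `names` lookup are made up for illustration, and the real cause inside `_type1_alternative` may differ.

```python
def build_map_buggy(codes, names):
    """Hypothetical: 'v' is assigned only inside a conditional branch."""
    map_dict = {}
    for i in codes:
        if i in names:
            v = names[i]
        # If the first code has no entry in 'names', 'v' was never bound here.
        # On later iterations a stale 'v' from an earlier code is silently reused.
        map_dict[chr(i)] = v
    return map_dict


def build_map_safe(codes, names):
    """Hypothetical fix: look the value up on every iteration and skip misses."""
    map_dict = {}
    for i in codes:
        v = names.get(i)
        if v is None:
            continue  # no mapping for this code, so leave it out
        map_dict[chr(i)] = v
    return map_dict


try:
    build_map_buggy([65], {})
except UnboundLocalError as exc:
    print(type(exc).__name__, exc)

print(build_map_safe([65, 66], {66: "B"}))
```

The second function avoids both the exception and the quieter problem of reusing a value left over from a previous loop iteration.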

stefan6419846 commented 10 hours ago

Duplicate of #2925.