monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

Issue parsing PDFs - UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4422: character maps to <undefined> #417

Closed by timalamenciak 3 months ago

timalamenciak commented 3 months ago

Trying to pull in the PDF from this article throws the below error: https://onlinelibrary.wiley.com/doi/10.1002/eco.1705

Other PDFs I tested fail with the same error.

ontogpt -vvv extract -t trek_2.yaml -i test1.pdf
INFO:root:Logger root set to level 10
INFO:root:Input file: test1.pdf
Traceback (most recent call last):
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Scripts\\ontogpt", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\Documents\Coding\TReK-OntoGPT\ontogpt\src\ontogpt\cli.py", line 329, in extract
    text = open(inputfile, "r").read()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4422: character maps to <undefined>
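
The last frame of the traceback is in `cp1252.py`, which suggests that `open(inputfile, "r")` fell back to the Windows locale encoding because no `encoding` was passed. Below is a minimal sketch of the same failure. The byte string is a made-up stand-in, not content from the actual PDF, and `try_decode` is a hypothetical helper written only for this illustration.

```python
# Hypothetical sample bytes; 0x8f has no character assigned in cp1252.
sample = b"%PDF-1.4 \x8f binary stream"


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short failure note if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason}"


# cp1252 leaves a few byte values (including 0x8f) unmapped, so this fails.
cp1252_result = try_decode(sample, "cp1252")

# latin-1 maps every byte to a code point, so it never raises,
# but the result is not necessarily meaningful text.
latin1_result = try_decode(sample, "latin-1")

print(cp1252_result)
print(repr(latin1_result))
```

Since a PDF is largely binary data, decoding it as text is likely to produce mostly unreadable output even with an encoding that never raises.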

timalamenciak commented 3 months ago

Update on this: the error came up again with text copied and pasted from a PDF, so I dug into the code. This block appears to be the cause (lines 324-329 of cli.py):


        if use_textract:
            import textract

            text = textract.process(inputfile).decode("utf-8")
        else:
            text = open(inputfile, "r").read()

In my own copy, I added `errors="ignore"` to the `open()` call. This skips any bytes that cannot be decoded, which may lose data, but I don't think that will be crippling for this package's use case.


        if use_textract:
            import textract

            text = textract.process(inputfile).decode("utf-8")
        else:
            text = open(inputfile, "r", errors="ignore").read()

Textract is still not working.
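
To show what the `errors` argument changes, here is a small sketch that compares `errors="ignore"` with `errors="replace"`. The helper name `read_text_lenient`, the explicit `encoding="utf-8"`, and the sample bytes are assumptions made for this example and are not taken from cli.py.

```python
import os
import tempfile


def read_text_lenient(path: str, encoding: str = "utf-8", errors: str = "ignore") -> str:
    """Read a file as text, applying the given policy to undecodable bytes."""
    # Passing encoding explicitly avoids depending on the platform default.
    with open(path, "r", encoding=encoding, errors=errors) as handle:
        return handle.read()


# Write a few bytes that include one invalid UTF-8 byte (0x8f).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"flow \x8f regime")
    tmp_path = tmp.name

ignored = read_text_lenient(tmp_path, errors="ignore")    # bad byte dropped silently
replaced = read_text_lenient(tmp_path, errors="replace")  # bad byte becomes U+FFFD
os.remove(tmp_path)

print(repr(ignored))
print(repr(replaced))
```

With `"replace"`, the positions where data was lost stay visible in the output, which may make it easier to judge how much of the input was affected.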

caufieldjh commented 3 months ago

Might just fix this with #421. In the meantime, I'll have a fix here shortly along the lines of what you suggest - though I don't recommend parsing entire PDFs with it unless you want to get a lot of unreadable characters.

caufieldjh commented 2 months ago

Hi @timalamenciak - give PDF parsing a try in v1.0.2 (just released) - it now uses the option --use-pdf instead of --use-textract

timalamenciak commented 2 months ago

Thrilling! That worked.

timalamenciak commented 2 months ago

Thanks @caufieldjh !