kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Grobid processes the entire text sometimes but other times it doesn't process correctly and returns None as the output #1196

Closed Odrec closed 1 week ago

Odrec commented 2 weeks ago

Operating System and architecture (arm64, amd64, x86, etc.)

Ubuntu Linux x86_64

What is your Java version

openjdk 11.0.24 2024-07-16

Log and information

No response

Further information

This is the error that sometimes shows up in my app, for the same PDFs that were working before. I'm not sure why it works sometimes and not others. This is the relevant part of the code: it tries to extract the entire text but returns None as the output.

    # Process the PDF using Grobid's processFulltextDocument for full text
    fulltext_response = client.process_pdf(
        "processFulltextDocument",
        tmp_file_name,
        generateIDs=True,
        consolidate_header=True,
        consolidate_citations=True,
        include_raw_citations=True,
        include_raw_affiliations=True,
        tei_coordinates=True,
        segment_sentences=True
    )

I restarted the grobid server several times and it still gives me the same error, even though it was working yesterday. Does anyone know why the result from the server would be None?

TypeError: a bytes-like object is required, not 'NoneType'
Traceback:
File "PycharmProjects/chat-with-docs/.venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "PycharmProjects/chat-with-docs/.venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 579, in code_to_exec
    exec(code, module.__dict__)
File "/home/odrec/PycharmProjects/chat-with-docs/test.py", line 256, in <module>
    title, authors, abstract, document_text, images_with_captions = process_pdf(pdf_bytes)
                                                                    ^^^^^^^^^^^^^^^^^^^^^^
File "PycharmProjects/chat-with-docs/test.py", line 205, in process_pdf
    root_fulltext = ET.fromstring(fulltext_xml_content)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/xml/etree/ElementTree.py", line 1335, in XML
    parser.feed(text)
lfoppiano commented 2 weeks ago

Hi @odrec, could you look for the logs in the grobid server? Which type of documents are you processing?

process_pdf should return a tuple containing the status code and the content, so you should check that the status code == 200 before trying to access the data.
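A minimal sketch of that check, assuming the `(input_path, status, text)` tuple shape that grobid_client_python's `process_pdf` returns (the `parse_fulltext` helper name is illustrative, not part of the client):

```python
import xml.etree.ElementTree as ET


def parse_fulltext(response):
    """Parse a Grobid process_pdf response, guarding against failures.

    `response` is assumed to be the (input_path, status, text) tuple
    returned by grobid_client_python; a non-200 status or an empty body
    means no TEI content is available, which is what causes the
    `ET.fromstring(None)` TypeError in the traceback above.
    """
    input_path, status, text = response
    if status != 200 or text is None:
        raise RuntimeError(
            f"Grobid failed on {input_path}: HTTP {status}, no TEI returned"
        )
    return ET.fromstring(text)
```

With this guard in place, a 408 response raises a descriptive error instead of crashing inside the XML parser.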

Odrec commented 2 weeks ago

> Hi @Odrec, could you look for the logs in the grobid server? Which type of documents are you processing?
>
> The process_pdf should return a tuple with two elements, the status code and the content, so you should check the status code == 200 before trying to access the data.

Hey! This is what process_pdf returned

('/tmp/tmpbweiaosz.pdf', 408, None)

I'll try to get the logs now and report back

Also, I've only tried with scientific papers. For example, this one fails every time for me.

Odrec commented 2 weeks ago

> Hi @Odrec, could you look for the logs in the grobid server? Which type of documents are you processing?
>
> The process_pdf should return a tuple with two elements, the status code and the content, so you should check the status code == 200 before trying to access the data.

I'm running grobid in a docker container. How can I access the logs? I don't see a logs directory:

    root@9fbe59fc8f3f:/opt/grobid# ls
    data  delft  grobid-home  grobid-service  preload_embeddings.py  resources-registry.json
lfoppiano commented 2 weeks ago

@Odrec you should see the logs, or at least the error if any, in the docker console. Could you process the same PDF from the grobid interface?

408 indicates a request timeout, so it might be that you are having other issues related to your network.
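Since a 408 means the request timed out before the server finished, one workaround is to retry the call with a growing delay. A generic sketch, assuming the `(path, status, text)` tuple shape returned by grobid_client_python (the `retry_on_timeout` helper and the delay values are illustrative, not part of the client):

```python
import time


def retry_on_timeout(call, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Retry `call` while it reports an HTTP 408 timeout.

    `call` is a zero-argument function returning a (path, status, text)
    tuple, e.g. a lambda wrapping client.process_pdf(...). The delay
    doubles on each retry (2s, 4s, ...). The `sleep` hook exists so the
    backoff can be stubbed out in tests.
    """
    result = call()
    for attempt in range(1, attempts):
        path, status, text = result
        if status != 408:
            break  # success or a non-timeout error: stop retrying
        sleep(base_delay * (2 ** (attempt - 1)))
        result = call()
    return result
```

If retries don't help, raising the client's own timeout setting (or processing smaller PDFs) may be the actual fix, since a consistently slow document will time out on every attempt.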

Odrec commented 1 week ago

Somehow this works with my remote server but not if I run grobid locally on my laptop, so I'll close this for now while I figure out what the problem is with my local instance. Thanks for the help!