kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Errors using the lightweight docker container (v0.7.3) #1014

Open gjreda opened 1 year ago

gjreda commented 1 year ago

Hi grobid team!

I'm running the lightweight version of grobid via the docker container. I'm using 0.7.3.

Starting the container via docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.7.3 works as expected and I'm able to load the web service at localhost:8070. However, when I load a PDF and submit the request, I get the error below

[screenshot of the error; image not preserved]

The docker container outputs the attached errors and stacktrace: upload-errors.txt

Maybe relatedly, when using the python client, the service seems to get called properly, but errors as seen below.

In [1]: from grobid_client.grobid_client import GrobidClient

In [2]: client = GrobidClient(grobid_server='http://localhost:8070')
GROBID server is up and running

In [3]: client.process('processHeaderDocument', "../gpt-pdf-bot/papers", output="./output", force=True)

Processing of ../gpt-pdf-bot/papers/Machine Learning at Scale.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin12017177478640739195.pdf
Processing of ../gpt-pdf-bot/papers/Software Engineering Practices for Machine Learning.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin15203762064446348646.pdf
Processing of ../gpt-pdf-bot/papers/A Few Useful Things to Know about Machine Learning.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin12500897149320232108.pdf
Processing of ../gpt-pdf-bot/papers/What’s your ML Test Score - A rubric for ML production systems.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin17788306898358675868.pdf
Processing of ../gpt-pdf-bot/papers/Rules of Machine Learning - Best Practices for ML Engineering.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin10395303085152685441.pdf
Processing of ../gpt-pdf-bot/papers/The ML Test Score - A Rubric for ML Production Readiness and Technical Debt Reduction.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin3260505895032721432.pdf
Processing of ../gpt-pdf-bot/papers/Operationalizing Machine Learning - An Interview Study.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin61813132851985820.pdf
Processing of ../gpt-pdf-bot/papers/Machine Learning - The High-Interest Credit Card of Technical Debt.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin9955097769034592732.pdf
Processing of ../gpt-pdf-bot/papers/Hidden Technical Debt in Machine Learning Systems.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin8850545049848981842.pdf

I can also see that the txt files are created in the output directory, though they are empty (makes sense given the errors).

greg@Gregs-MacBook-Air output % ls -la
total 72
drwxr-xr-x@ 11 greg  staff  352 May 16 17:07 .
drwxr-xr-x@  7 greg  staff  224 May 16 15:40 ..
-rw-r--r--@  1 greg  staff  114 May 16 17:27 A Few Useful Things to Know about Machine Learning_500.txt
-rw-r--r--@  1 greg  staff  113 May 16 17:27 Hidden Technical Debt in Machine Learning Systems_500.txt
-rw-r--r--@  1 greg  staff  113 May 16 17:27 Machine Learning - The High-Interest Credit Card of Technical Debt_500.txt
-rw-r--r--@  1 greg  staff  114 May 16 17:27 Machine Learning at Scale_500.txt
-rw-r--r--@  1 greg  staff  111 May 16 17:27 Operationalizing Machine Learning - An Interview Study_500.txt
-rw-r--r--@  1 greg  staff  114 May 16 17:27 Rules of Machine Learning - Best Practices for ML Engineering_500.txt
-rw-r--r--@  1 greg  staff  114 May 16 17:27 Software Engineering Practices for Machine Learning_500.txt
-rw-r--r--@  1 greg  staff  113 May 16 17:27 The ML Test Score - A Rubric for ML Production Readiness and Technical Debt Reduction_500.txt
-rw-r--r--@  1 greg  staff  114 May 16 17:27 What’s your ML Test Score - A rubric for ML production systems_500.txt
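For a larger run it can help to summarize which documents failed rather than reading the listing by eye. The following is a minimal sketch, assuming the client names its error files `<document name>_<HTTP status>.txt`; that pattern is inferred from the listing above, not taken from the client's documentation. The function name `failures_by_status` is hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming pattern, inferred from the directory listing in this thread:
# "<document name>_<three-digit HTTP status>.txt"
ERROR_FILE = re.compile(r"^(?P<name>.+)_(?P<status>\d{3})\.txt$")


def failures_by_status(output_dir):
    """Group error files found in output_dir by the status code in their name."""
    grouped = defaultdict(list)
    for path in sorted(Path(output_dir).iterdir()):
        match = ERROR_FILE.match(path.name)
        if path.is_file() and match:
            grouped[int(match.group("status"))].append(match.group("name"))
    return dict(grouped)


if __name__ == "__main__":
    # "./output" is the output directory used in the session above.
    for status, names in failures_by_status("./output").items():
        print(status, len(names), "document(s)")
```

Files that do not match the pattern (for example successful results) are ignored, so the function can be pointed at a mixed output directory.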

The docker container outputs the attached errors and stacktrace: api-errors.txt

Any idea what the underlying issue is? Am I calling the service improperly? Any help is very much appreciated!

lfoppiano commented 1 year ago

Hi @gjreda, this problem is strange, I've double checked my docker installation and it works fine. Could you give me some more details about your docker installation?

Another thing you could do is open a bash shell in the running container and try to run pdfalto_server:

gjreda commented 1 year ago

Hi @lfoppiano thanks for the quick reply!

FWIW I'm on an M1 mac running macOS 13.3.1. I've also allocated 4 CPU and 4 GB of memory to docker.

greg@Gregs-MacBook-Air ~ % docker --version
Docker version 20.10.12, build e91ed57
greg@Gregs-MacBook-Air ~ % docker ps
CONTAINER ID   IMAGE                    COMMAND                  CREATED             STATUS             PORTS                    NAMES
98a83bb59614   lfoppiano/grobid:0.7.3   "./grobid-service/bi…"   About an hour ago   Up About an hour   0.0.0.0:8070->8070/tcp   interesting_hellman

The help menu for pdfalto_server successfully prints as well.

greg@Gregs-MacBook-Air ~ % docker exec -it 98a83bb59614  /bin/bash
root@98a83bb59614:/opt/grobid# /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server
pdfalto version 0.5
Usage: pdfalto [options] <PDF-file> [<xml-file>]
  -f <int>                      : first page to convert
  -l <int>                      : last page to convert
  -verbose                      : display pdf attributes
  -noImage                      : do not extract Images (Bitmap and Vectorial)
  -noImageInline                : deprecated
  -outline                      : create an outline file xml
  -annotation                   : create an annotations file xml
  -noLineNumbers                : do not output line numbers added in manuscript-style textual documents
  -readingOrder                 : blocks follow the reading order
  -noText                       : do not extract textual objects (might be useful, but non-valid ALTO)
  -charReadingOrderAttr         : include TYPE attribute to String elements to indicate right-to-left reading order (might be useful, but non-valid ALTO)
  -fullFontName                 : fonts names are not normalized
  -nsURI <string>               : add the specified namespace URI
  -opw <string>                 : owner password (for encrypted files)
  -upw <string>                 : user password (for encrypted files)
  -filesLimit <int>             : limit of asset files be extracted
  -q                            : don't print any messages or errors
  -v                            : print version info
  -h                            : print usage information
  -help                         : print usage information
  --help                        : print usage information
  -?                            : print usage information

Happy to provide any other details that might be helpful!

lfoppiano commented 1 year ago

@gjreda if you change the grobid address in the client configuration to https://kermitt2-grobid.hf.space does it work?

could you try to run pdfalto with a document?

and let me know if it works, you can use any PDF

gjreda commented 1 year ago

> @gjreda if you change the grobid address in the client configuration to https://kermitt2-grobid.hf.space/ does it work?

This worked!

> could you try to run pdfalto with a document?

This did not work; it ultimately produced the following errors:

root@a9fe3565b220:/opt/grobid# /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 -l 2 /tmp/bao.pdf /tmp/bao.lxml --timeout 120
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't read xref table
Syntax Warning: PDF file is damaged - attempting to reconstruct xref table...
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

Full details below

greg@Gregs-MacBook-Air grobid-demo % docker exec -it a9fe3565b220 /bin/bash
root@a9fe3565b220:/opt/grobid# apt-get update
Get:1 http://security.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Get:2 http://deb.debian.org/debian bullseye InRelease [116 kB]
Get:3 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [240 kB]
Get:5 http://deb.debian.org/debian bullseye/main amd64 Packages [8183 kB]
Get:6 http://deb.debian.org/debian bullseye-updates/main amd64 Packages [14.6 kB]
Fetched 8646 kB in 8s (1056 kB/s)
Reading package lists... Done

root@a9fe3565b220:/opt/grobid# apt-get install wget
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libpsl5 publicsuffix
The following NEW packages will be installed:
  libpsl5 publicsuffix wget
0 upgraded, 3 newly installed, 0 to remove and 28 not upgraded.
Need to get 1149 kB of archives.
After this operation, 4001 kB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://deb.debian.org/debian bullseye/main amd64 libpsl5 amd64 0.21.0-1.2 [57.3 kB]
Get:2 http://deb.debian.org/debian bullseye/main amd64 wget amd64 1.21-1+deb11u1 [964 kB]
Get:3 http://deb.debian.org/debian bullseye/main amd64 publicsuffix all 20220811.1734-0+deb11u1 [127 kB]
Fetched 1149 kB in 0s (3393 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libpsl5:amd64.
(Reading database ... 7312 files and directories currently installed.)
Preparing to unpack .../libpsl5_0.21.0-1.2_amd64.deb ...
Unpacking libpsl5:amd64 (0.21.0-1.2) ...
Selecting previously unselected package wget.
Preparing to unpack .../wget_1.21-1+deb11u1_amd64.deb ...
Unpacking wget (1.21-1+deb11u1) ...
Selecting previously unselected package publicsuffix.
Preparing to unpack .../publicsuffix_20220811.1734-0+deb11u1_all.deb ...
Unpacking publicsuffix (20220811.1734-0+deb11u1) ...
Setting up libpsl5:amd64 (0.21.0-1.2) ...
Setting up wget (1.21-1+deb11u1) ...
Setting up publicsuffix (20220811.1734-0+deb11u1) ...
Processing triggers for libc-bin (2.31-13+deb11u3) ...

root@a9fe3565b220:/opt/grobid# wget https://mdr.nims.go.jp/downloads/wd375x09x?locale=en -o /tmp/bao.pdf

root@a9fe3565b220:/opt/grobid# ls -la /tmp/
total 20
drwxrwxrwt 1 root root 4096 May 17 18:00 .
drwxr-xr-x 1 root root 4096 May 17 17:59 ..
-rw-r--r-- 1 root root 1641 May 17 18:00 bao.pdf
drwxr-xr-x 1 root root 4096 May 17 17:59 hsperfdata_root

root@a9fe3565b220:/opt/grobid# /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 -l 2 /tmp/bao.pdf /tmp/bao.lxml --timeout 120
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't read xref table
Syntax Warning: PDF file is damaged - attempting to reconstruct xref table...
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

root@a9fe3565b220:/opt/grobid# ls -la
total 788
drwxr-xr-x 1 root root   4096 May 17 18:00  .
drwxr-xr-x 1 root root   4096 May 15 03:50  ..
drwxr-xr-x 1 root root   4096 May 17 17:59  grobid-home
drwxr-xr-x 4 root root   4096 May 15 03:52  grobid-service
drwxr-xr-x 2 root root   4096 May 17 17:59  logs
-rw-r--r-- 1 root root 774523 May 17  2020 'wd375x09x?locale=en'

root@a9fe3565b220:/opt/grobid# ls -la /tmp/
total 20
drwxrwxrwt 1 root root 4096 May 17 18:06 .
drwxr-xr-x 1 root root 4096 May 17 17:59 ..
-rw-r--r-- 1 root root 1641 May 17 18:00 bao.pdf
drwxr-xr-x 1 root root 4096 May 17 17:59 hsperfdata_root

lfoppiano commented 1 year ago

Hmm, checking the size of the downloaded file, something looks off:

This is correct:

-rw-r--r-- 1 root root 774523 May 17  2020 'wd375x09x?locale=en'

This is too small:

-rw-r--r-- 1 root root 1641 May 17 18:00 bao.pdf

Could you share the result of df -h?
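One possible explanation for the size mismatch: in wget, lowercase `-o` names the file that receives the log messages, while uppercase `-O` names the output document. That would be consistent with a 1641-byte `/tmp/bao.pdf` and a 774523-byte `wd375x09x?locale=en` in the working directory. A quick way to tell whether a file is a PDF at all, before handing it to pdfalto, is to look for the `%PDF-` marker near its start. The sketch below is a hedged illustration; the function name `looks_like_pdf` and the 1024-byte probe size are choices made here, not part of GROBID or pdfalto.

```python
def looks_like_pdf(path, probe_bytes=1024):
    """Return True if the "%PDF-" marker appears within the first probe_bytes bytes."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes)
    return b"%PDF-" in head


if __name__ == "__main__":
    # Path taken from the session above.
    print(looks_like_pdf("/tmp/bao.pdf"))
```

A `False` result here would match the "May not be a PDF file" warning that pdfalto printed.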

gjreda commented 1 year ago

root@321d33972d1a:/opt/grobid# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          59G   24G   32G  44% /
tmpfs            64M     0   64M   0% /dev
shm              64M     0   64M   0% /dev/shm
/dev/vda1        59G   24G   32G  44% /etc/hosts
tmpfs           2.0G     0  2.0G   0% /sys/firmware

lfoppiano commented 1 year ago

I'm out of ideas. 🤔 I'll run it on my M1 later and let you know if I encounter any issue.

lfoppiano commented 1 year ago

@gjreda good news. I found the issue, and it is related to the M1. It seems that the fork mechanism no longer works there (I don't yet understand why), so I had to add a JVM parameter: -Djdk.lang.Process.launchMechanism=vfork

I've pushed a new image, lfoppiano/grobid:0.7.3-arm, which should work on M1. Since it is still built for linux/amd64, I also recommend updating Docker Desktop to version 4.17 or later and enabling Rosetta: https://collabnix.com/warning-the-requested-images-platform-linux-amd64-does-not-match-the-detected-host-platform-linux-arm64-v8/

Could you try it out and let me know?

I'm sorry, at the moment I'm a bit short of time to provide a proper multiplatform image.
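For anyone scripting the container start across mixed hardware, the choice between the two image tags mentioned in this thread can be sketched as below. This is only an illustration: the helper name `grobid_run_command` is hypothetical, and passing `--platform linux/amd64` on Apple Silicon is an assumption based on the image being built for linux/amd64, not something the thread states is required.

```python
import platform


def grobid_run_command(machine=None, host_port=8070):
    """Build a `docker run` argument list for the GROBID image tags named in this thread."""
    machine = (machine or platform.machine()).lower()
    on_arm = machine in ("arm64", "aarch64")
    # Tags as given in the thread; the "-arm" image carries the vfork workaround.
    image = "lfoppiano/grobid:0.7.3-arm" if on_arm else "lfoppiano/grobid:0.7.3"
    command = ["docker", "run", "-t", "--rm", "-p", f"{host_port}:8070"]
    if on_arm:
        # Assumption: request the amd64 variant explicitly, since the image is built for linux/amd64.
        command += ["--platform", "linux/amd64"]
    command.append(image)
    return command


if __name__ == "__main__":
    print(" ".join(grobid_run_command()))
```

The function only builds the command; running it (for example with `subprocess.run`) is left to the caller.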

gjreda commented 1 year ago

@lfoppiano No need to apologize! I really appreciate your help.

The new image, upgrading docker, and enabling Rosetta got it working!

I'm still able to cause 500 errors if I request a larger batch - nine pdfs - on the first try, before the models have been loaded. This results in java.lang.OutOfMemoryError: Java heap space. However, if I immediately try the same batch of files, it works. I suspect it is the combination of both loading the models and requesting a larger batch that results in the OOM as this does not happen if my first request is small (1-3 pdfs).

Here is the python script I am using to test
```python
from grobid_client.grobid_client import GrobidClient

server = 'http://localhost:8070'
client = GrobidClient(grobid_server=server, timeout=600)
client.process(
    'processHeaderDocument',
    input_path='docs',
    output='test',
    force=True,
    verbose=True
)
```
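Since the out-of-memory error only appeared when the first request was large, one workaround to try is sending a small warm-up request before the main batches. The sketch below only plans the split; the function name `warmup_then_batches` and the default sizes are assumptions chosen to mirror the "1-3 pdfs" observation above. How each batch is submitted is left out: the script above passes a directory to the client, so each batch would need to be staged in its own directory.

```python
def warmup_then_batches(paths, warmup_size=1, batch_size=3):
    """Split paths into a small warm-up batch followed by fixed-size batches."""
    if warmup_size < 1 or batch_size < 1:
        raise ValueError("warmup_size and batch_size must be at least 1")
    paths = list(paths)
    batches = []
    if paths:
        # The first request stays small so that model loading and PDF
        # processing do not compete for heap at the same time.
        batches.append(paths[:warmup_size])
    rest = paths[warmup_size:]
    for start in range(0, len(rest), batch_size):
        batches.append(rest[start:start + batch_size])
    return batches


if __name__ == "__main__":
    for batch in warmup_then_batches([f"paper{i}.pdf" for i in range(9)]):
        print(batch)
```

Whether this avoids the heap error under Rosetta is untested here; it only encodes the ordering described in the comment.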

Another error that has popped up is rosetta error: futex(FUTEX_LOCK_PI_PRIVATE) failure: 35 in the container stdout, which, as expected, breaks the client-side connection and results in the traceback below. While I've seen this error a few times, I haven't been able to reproduce it consistently.

Broken connection traceback for client
```
Traceback (most recent call last):
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connection.py", line 454, in getresponse
    httplib_response = super().getresponse()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connection.py", line 454, in getresponse
    httplib_response = super().getresponse()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 6, in
    client.process(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/grobid_client/grobid_client.py", line 145, in process
    self.process_batch(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/grobid_client/grobid_client.py", line 212, in process_batch
    input_file, status, text = r.result()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/grobid_client/grobid_client.py", line 278, in process_pdf
    res, status = self.post(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/grobid_client/client.py", line 185, in post
    return self.call_api(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/grobid_client/client.py", line 121, in call_api
    r = requests.request(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
```
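Because the dropped connection is intermittent, a client-side retry with a growing delay may be enough to ride it out while the root cause is investigated. This is a generic sketch, not part of the GROBID client: `call_with_retries` and its parameters are hypothetical names, and the exception type to retry on is passed in by the caller.

```python
import time


def call_with_retries(func, retry_on, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a retry_on exception, wait and try again, up to `attempts` calls."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With the script above, this could be called as `call_with_retries(lambda: client.process(...), requests.exceptions.ConnectionError)`, the exception type shown at the end of the traceback. Whether a retry succeeds depends on the container still being alive after the Rosetta error, which this thread does not establish.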

I'll follow up on this thread if I run into any more issues or figure out how to consistently reproduce the Rosetta error, but I think you've solved my issue. Thank you! I really appreciate your work.

lfoppiano commented 1 year ago

@gjreda thanks! I will do more tests in the coming weeks and update the documentation accordingly. Support on M1 is a bit of a grey area for me too.

lfoppiano commented 1 year ago

I've done some more tests: I could process several PDFs until the server stopped answering. Something is not working well in the interface with pdfalto, and it is only a problem on M1.

For the OOM, I suggest adding 2 GB more RAM. In general GROBID should run without problems with 4 GB, but it seems that with Rosetta 4 GB is not enough.

We could solve all these problems with an arm64 build; however, this will take some time.