Open gjreda opened 1 year ago
Hi @gjreda, this problem is strange, I've double checked my docker installation and it works fine. Could you give me some more details about your docker installation?
Another thing you could do is open a bash shell in the running container and try to run pdfalto_server:
docker ps
should give you the list of running containers
docker exec -it {container_hash} /bin/bash
/opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server
to run pdfalto_server, which should print the help menu
Hi @lfoppiano thanks for the quick reply!
FWIW I'm on an M1 mac running macOS 13.3.1. I've also allocated 4 CPU and 4 GB of memory to docker.
greg@Gregs-MacBook-Air ~ % docker --version
Docker version 20.10.12, build e91ed57
greg@Gregs-MacBook-Air ~ % docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
98a83bb59614 lfoppiano/grobid:0.7.3 "./grobid-service/bi…" About an hour ago Up About an hour 0.0.0.0:8070->8070/tcp interesting_hellman
The help menu for pdfalto_server successfully prints as well.
greg@Gregs-MacBook-Air ~ % docker exec -it 98a83bb59614 /bin/bash
root@98a83bb59614:/opt/grobid# /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server
pdfalto version 0.5
Usage: pdfalto [options] <PDF-file> [<xml-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-verbose : display pdf attributes
-noImage : do not extract Images (Bitmap and Vectorial)
-noImageInline : deprecated
-outline : create an outline file xml
-annotation : create an annotations file xml
-noLineNumbers : do not output line numbers added in manuscript-style textual documents
-readingOrder : blocks follow the reading order
-noText : do not extract textual objects (might be useful, but non-valid ALTO)
-charReadingOrderAttr : include TYPE attribute to String elements to indicate right-to-left reading order (might be useful, but non-valid ALTO)
-fullFontName : fonts names are not normalized
-nsURI <string> : add the specified namespace URI
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-filesLimit <int> : limit of asset files be extracted
-q : don't print any messages or errors
-v : print version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
Happy to provide any other details that might be helpful!
@gjreda if you change the grobid address in the client configuration to https://kermitt2-grobid.hf.space does it work?
could you try to run pdfalto with a document?
apt-get update
apt-get install wget
wget https://mdr.nims.go.jp/downloads/wd375x09x?locale=en -o /tmp/bao.pdf
/opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 -l 2 /tmp/bao.pdf /tmp/bao.lxml --timeout 120
and let me know if it works. You can use any PDF.
@gjreda if you change the grobid address in the client configuration to https://kermitt2-grobid.hf.space/ does it work?
This worked!
could you try to run pdfalto with a document?
This did not work and ultimately threw out the following error:
root@a9fe3565b220:/opt/grobid# /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 -l 2 /tmp/bao.pdf /tmp/bao.lxml --timeout 120
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't read xref table
Syntax Warning: PDF file is damaged - attempting to reconstruct xref table...
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
Full details below
greg@Gregs-MacBook-Air grobid-demo % docker exec -it a9fe3565b220 /bin/bash
root@a9fe3565b220:/opt/grobid# apt-get update
Get:1 http://security.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Get:2 http://deb.debian.org/debian bullseye InRelease [116 kB]
Get:3 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [240 kB]
Get:5 http://deb.debian.org/debian bullseye/main amd64 Packages [8183 kB]
Get:6 http://deb.debian.org/debian bullseye-updates/main amd64 Packages [14.6 kB]
Fetched 8646 kB in 8s (1056 kB/s)
Reading package lists... Done
root@a9fe3565b220:/opt/grobid# apt-get install wget
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libpsl5 publicsuffix
The following NEW packages will be installed:
libpsl5 publicsuffix wget
0 upgraded, 3 newly installed, 0 to remove and 28 not upgraded.
Need to get 1149 kB of archives.
After this operation, 4001 kB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://deb.debian.org/debian bullseye/main amd64 libpsl5 amd64 0.21.0-1.2 [57.3 kB]
Get:2 http://deb.debian.org/debian bullseye/main amd64 wget amd64 1.21-1+deb11u1 [964 kB]
Get:3 http://deb.debian.org/debian bullseye/main amd64 publicsuffix all 20220811.1734-0+deb11u1 [127 kB]
Fetched 1149 kB in 0s (3393 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libpsl5:amd64.
(Reading database ... 7312 files and directories currently installed.)
Preparing to unpack .../libpsl5_0.21.0-1.2_amd64.deb ...
Unpacking libpsl5:amd64 (0.21.0-1.2) ...
Selecting previously unselected package wget.
Preparing to unpack .../wget_1.21-1+deb11u1_amd64.deb ...
Unpacking wget (1.21-1+deb11u1) ...
Selecting previously unselected package publicsuffix.
Preparing to unpack .../publicsuffix_20220811.1734-0+deb11u1_all.deb ...
Unpacking publicsuffix (20220811.1734-0+deb11u1) ...
Setting up libpsl5:amd64 (0.21.0-1.2) ...
Setting up wget (1.21-1+deb11u1) ...
Setting up publicsuffix (20220811.1734-0+deb11u1) ...
Processing triggers for libc-bin (2.31-13+deb11u3) ...
root@a9fe3565b220:/opt/grobid# wget https://mdr.nims.go.jp/downloads/wd375x09x?locale=en -o /tmp/bao.pdf
root@a9fe3565b220:/opt/grobid# ls -la /tmp/
total 20
drwxrwxrwt 1 root root 4096 May 17 18:00 .
drwxr-xr-x 1 root root 4096 May 17 17:59 ..
-rw-r--r-- 1 root root 1641 May 17 18:00 bao.pdf
drwxr-xr-x 1 root root 4096 May 17 17:59 hsperfdata_root
root@a9fe3565b220:/opt/grobid# /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 -l 2 /tmp/bao.pdf /tmp/bao.lxml --timeout 120
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't read xref table
Syntax Warning: PDF file is damaged - attempting to reconstruct xref table...
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
root@a9fe3565b220:/opt/grobid# ls -la
total 788
drwxr-xr-x 1 root root 4096 May 17 18:00 .
drwxr-xr-x 1 root root 4096 May 15 03:50 ..
drwxr-xr-x 1 root root 4096 May 17 17:59 grobid-home
drwxr-xr-x 4 root root 4096 May 15 03:52 grobid-service
drwxr-xr-x 2 root root 4096 May 17 17:59 logs
-rw-r--r-- 1 root root 774523 May 17 2020 'wd375x09x?locale=en'
root@a9fe3565b220:/opt/grobid# ls -la /tmp/
total 20
drwxrwxrwt 1 root root 4096 May 17 18:06 .
drwxr-xr-x 1 root root 4096 May 17 17:59 ..
-rw-r--r-- 1 root root 1641 May 17 18:00 bao.pdf
drwxr-xr-x 1 root root 4096 May 17 17:59 hsperfdata_root
mmm checking the downloaded file size, there is something weird:
This is correct:
-rw-r--r-- 1 root root 774523 May 17 2020 'wd375x09x?locale=en'
This is too small:
-rw-r--r-- 1 root root 1641 May 17 18:00 bao.pdf
Could you share the result of df -h?
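A possible explanation for the size mismatch, offered as an inference from the listings above rather than something confirmed in the thread: wget's lowercase -o writes the log to the given file, while uppercase -O names the downloaded document. That would account for the 1641-byte /tmp/bao.pdf (the log) and the 774523-byte 'wd375x09x?locale=en' saved in the working directory. A minimal sketch of a download-and-check step, where fetch_pdf and is_pdf are hypothetical helper names:

```shell
# Download a URL to a destination path.
# -O sets the output document (-o would set the log file instead);
# the URL is quoted so the shell does not interpret the "?".
fetch_pdf() {
  # $1 = URL, $2 = destination path
  wget -q -O "$2" "$1"
}

# Succeed only if the file starts with the PDF magic bytes "%PDF-".
is_pdf() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Example usage (needs network access, so it is left commented out):
# fetch_pdf 'https://mdr.nims.go.jp/downloads/wd375x09x?locale=en' /tmp/bao.pdf
# if is_pdf /tmp/bao.pdf; then echo "looks like a PDF"; else echo "not a PDF"; fi
```

Checking the header before calling pdfalto_server separates a bad download from a pdfalto problem.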
root@321d33972d1a:/opt/grobid# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 59G 24G 32G 44% /
tmpfs 64M 0 64M 0% /dev
shm 64M 0 64M 0% /dev/shm
/dev/vda1 59G 24G 32G 44% /etc/hosts
tmpfs 2.0G 0 2.0G 0% /sys/firmware
I'm out of ideas. 🤔 I'll run it on my M1 later and let you know if I encounter any issue.
@gjreda good news. I found the issue, and it is related to the M1. It seems that the fork mechanism no longer works (I don't yet understand why), so I had to add a parameter to the JDK: -Djdk.lang.Process.launchMechanism=vfork
I've pushed a new image, lfoppiano/grobid:0.7.3-arm, which should work on M1. Also, since it is still built for linux/amd64, I recommend updating Docker to version >=4.17 and enabling Rosetta: https://collabnix.com/warning-the-requested-images-platform-linux-amd64-does-not-match-the-detected-host-platform-linux-arm64-v8/
Could you try it out and let me know?
I'm sorry, at the moment I'm a bit short of time to provide a proper multiplatform image.
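For anyone trying the new image, a minimal sketch of how it might be started. The port mapping and flags are copied from the original docker run command in this issue; passing --platform linux/amd64 explicitly is my assumption, based on the note that the image is still built for linux/amd64. grobid_run_cmd is a hypothetical helper that only prints the command.

```shell
# Print a docker run command for the given image tag (does not run docker).
# Assumption: --platform linux/amd64 is requested explicitly so Docker
# does not have to guess on an arm64 host.
grobid_run_cmd() {
  image=${1:-lfoppiano/grobid:0.7.3-arm}
  printf 'docker run -t --rm -p 8070:8070 --platform linux/amd64 %s\n' "$image"
}

grobid_run_cmd
# To execute the command instead of printing it:
# eval "$(grobid_run_cmd)"
```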
@lfoppiano No need to apologize! I really appreciate your help.
The new image, upgrading docker, and enabling Rosetta got it working!
I'm still able to cause 500 errors if I request a larger batch (nine PDFs) on the first try, before the models have been loaded. This results in java.lang.OutOfMemoryError: Java heap space. However, if I immediately retry the same batch of files, it works. I suspect it is the combination of loading the models and requesting a larger batch that results in the OOM, as this does not happen if my first request is small (1-3 PDFs).
Another error that has popped up is rosetta error: futex(FUTEX_LOCK_PI_PRIVATE) failure: 35 in the container stdout, which, as expected, breaks the client-side connection, resulting in the below traceback. While I've seen this error a few times, I haven't been able to consistently reproduce it.
I'll follow up on this thread if I run into any more issues or figure out how to consistently reproduce the Rosetta error, but I think you've solved my issue. Thank you! I really appreciate your work.
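One way to act on the observation that a small first request avoids the OOM is to split the file list into small groups and send the first group on its own. This is only a sketch under that assumption; batches is a hypothetical helper and the group size of 3 is taken from the "1-3 PDFs" mentioned above.

```shell
# Hypothetical helper: print the arguments in groups of at most $1 per line.
# File names containing spaces are not handled in this sketch.
batches() {
  size=$1; shift
  line=""; count=0
  for f in "$@"; do
    line="${line:+$line }$f"
    count=$((count + 1))
    if [ "$count" -eq "$size" ]; then
      printf '%s\n' "$line"
      line=""; count=0
    fi
  done
  # Emit the final, possibly shorter, group.
  if [ -n "$line" ]; then printf '%s\n' "$line"; fi
}

# Prints "a.pdf b.pdf c.pdf" on one line, then "d.pdf" on the next.
batches 3 a.pdf b.pdf c.pdf d.pdf
```

Each printed line can then be handed to whichever client call you use, with the first line sent alone so the models are loaded before larger requests arrive.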
@gjreda thanks! I will do more tests in the coming weeks and update the documentation accordingly. Support on M1 is a bit of a grey area for me too.
I've done some more tests: I could process several PDFs until the server stopped answering. Something is not working well in the interface with pdfalto, and it's only a problem on M1.
For the OOM, I suggest adding 2 GB more RAM. In general Grobid should run without problems with 4 GB, but it seems that with Rosetta 4 GB is not enough.
We could solve all these problems with an arm64 build; however, this will take some time.
If you have time please check this #1165
Hi grobid team!
I'm running the lightweight version of grobid via the docker container. I'm using 0.7.3.
Starting the container via
docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.7.3
works as expected and I'm able to load the web service at localhost:8070. However, when I load a PDF and submit the request, I get the error below.
The docker container outputs the attached errors and stacktrace: upload-errors.txt
Maybe relatedly, when using the python client, the service seems to get called properly, but errors as seen below.
I can also see that the txt files are created in the output directory, though they are empty (makes sense given the errors).
The docker container outputs the attached errors and stacktrace: api-errors.txt
Any idea what the underlying issue is? Am I calling the service improperly? Any help is very much appreciated!