kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Grobid docker container - location of grobid-trainer #1167

Open cboulanger opened 3 weeks ago

cboulanger commented 3 weeks ago

Hi, I am using the grobid/grobid:0.8.0 docker container converted to an apptainer image. I want to add new bibliographic training data and use the Training API. However, I cannot find the location of the training files. Are they omitted from the docker image?

lfoppiano commented 3 weeks ago

Hi @cboulanger, good question indeed. To avoid creating docker images that are too big, the training data are not included by default. You could mount the training data directory (which makes it easier to add / remove files) when you run the docker image.

I would mount a volume linking the local grobid-trainer directory to the docker image's directory /opt/grobid/grobid-trainer. I've checked and debugged the code locally, but I have not tested whether this works with the docker image.
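
A minimal sketch of such a mount, assuming the Grobid sources are checked out under ./grobid on the host (a hypothetical path) and that the image tag is grobid/grobid:0.8.0. The function only prints the command so it can be reviewed first; like the suggestion above, it is untested against the actual image.

```shell
#!/bin/sh
# Print a docker command that mounts the host's grobid-trainer directory
# into the container at /opt/grobid/grobid-trainer.
# Assumptions: the host path and image tag passed in are placeholders;
# 8070 is Grobid's default service port.
grobid_run_cmd() {
    trainer_dir=$1   # absolute host path to grobid-trainer (hypothetical)
    image=$2         # e.g. grobid/grobid:0.8.0
    printf 'docker run --rm -p 8070:8070 -v %s:/opt/grobid/grobid-trainer %s\n' \
        "$trainer_dir" "$image"
}

# Dry run: show the command for a checkout below the current directory.
grobid_run_cmd "$PWD/grobid/grobid-trainer" grobid/grobid:0.8.0
```

The host path is made absolute with $PWD because docker treats the left-hand side of -v as a bind mount only when it is a path.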

cboulanger commented 3 weeks ago

Hi, thanks!

I have downloaded the Grobid source and bound the training-related directories to the container (from the 0.8.1-full image). Now I run into the next problem. I want to create the training files from a bunch of PDFs in a directory on the host. I do this:

module load apptainer/1.3.2 && apptainer exec \
    --nv --no-mount home,cwd --cleanenv --writable-tmpfs \
    --bind ./lib/grobid_0.8.1/grobid-trainer/:/opt/grobid/grobid-trainer \
    --bind ./output/grobid-training:/grobid-training \
    --bind ./pdf:/pdf \
    ~/sif/grobid_0.8.1.sif \
    java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep \
        -jar /opt/grobid/grobid-service/lib/grobid-core-0.8.2-SNAPSHOT.jar \
        -gH /grobid-home \
        -dIn /pdf -dOut /grobid-training \
        -exe createTraining

But I am getting this:

Note: APPTAINER_CACHEDIR and APPTAINER_TMPDIR not set, set manually before building images!
INFO:    fuse2fs not found, will not be able to mount EXT3 filesystems
INFO:    gocryptfs not found, will not be able to use gocryptfs
no main manifest attribute, in /opt/grobid/grobid-service/lib/grobid-core-0.8.2-SNAPSHOT.jar

How do I need to rewrite the above command to make it succeed? Thank you!

lfoppiano commented 3 weeks ago

Hi @cboulanger I'm not sure what you're trying to do (and... I have no experience with apptainer).

What I had in mind when I wrote my previous comment was that you mount the directory from the host machine into the container, given that you have access to it. You can then work on the files whether or not the container is running.

The Java command should be called from the host, but I'm not sure it's actually possible in your case. 🤔

cboulanger commented 3 weeks ago

What I am trying to do is to use Grobid on a High Performance Cluster which runs jobs only in containerized form. This means that I cannot do stuff involving a GPU unless it runs inside a container. Maybe that's not necessary for the "createTraining" batch job but it probably is for others.

But to solve the problem at hand: the problem seems to be that I would need to build the project locally to have the compiled jar files on the host, wouldn't I? That would defeat the purpose of using the images in order not to have to set up a build environment. Or did I misunderstand something?

lfoppiano commented 3 weeks ago

ah, sorry, you would need to run the grobid-core-0.8.2-SNAPSHOT-onejar.jar which is the self-contained executable JAR
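
The "no main manifest attribute" error means the JAR's manifest has no Main-Class entry, so java -jar has no entry point. A small sketch for checking this on any JAR; the helper names are made up for illustration, and jar_is_executable assumes unzip is installed.

```shell
#!/bin/sh
# Succeeds if the manifest text on stdin declares a Main-Class entry,
# which is what `java -jar` needs in order to find an entry point.
has_main_class() {
    grep -q '^Main-Class:'
}

# Hypothetical helper: read the manifest out of a JAR and test it.
# Requires `unzip`; the JAR path is whatever you pass in.
jar_is_executable() {
    unzip -p "$1" META-INF/MANIFEST.MF | has_main_class
}
```

For example, jar_is_executable grobid-core-0.8.2-SNAPSHOT.jar would be expected to fail, given the error reported above.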

cboulanger commented 3 weeks ago

Thank you, sorry to be such a bother, where do I find that file :-) ?

lfoppiano commented 3 weeks ago

It should be under build/libs:

ubuntu@ip-172-31-24-40:~/grobid/grobid-core/build/libs$ ls
grobid-core-0.8.2-SNAPSHOT-onejar.jar  grobid-core-0.8.2-SNAPSHOT-sources.jar  grobid-core-0.8.2-SNAPSHOT.jar

cboulanger commented 3 weeks ago

Ok, I see: grobid-core is not part of the container. So I guess I won't be able to avoid setting up a development environment and building the project... Hope I'll manage!
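
For anyone wanting to verify this on their own image or checkout, a minimal sketch that searches a directory tree for the onejar build. The file name pattern is taken from the listing above; the function name is hypothetical.

```shell
#!/bin/sh
# List every onejar build of grobid-core below a directory.
# Assumption: the file name follows the pattern seen in the listing above
# (grobid-core-<version>-onejar.jar).
find_onejar() {
    find "$1" -type f -name 'grobid-core-*-onejar.jar' 2>/dev/null
}
```

On the host this would be called as find_onejar ./lib/grobid_0.8.1. Inside the image, the same search can be run with apptainer exec ~/sif/grobid_0.8.1.sif find /opt/grobid -name 'grobid-core-*-onejar.jar'; empty output means the JAR is not there.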

lfoppiano commented 3 weeks ago

Yes, the docker image has been built to be efficient in terms of disk space, so we've left out everything that is not needed to run the service, but we might consider having a docker image that allows performing both evaluation and training.

kermitt2 commented 3 weeks ago

Hello! There is a training web API already part of the Grobid service (typically as container with mounted paths), to start a training, get progress info, evaluation and fetch the trained model. A simple addition for this API would be the "createTraining" with a PDF as input and that should allow to do the full training without command line.

cboulanger commented 3 weeks ago

> Hello! There is a training web API already part of the Grobid service (typically as container with mounted paths), to start a training, get progress info, evaluation and fetch the trained model. A simple addition for this API would be the "createTraining" with a PDF as input and that should allow to do the full training without command line.

That would be great. I am trying to build from source and am already running into problems with the Java version. An image that includes training would be most useful!

cboulanger commented 3 weeks ago

For the record: I am sure you mention it somewhere in the docs but it wasn't immediately clear to me that I had to use Java 11 instead of the Java 21 that I had in my environment. Then it worked as described in the docs with

java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep \
    -jar ./lib/grobid_0.8.1/grobid-core/build/libs/grobid-core-0.8.1-onejar.jar \
    -gH ./lib/grobid_0.8.1/grobid-home \
    -dIn ./pdf -dOut ./output/grobid-training \
    -exe createTraining

But as we agreed, it would be much better to have a special version of the image that includes all training-related commands as services that can be invoked via the API. If the API clients could take advantage of those new API methods, it would be even better. Until then, I now know how to run the commands from the built source. Thanks!
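
Since the Java version was the stumbling block here, a small sketch that extracts the major version from java -version output before starting the build. The function names are made up for illustration; the only fact taken from this thread is that Java 11 worked and Java 21 did not.

```shell
#!/bin/sh
# Extract the major Java version from a `java -version` line,
# e.g. 'openjdk version "11.0.22" 2024-01-16' yields 11.
# Legacy "1.x" version strings (Java 8 and older) map to x.
java_major() {
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    case $ver in
        1.*) ver=${ver#1.}; echo "${ver%%[._]*}" ;;
        *)   echo "${ver%%[.+_-]*}" ;;
    esac
}

# Hypothetical guard: succeeds only if the `java` on PATH reports version 11.
# `java -version` writes to stderr, hence the redirection.
require_java11() {
    line=$(java -version 2>&1 | grep 'version "' | head -n 1)
    [ "$(java_major "$line")" = "11" ]
}
```

A build script could then start with require_java11 || { echo "Java 11 needed" >&2; exit 1; }.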

lfoppiano commented 3 weeks ago

Even with the API to create training data from PDF, you would have to access the files somehow to correct them, and move them to the respective directories. So a mounted volume is necessary.
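
As a rough illustration of that "move them to the respective directories" step, a sketch that copies corrected files with a given suffix into a corpus directory on the mounted volume. The function name is hypothetical, and the suffix and corpus path in the usage note are assumptions to check against your Grobid checkout.

```shell
#!/bin/sh
# Copy corrected training files whose names end in a given suffix from the
# createTraining output directory into a model's corpus directory.
# Both the suffix and the corpus path are assumptions; verify them against
# your Grobid checkout before use.
stage_training_files() {
    src=$1; suffix=$2; dest=$3
    mkdir -p "$dest"
    for f in "$src"/*"$suffix"; do
        [ -e "$f" ] || continue     # the glob matched nothing
        cp "$f" "$dest"/
    done
}
```

For the citation model this might look like stage_training_files ./output/grobid-training .training.references.tei.xml ./lib/grobid_0.8.1/grobid-trainer/resources/dataset/citation/corpus, assuming those are the file suffix and corpus location used by your version.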

Having said that, not having to deal with running things locally, finding the right JVM, etc. could indeed improve the experience. By the way, I'm wondering: JDK 21 should be able to run code built for JDK 11, shouldn't it?

cboulanger commented 3 weeks ago

> So a mounted volume is necessary.

Certainly. One would have to use mounts to get the datasets and models in and out of the container.

> By the way, I'm wondering: JDK 21 should be able to run code built for JDK 11, shouldn't it?

It did not work. I first tried upgrading the version in build.gradle, but the dependencies between the Java, Gradle, and Kotlin versions were such that it could not be made to work. I think upgrading will involve some changes (there were also warnings about "deprecated gradle commands" or something similar).

In any case, I have ~60 annotated articles involving footnotes (which are so far unsupported by Grobid) in the AnyStyle annotation format that I would love to contribute to the Grobid Ground Truth so that Grobid can perform better in the Humanities and Social Sciences. I'll first convert the datasets for the citation model only, because that is the easiest and we want to compare our own LLM-based extraction method against Grobid's.

cboulanger commented 3 weeks ago

Hi, following up on this: I am thinking of writing an apptainer build script from scratch instead of trying to work with the docker images, so that I can just use the source as it is and build and run it in the container. I am wondering: the Dockerfile including DL isn't just building the source with gradle, but performing very complex post-build operations. Why is that? Doesn't the repo include the Deep Learning stuff if you build it with ./gradlew clean install? How would you go about it if the image size does not matter and the build steps should be minimal, i.e. I would just do a local build with DL within the container?

lfoppiano commented 3 weeks ago

The DL image requires some additional libraries (Python-based, such as tensorflow) to be installed in a specific way (usually troublesome). If you can, it would probably be better to make your apptainer image starting from a docker image, as it will come with the problematic libraries already in the right place.
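
A minimal sketch of that route, printing the commands that convert a Docker Hub image into a SIF file. APPTAINER_CACHEDIR and APPTAINER_TMPDIR are the variables named in the apptainer note earlier in this thread; the scratch path, SIF name, and image tag are placeholders, and the function name is made up.

```shell
#!/bin/sh
# Print the commands that convert a Docker Hub image into a SIF file.
# The scratch directory, SIF name and image tag are placeholders; the
# cache and tmp directories must exist before the build is run.
sif_build_cmds() {
    scratch=$1; sif=$2; image=$3
    printf 'export APPTAINER_CACHEDIR=%s/cache\n' "$scratch"
    printf 'export APPTAINER_TMPDIR=%s/tmp\n' "$scratch"
    printf 'apptainer build %s docker://%s\n' "$sif" "$image"
}

# Dry run with example values.
sif_build_cmds /scratch/grobid grobid_0.8.1.sif grobid/grobid:0.8.1
```

Printing rather than executing keeps the sketch safe to try on a login node; the printed lines can then be pasted into a job script.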