kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Questions about running GROBID on an HPC cluster #1080

Open kyuhunl opened 7 months ago

kyuhunl commented 7 months ago

Hi, I am trying to use GROBID on my organization's HPC cluster. I have tried to use it on my own laptop (ARM MacBook), but I keep having trouble running the Docker image. Our cluster does not support running Docker images but does support Singularity. Will pulling GROBID as a Singularity image on the cluster work? Also, our HPC administrator recommends running GROBID in batch mode, contrary to your recommendation. What kind of issues should I expect when using batch mode instead of service mode?

lfoppiano commented 5 months ago

@kyuhunl, it should work on Singularity. However, I have never tried it, and I do not have access to an HPC service.

If you run it without Docker or Singularity, you might have trouble configuring the deep learning models. In general, batch mode is effective when you pass it a directory containing all the PDF documents in a single run. If you instead launch the batch command many times, it will spend a lot of time loading and unloading models; in that case, running the service on one or more normal servers might be more effective.
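
To illustrate the point about a single batch run, here is a minimal Python sketch that builds one batch command for a whole directory, so the models are loaded only once. The flags (`-gH`, `-dIn`, `-dOut`, `-exe processFullText`) follow the batch-mode invocation as I recall it from the GROBID documentation, so please check them against your installed version. The function name, the default heap size, and all paths are hypothetical.

```python
def build_batch_command(jar_path, grobid_home, input_dir, output_dir, heap="4G"):
    """Return an argv list for ONE batch run over every PDF in input_dir.

    Assumption: the flags below match the GROBID batch CLI of your version.
    The jar path is supplied by the caller, since its name depends on the release.
    """
    return [
        "java", f"-Xmx{heap}",
        "-jar", str(jar_path),
        "-gH", str(grobid_home),    # path to the grobid-home directory
        "-dIn", str(input_dir),     # directory holding all input PDFs
        "-dOut", str(output_dir),   # directory for the extracted output
        "-exe", "processFullText",
    ]

# Hypothetical usage: launch once per job, not once per PDF.
# import subprocess
# subprocess.run(build_batch_command("grobid-core-onejar.jar", "grobid-home",
#                                    "pdfs", "tei"), check=True)
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.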