kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Using the Grobid service with customized models, stored externally #884

Open lucbouge opened 2 years ago

lucbouge commented 2 years ago

Dear Luca, dear Patrice,

I am currently running an experiment at ANR on extracting the publications from the final reports of projects from recent years. I have a batch of more than 400 reports to process for the year 2016.

I managed to train Grobid on the specific format of those reports, which is hopefully not that different from a scientific paper. My question is now how to run the service.

According to the documentation, I started the service by running ./gradlew and used the Grobid Python client. It works fine, but it uses the models which are physically in the grobid repository, at grobid-home/models/. Using my own models therefore means modifying files tracked by Git, which prevents any further pull update.

Is there a way of activating the service with some external set of models? Sort of a --model /path/to/models option?

Regards, Luc.

kermitt2 commented 2 years ago

Hi @lucbouge !

I wish you a very good new year and I hope you're doing fine.

Normally you can point to an external grobid home by modifying the configuration file, see https://grobid.readthedocs.io/en/latest/Configuration/#grobid-home

The models under this external grobid-home/models/ will be used.
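As a rough illustration of the configuration change, here is a minimal Python sketch that rewrites the Grobid home entry in the text of a configuration file. It assumes the entry is a single YAML line of the form `grobidHome: "..."`; that key name, the sample configuration, and the path `/data/anr/grobid-home` are assumptions for illustration, so check them against the configuration page linked above. The helper `point_config_to_external_home` is hypothetical and not part of Grobid.

```python
import re


def point_config_to_external_home(config_text: str, external_home: str) -> str:
    """Return config_text with the first grobidHome value set to external_home.

    Assumption: the setting is written on one line as `grobidHome: "..."`.
    Anything after the key on that line (including a trailing comment) is replaced.
    """
    pattern = re.compile(r'^([ \t]*grobidHome:[ \t]*).*$', flags=re.MULTILINE)
    # A function is used as the replacement so that backslashes in the path
    # (for example on Windows) are not interpreted as regex escapes.
    new_text, count = pattern.subn(
        lambda m: f'{m.group(1)}"{external_home}"', config_text, count=1
    )
    if count == 0:
        raise ValueError("no grobidHome entry found in the configuration text")
    return new_text


if __name__ == "__main__":
    # Hypothetical configuration excerpt, not copied from a real Grobid release.
    sample = 'grobid:\n  grobidHome: "grobid-home"\n  temp: "tmp"\n'
    print(point_config_to_external_home(sample, "/data/anr/grobid-home"))
```

The external directory would need the same layout as the regular Grobid home, with the custom models under its models/ subdirectory.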

However, if you work with a fork and push your own models to it, you can normally still merge new updates from the original repository without trouble (see for instance https://www.c-sharpcorner.com/article/how-to-merge-upstream-repository-changes-with-your-fork-repository-using-git/). You might get conflicts if the same models are updated in the original repository, but you can still resolve them manually.
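The fork workflow boils down to three standard Git commands: add the original repository as a remote, fetch it, and merge its branch. The Python sketch below only builds and optionally runs those commands. The remote name `upstream`, the branch name `master`, and the repository URL are assumptions; the helper names are hypothetical.

```python
import subprocess
from typing import List


def upstream_merge_commands(
    upstream_url: str,
    remote: str = "upstream",   # assumed remote name
    branch: str = "master",     # assumed default branch of the original repository
    add_remote: bool = True,
) -> List[List[str]]:
    """Build the git commands that merge the original repository into a fork."""
    commands: List[List[str]] = []
    if add_remote:
        # Only needed once per clone; skip with add_remote=False afterwards.
        commands.append(["git", "remote", "add", remote, upstream_url])
    commands.append(["git", "fetch", remote])
    commands.append(["git", "merge", f"{remote}/{branch}"])
    return commands


def run_commands(commands: List[List[str]], repo_dir: str) -> None:
    """Run each command inside repo_dir, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Assumed URL of the original repository; printed rather than executed.
    for argv in upstream_merge_commands("https://github.com/kermitt2/grobid.git"):
        print(" ".join(argv))
```

If the merge stops on conflicting model files, Git leaves the working tree in a conflicted state, and the conflicts have to be resolved by hand before committing.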

In one of my Grobid forks, I have a mechanism to manage custom models (called "flavors"), but I haven't found the time to merge it yet. I think it would also answer your need.