kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.6k stars, 461 forks

Managing model specializations/variants (flavors) #1151

Open kermitt2 opened 3 months ago

kermitt2 commented 3 months ago

This PR introduces simple management of alternative models to use when processing a document. A model variant is, for example, an alternative header model trained with its own training data and labels (to cover documents whose header section differs from that of scholarly articles), or an alternative segmentation model for segmenting something other than scholarly papers.

To process a document with alternative model(s), we use a string called "flavor" to identify them. If a flavor is specified, each model uses its "flavor" variant if one exists, and the normal model if the flavor does not exist for that model (so falling back to the standard models).
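The fallback rule above can be sketched as follows. This is a minimal illustration, not Grobid's actual implementation: the function name `resolve_model` and the representation of available models as `(model, flavor)` pairs are assumptions made for the example.

```python
from typing import Optional, Set, Tuple

# Hypothetical representation: a model is identified by (model name, flavor),
# where flavor is None for the standard model.
ModelKey = Tuple[str, Optional[str]]


def resolve_model(model: str, flavor: Optional[str],
                  available: Set[ModelKey]) -> ModelKey:
    """Return the flavor variant if it exists, otherwise the standard model."""
    if flavor is not None and (model, flavor) in available:
        return (model, flavor)
    # No flavor requested, or no variant of this model for the flavor:
    # fall back to the standard model.
    return (model, None)


# Example: only the header model has an "sdo/ietf" variant.
available_models = {
    ("header", None),
    ("header", "sdo/ietf"),
    ("segmentation", None),
}
header_choice = resolve_model("header", "sdo/ietf", available_models)
segmentation_choice = resolve_model("segmentation", "sdo/ietf", available_models)
```

Here the header model resolves to its flavor variant, while the segmentation model, which has no "sdo/ietf" variant in this example, resolves to the standard model.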

Flavor model training data are always located in a sub-directory of the standard training data path. For example, for the flavor "sdo/ietf", the training data of the header model for this flavor will be under grobid-trainer/resources/dataset/header/sdo/ietf. The training data of the segmentation model for this flavor will be under grobid-trainer/resources/dataset/segmentation/sdo/ietf, and so on.
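The path convention can be illustrated with a short sketch. The helper name `training_data_dir` is hypothetical; only the directory layout comes from the description above.

```python
from pathlib import PurePosixPath
from typing import Optional

# Standard training data root, as given in the PR description.
DATASET_ROOT = PurePosixPath("grobid-trainer/resources/dataset")


def training_data_dir(model: str, flavor: Optional[str] = None) -> PurePosixPath:
    """Build the training data directory for a model and an optional flavor."""
    base = DATASET_ROOT / model
    # A flavor such as "sdo/ietf" simply becomes nested sub-directories.
    return base / flavor if flavor else base


header_dir = training_data_dir("header", "sdo/ietf")
segmentation_dir = training_data_dir("segmentation", "sdo/ietf")
default_header_dir = training_data_dir("header")
```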

To run Grobid with a particular flavor, we add the flavor name as an additional parameter of the service:

curl -v --form input=@./nihms834197.pdf --form "flavor=sdo/ietf" localhost:8070/api/processFulltextDocument
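The same request can be sketched in Python with only the standard library. This is an illustrative client, not part of the PR: the function name `build_fulltext_request` is hypothetical, and the only details taken from the command above are the endpoint, the "input" file field, and the "flavor" form field.

```python
import urllib.request
import uuid
from typing import Optional


def build_fulltext_request(pdf_bytes: bytes, filename: str,
                           flavor: Optional[str] = None,
                           base_url: str = "http://localhost:8070"
                           ) -> urllib.request.Request:
    """Build a multipart/form-data POST mirroring the curl command."""
    boundary = uuid.uuid4().hex
    crlf = b"\r\n"
    body = bytearray()

    # File part, equivalent to: --form input=@./file.pdf
    body += f"--{boundary}".encode() + crlf
    body += (f'Content-Disposition: form-data; name="input"; '
             f'filename="{filename}"').encode() + crlf
    body += b"Content-Type: application/pdf" + crlf + crlf
    body += pdf_bytes + crlf

    # Optional flavor part, equivalent to: --form "flavor=sdo/ietf"
    if flavor is not None:
        body += f"--{boundary}".encode() + crlf
        body += b'Content-Disposition: form-data; name="flavor"' + crlf + crlf
        body += flavor.encode() + crlf

    # Closing boundary
    body += f"--{boundary}--".encode() + crlf

    return urllib.request.Request(
        f"{base_url}/api/processFulltextDocument",
        data=bytes(body),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Building the request does not contact the server.
request = build_fulltext_request(b"%PDF-1.4", "nihms834197.pdf", flavor="sdo/ietf")
```

Sending it with `urllib.request.urlopen(request)` requires a running Grobid service on the given host and port.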

Grobid will then resolve the right models to use given the Grobid model hierarchy/cascade: the flavor ones if they exist, or the default ones if not.

In this branch, the "sdo/ietf" flavor is used only as an example; the corresponding flavor models are not actually functional.

lfoppiano commented 3 months ago

Thanks!

I'm going to test it directly in the branch for the light-segmentation models :-) and report back with any changes.

lfoppiano commented 4 days ago

Here is an updated figure showing the mechanism. It could be useful for the documentation; I have already added it to the updated documentation.

[Figure: updated diagram of the flavor mechanism]

coveralls commented 4 days ago

Coverage Status

coverage: 40.745% (-0.04%) from 40.781% when pulling aa108efdb6f4aa592fc4da4edc4f54da01c1e0e0 on flavor into f983165a76693dc51cbdc1eb0103ea67e6896863 on master.