Open kermitt2 opened 3 months ago
Thanks!
I'm going to test it directly in the branch for the light-segmentation models :-) and report back with any changes.
Here is an updated figure showing the mechanism, which could be useful for the documentation (which needs to be updated). I have already added it to the updated documentation.
This PR introduces simple management of alternative models to use when processing a document. A model variant is, for example, an alternative header model trained with its own training data and labels (to cover documents whose header section differs from that of scholarly articles), or an alternative segmentation model for segmenting something other than scholarly papers.
To process a document with alternative model(s), we use a string called "flavor" to identify them. If a flavor is specified, each selected model uses its "flavor" variant if one exists, and the normal model if the flavor does not exist for that model (thus falling back to the standard models).
Flavor model training data are always located in a sub-directory of the standard training data path. For example, for the flavor "sdo/ietf", the training data of the header model will be under grobid-trainer/resources/dataset/header/sdo/ietf, the training data of the segmentation model will be under grobid-trainer/resources/dataset/segmentation/sdo/ietf, and so on.
To run Grobid with a particular flavor, we add the flavor name as an additional parameter of the service:
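As a minimal sketch of such a call, the request could be assembled as below. It assumes the flavor is sent as a multipart form field named `flavor` alongside the usual `input` PDF field of the Grobid REST service. The field name `flavor`, the helper `build_flavor_request` and the file path are assumptions for illustration and are not confirmed by the text of this PR.

```python
from urllib.parse import urljoin


def build_flavor_request(base_url, endpoint, pdf_path, flavor=None):
    """Return (url, form_fields) describing a multipart POST to a Grobid service.

    Nothing is sent here; the function only assembles the request parts.
    The "flavor" field name is an assumption made for this sketch.
    """
    url = urljoin(base_url.rstrip("/") + "/", "api/" + endpoint)
    fields = {"input": pdf_path}      # the PDF to process
    if flavor:
        fields["flavor"] = flavor     # hypothetical parameter name
    return url, fields


url, fields = build_flavor_request(
    "http://localhost:8070",          # default local Grobid service address
    "processFulltextDocument",
    "./thefile.pdf",
    flavor="sdo/ietf",
)
print(url)
print(fields)
```

When no flavor is passed, the field is simply omitted, so the request is identical to a standard one.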
Grobid will then resolve the right models to use given the Grobid model hierarchy/cascade: the flavor ones if they exist, or the default ones if not.
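The fallback rule and the training-data path convention can be illustrated with a small sketch. This is not the code of this PR: the function names, the in-memory set of available flavor models and the "model/flavor" identifier string are all hypothetical, chosen only to show the per-model fallback to the default.

```python
from pathlib import PurePosixPath

DATASET_ROOT = PurePosixPath("grobid-trainer/resources/dataset")


def training_data_dir(model, flavor=None):
    """Training data path: the flavor is a sub-directory of the standard path."""
    base = DATASET_ROOT / model
    return base / flavor if flavor else base


def resolve_model(model, flavor, available_flavor_models):
    """Pick the flavor variant of a model if it exists, else the default model.

    available_flavor_models is a set of (model, flavor) pairs; in this sketch
    it stands in for however existing flavor models are actually detected.
    """
    if flavor and (model, flavor) in available_flavor_models:
        return f"{model}/{flavor}"    # hypothetical identifier for the variant
    return model                      # fall back to the standard model


# Assume only header and segmentation have an "sdo/ietf" variant.
available = {("header", "sdo/ietf"), ("segmentation", "sdo/ietf")}
cascade = ["segmentation", "header", "fulltext"]
resolved = [resolve_model(m, "sdo/ietf", available) for m in cascade]

print(training_data_dir("header", "sdo/ietf"))
print(resolved)
```

The decision is made model by model, so a flavor that only defines some models still works: the remaining models in the cascade use their standard versions.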
In this branch, the "sdo/ietf" flavor is used only as an example; the corresponding flavor models are not actually working.