kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Underlying ML architecture for the header model #747

Closed. igor261 closed this issue 3 years ago

igor261 commented 3 years ago

Hello,

I've been using the GROBID header model (v. 0.6.1).

In the documentation I found that a Wapiti CRF model is used by default, but in the config.json of the header model the model type is specified as "BidLSTM_CRF".

So which architecture is hidden behind the header model?

Best regards, Igor

kermitt2 commented 3 years ago

Hello @igor261 !

Yes, it was confusing to mix different models in the same directory. Since version 0.6.2, the models are in different directories depending on their architecture, e.g.:

models/header/
models/header-BidLSTM_CRF/
models/header-BidLSTM_CRF_FEATURES/
models/header-scibert/

header/ alone corresponds to the default CRF model (model.wapiti). In the future it might be renamed header-CRF/ for more clarity.

However, the model that is actually used has always been the one indicated in the GROBID configuration file.

igor261 commented 3 years ago

Hello @kermitt2,

Thank you for the fast response, I appreciate it!

So in the header directory of version 0.6.1 there are the following files:

While performing the n-fold cross-evaluation for the header model (as described here: https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/#n-folds-cross-evaluation), a .wapiti file is created for every fold.

So my understanding is that although the config.json indicates a BiLSTM-CRF model, the CRF model is used by default. In order to train a BiLSTM-CRF model, I have to follow this documentation (https://grobid.readthedocs.io/en/latest/Deep-Learning-models/#using-deep-learning-models-instead-of-default-crf), right?

kermitt2 commented 3 years ago

Hello !

Sorry if my first answer was not clear. In 0.6.1, all the models are in the same directory: model.wapiti is the CRF model and the remaining files belong to the BiLSTM-CRF model, so config.json only gives the configuration of the BiLSTM-CRF part and has no other global meaning.

So you'd better use the latest released version, 0.6.2, for something more organized. This latest version ships with 3 different header models in 3 different directories.

Then the actual model to be used is selected via the GROBID config file (grobid.properties, or config.yaml in the future version), with the default always set to the CRF ones. For instance, you would need to specify in this global config file the model to select for header processing prior to launching the n-fold cross-evaluation (or you could simply launch the n-fold cross-eval in DeLFT directly to stay in the "python" world).
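For illustration only, on the 0.6.x branch this selection goes through the sequence labelling engine switch in grobid-home/config/grobid.properties; a minimal sketch, assuming the property name documented on the Deep-Learning-models page (please verify it against your installed version):

# grobid-home/config/grobid.properties (sketch; property name assumed, check the documentation for your version)
# default engine: Wapiti CRF models
grobid.crf.engine=wapiti
# switch to the DeLFT deep learning models (e.g. BidLSTM_CRF) instead:
# grobid.crf.engine=delft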

> So my understanding is that although the config.json indicates a BiLSTM-CRF model, the CRF model is used by default.

Yes, the config.json files only refer to the deep learning model in the same directory (they give the parameters of that local DL model); they are not used to select which model GROBID loads among all the available ones. This selection is done in the global GROBID property file, as explained here.
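To give a rough idea of what such a local descriptor holds, here is a minimal sketch of a header config.json, assuming the usual DeLFT model-config fields; the values are purely illustrative and not taken from an actual release:

{
  "model_name": "header",
  "model_type": "BidLSTM_CRF",
  "embeddings_name": "glove-840B",
  "max_sequence_length": 3000,
  "batch_size": 10
}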

> In order to train a BiLSTM-CRF model, I have to follow this documentation (https://grobid.readthedocs.io/en/latest/Deep-Learning-models/#using-deep-learning-models-instead-of-default-crf), right?

Yes, you can also train it via https://github.com/kermitt2/delft#grobid-models by indicating the GROBID-generated training files on the command line of the script grobidTagger.py, which might be more flexible and easier to tune. Otherwise, GROBID uses DeLFT anyway for the training, via JEP, the native JNI-based Java/Python interpreter.
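As a sketch of what that DeLFT-side route could look like, assuming the option names from the DeLFT README (the model name and action positional arguments, --architecture, --input and --fold-count should be double-checked there) and a placeholder path for the GROBID-generated training file:

# run from the DeLFT root; trains a header model with the BidLSTM_CRF architecture
# and evaluates it with n-fold cross-validation (option names assumed, see the DeLFT README)
python3 grobidTagger.py header train_eval --architecture BidLSTM_CRF --input /path/to/grobid-header-training.data --fold-count 10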