kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Cannot load models trained with ELMo #991

Closed lfoppiano closed 1 year ago

lfoppiano commented 1 year ago

I tried both on Apple M1 and with Docker on Linux. The issue is similar in both cases, so let's take the Docker setup as the reference.

Configuration:


    - name: "citation"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 50
        nbMaxIterations: 3000
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
        #architecture: "scibert"
        useELMo: true
        embeddings_name: "glove-840B"
        runtime:
          # parameters used at runtime/prediction
          max_sequence_length: 3000
          batch_size: 20
        training:
          # parameters used for training
          max_sequence_length: 3000
          batch_size: 30

Error log. As you can see, the models without ELMo load without issues:

Mar 10 09:41:09 falcon docker[20621]: INFO  [2023-03-10 00:41:09,277] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for header with architecture BidLSTM_CRF_FEATURES...
Mar 10 09:41:12 falcon docker[20621]: load weights from /opt/grobid/grobid-home/models/header-BidLSTM_CRF_FEATURES/model_weights.hdf5
Mar 10 09:41:12 falcon docker[20621]: loading model weights /opt/grobid/grobid-home/models/header-BidLSTM_CRF_FEATURES/model_weights.hdf5
Mar 10 09:41:12 falcon docker[20621]: Model: "model_3"
Mar 10 09:41:12 falcon docker[20621]: __________________________________________________________________________________________________
Mar 10 09:41:12 falcon docker[20621]: Layer (type)                   Output Shape         Param #     Connected to
Mar 10 09:41:12 falcon docker[20621]: ==================================================================================================
Mar 10 09:41:12 falcon docker[20621]: features_input (InputLayer)    [(None, None, 22)]   0           []
Mar 10 09:41:12 falcon docker[20621]: char_input (InputLayer)        [(None, None, 30)]   0           []
Mar 10 09:41:12 falcon docker[20621]: features_embedding_td (TimeDis  (None, None, 22, 4)  1060       ['features_input[0][0]']
Mar 10 09:41:12 falcon docker[20621]: tributed)
Mar 10 09:41:12 falcon docker[20621]: time_distributed_6 (TimeDistri  (None, None, 30, 25  8275       ['char_input[0][0]']
Mar 10 09:41:12 falcon docker[20621]: buted)                         )
Mar 10 09:41:12 falcon docker[20621]: features_embedding_td_2 (TimeD  (None, None, 8)     288         ['features_embedding_td[0][0]']
Mar 10 09:41:12 falcon docker[20621]: istributed)
Mar 10 09:41:12 falcon docker[20621]: word_input (InputLayer)        [(None, None, 300)]  0           []
Mar 10 09:41:12 falcon docker[20621]: time_distributed_7 (TimeDistri  (None, None, 50)    10200       ['time_distributed_6[0][0]']
Mar 10 09:41:12 falcon docker[20621]: buted)
Mar 10 09:41:12 falcon docker[20621]: dropout_9 (Dropout)            (None, None, 8)      0           ['features_embedding_td_2[0][0]']
Mar 10 09:41:12 falcon docker[20621]: concatenate_3 (Concatenate)    (None, None, 358)    0           ['word_input[0][0]',
Mar 10 09:41:12 falcon docker[20621]: 'time_distributed_7[0][0]',
Mar 10 09:41:12 falcon docker[20621]: 'dropout_9[0][0]']
Mar 10 09:41:12 falcon docker[20621]: dropout_10 (Dropout)           (None, None, 358)    0           ['concatenate_3[0][0]']
Mar 10 09:41:12 falcon docker[20621]: bidirectional_11 (Bidirectiona  (None, None, 200)   367200      ['dropout_10[0][0]']
Mar 10 09:41:12 falcon docker[20621]: l)
Mar 10 09:41:12 falcon docker[20621]: dropout_11 (Dropout)           (None, None, 200)    0           ['bidirectional_11[0][0]']
Mar 10 09:41:12 falcon docker[20621]: length_input (InputLayer)      [(None, 1)]          0           []
Mar 10 09:41:12 falcon docker[20621]: dense_6 (Dense)                (None, None, 100)    20100       ['dropout_11[0][0]']
Mar 10 09:41:12 falcon docker[20621]: ==================================================================================================
Mar 10 09:41:12 falcon docker[20621]: Total params: 407,123
Mar 10 09:41:12 falcon docker[20621]: Trainable params: 407,123
Mar 10 09:41:12 falcon docker[20621]: Non-trainable params: 0
Mar 10 09:41:12 falcon docker[20621]: __________________________________________________________________________________________________
Mar 10 09:41:12 falcon docker[20621]: Model: "crf_model_wrapper_default_3"
Mar 10 09:41:12 falcon docker[20621]: _________________________________________________________________
Mar 10 09:41:12 falcon docker[20621]: Layer (type)                Output Shape              Param #
Mar 10 09:41:12 falcon docker[20621]: =================================================================
Mar 10 09:41:12 falcon docker[20621]: crf_3 (CRF)                 multiple                  5720
Mar 10 09:41:12 falcon docker[20621]: model_3 (Functional)        (None, None, 100)         407123
Mar 10 09:41:12 falcon docker[20621]: =================================================================
Mar 10 09:41:12 falcon docker[20621]: Total params: 412,843
Mar 10 09:41:12 falcon docker[20621]: Trainable params: 412,843
Mar 10 09:41:12 falcon docker[20621]: Non-trainable params: 0
Mar 10 09:41:12 falcon docker[20621]: _________________________________________________________________
Mar 10 09:41:12 falcon docker[20621]: INFO  [2023-03-10 00:41:12,378] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/date/model.wapiti (size: 102435)
Mar 10 09:41:12 falcon docker[20621]: [Wapiti] Loading model: "/opt/grobid/grobid-home/models/date/model.wapiti"
Mar 10 09:41:12 falcon docker[20621]: Model path: /opt/grobid/grobid-home/models/date/model.wapiti
Mar 10 09:41:12 falcon docker[20621]: INFO  [2023-03-10 00:41:12,385] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for citation with architecture BidLSTM_CRF_FEATURES...
Mar 10 09:41:12 falcon docker[20621]: ELMo weights used: /opt/elmo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5
Mar 10 09:41:12 falcon docker[20621]: ERROR [2023-03-10 00:41:12,388] org.grobid.core.jni.DeLFTModel: DeLFT model initialization failed.
Mar 10 09:41:12 falcon docker[20621]: ! jep.JepException: <class 'SystemExit'>: Error: either provide a path to a directory with the ELMo model individual options and weight file or to the model in a ZIP archive.
Mar 10 09:41:12 falcon docker[20621]: ! at /usr/local/lib/python3.8/dist-packages/delft/utilities/simple_elmo/elmo_helpers.load(elmo_helpers.py:131)
Mar 10 09:41:12 falcon docker[20621]: ! at /usr/local/lib/python3.8/dist-packages/delft/utilities/Embeddings.make_ELMo(Embeddings.py:320)
Mar 10 09:41:12 falcon docker[20621]: ! at /usr/local/lib/python3.8/dist-packages/delft/utilities/Embeddings.__init__(Embeddings.py:82)
Mar 10 09:41:12 falcon docker[20621]: ! at /usr/local/lib/python3.8/dist-packages/delft/sequenceLabelling/wrapper.load(wrapper.py:575)
Mar 10 09:41:12 falcon docker[20621]: ! at <string>.<module>(<string>:1)
Mar 10 09:41:12 falcon docker[20621]: ! at jep.Jep.eval(Native Method)
Mar 10 09:41:12 falcon docker[20621]: ! at jep.Jep.eval(Jep.java:312)
Mar 10 09:41:12 falcon docker[20621]: ! at org.grobid.core.jni.DeLFTModel$InitModel.run(DeLFTModel.java:65)
Mar 10 09:41:12 falcon docker[20621]: ! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
Mar 10 09:41:12 falcon docker[20621]: ! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Mar 10 09:41:12 falcon docker[20621]: ! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Mar 10 09:41:12 falcon docker[20621]: ! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Mar 10 09:41:12 falcon docker[20621]: ! at java.lang.Thread.run(Thread.java:750)
Mar 10 09:41:12 falcon docker[20621]: [Wapiti] Loading model: "/opt/grobid/grobid-home/models/fulltext/model.wapiti"
Mar 10 09:41:12 falcon docker[20621]: INFO  [2023-03-10 00:41:12,393] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/fulltext/model.wapiti (size: 26707735)
Mar 10 09:41:14 falcon docker[20621]: Model path: /opt/grobid/grobid-home/models/fulltext/model.wapiti
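The error message above asks for a directory holding both the ELMo options file and the weights file. A minimal sketch of a check for that is shown here. It assumes the options file sits next to the configured weights file and shares its prefix (`<prefix>_options.json` / `<prefix>_weights.hdf5`). The helper name `missing_elmo_files` is hypothetical and not part of DeLFT or GROBID.

```python
from pathlib import Path


def missing_elmo_files(weights_path):
    """Return the ELMo files that are absent, given the configured weights path.

    Hypothetical helper: it only checks the file system and does not call DeLFT.
    """
    weights = Path(weights_path)
    # Assumption: the options file lives beside the weights file and differs
    # only by its suffix ("_options.json" instead of "_weights.hdf5").
    options = weights.with_name(
        weights.name.replace("_weights.hdf5", "_options.json")
    )
    return [str(p) for p in (options, weights) if not p.is_file()]


if __name__ == "__main__":
    # Path taken from the "ELMo weights used" line in the log.
    configured = "/opt/elmo/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5"
    for path in missing_elmo_files(configured):
        print("missing:", path)
```

Running this inside the container should list whichever of the two files is not present at the configured location.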

The content of grobid-home:

[image: screenshot of the grobid-home contents]

lfoppiano commented 1 year ago

From a comment in the Dockerfile, we should add:

RUN python3 preload_embeddings.py --embedding elmo-en --registry ./resources-registry.json

to download the elmo embeddings, right?

kermitt2 commented 1 year ago

Hi @lfoppiano! Thank you for the issue. For some reason, I keep forgetting to add to DeLFT the automatic download of ELMo embeddings, as is done for the other embeddings (it should take 2-3 lines to add).

So for the moment it has to be done manually. For example, for the English ELMo:

cd /path/to/store/elmo
wget https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json
wget https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5
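For setups where `wget` is not available, the same manual step could be sketched in Python as follows. The two URLs are the ones given above. The function names `elmo_urls` and `download_elmo` are hypothetical, and skipping files that already exist is an assumption of this sketch, not DeLFT behavior.

```python
import urllib.request
from pathlib import Path

# Base URL and file prefix copied from the wget commands above.
BASE_URL = (
    "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/"
    "2x4096_512_2048cnn_2xhighway_5.5B/"
)
PREFIX = "elmo_2x4096_512_2048cnn_2xhighway_5.5B"


def elmo_urls():
    """Return the URLs of the two files fetched manually: options and weights."""
    return [f"{BASE_URL}{PREFIX}_{suffix}" for suffix in ("options.json", "weights.hdf5")]


def download_elmo(target_dir):
    """Download the ELMo files into target_dir, skipping files already present."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for url in elmo_urls():
        dest = target / url.rsplit("/", 1)[1]
        if not dest.exists():
            # Network call: fetches the file and writes it to dest.
            urllib.request.urlretrieve(url, str(dest))
    return target
```

Calling `download_elmo("/path/to/store/elmo")` would place both files in the same directory, which is the layout the error message asks for.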

preload_embeddings.py is meant more for preparing Docker images. I don't think it works with ELMo embeddings, only with word embeddings (I've never packaged ELMo in the Docker image).

To do (very simple):

Then it would work like the other embeddings.

kermitt2 commented 1 year ago

add the automatic ELMo embeddings download like for the other embeddings

see PR https://github.com/kermitt2/delft/pull/157

See also https://github.com/kermitt2/grobid/issues/946 for the same issue.