kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Deep Learning model for header needs to be updated #843

Closed: anafandon closed this issue 1 year ago

anafandon commented 2 years ago

Hi, and congrats on your great work with Grobid. I managed to set up Grobid with the docker image you provide, following this guide.

Everything works perfectly until I try to switch to the Deep Learning models. I set up the YAML file as described in the guide, and then I run the following command WITHOUT the --gpus all flag, since my (Linux) machine doesn't have a GPU:

sudo docker run --rm --init -p 8080:8070 -p 8081:8071 -v /home/ubuntu/grobid_client_python/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.7.1-SNAPSHOT

but the following error appears:

loading model weights /opt/grobid/grobid-home/models/affiliation-address-BidLSTM_CRF_FEATURES/model_weights.hdf5
2021-10-19 15:23:19.286134: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (-1)
INFO  [2021-10-19 15:23:19,662] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for name-header with architecture BidLSTM_CRF_FEATURES...
running thread: 23
loading model weights /opt/grobid/grobid-home/models/name-header-BidLSTM_CRF_FEATURES/model_weights.hdf5
running thread: 23
INFO  [2021-10-19 15:23:22,631] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for name-citation with architecture BidLSTM_CRF_FEATURES...
loading model weights /opt/grobid/grobid-home/models/name-citation-BidLSTM_CRF_FEATURES/model_weights.hdf5
INFO  [2021-10-19 15:23:25,761] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for header with architecture BidLSTM_CRF_FEATURES...
running thread: 23
loading model weights /opt/grobid/grobid-home/models/header-BidLSTM_CRF_FEATURES/model_weights.hdf5
running thread: 23
INFO  [2021-10-19 15:23:29,314] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for date with architecture BidLSTM_CRF_FEATURES...
loading model weights /opt/grobid/grobid-home/models/date-BidLSTM_CRF_FEATURES/model_weights.hdf5
running thread: 23
INFO  [2021-10-19 15:23:33,058] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for citation with architecture BidLSTM_CRF_FEATURES...
loading model weights /opt/grobid/grobid-home/models/citation-BidLSTM_CRF_FEATURES/model_weights.hdf5
INFO  [2021-10-19 15:23:37,327] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/fulltext/model.wapiti (size: 26707735)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/fulltext/model.wapiti"
Model path: /opt/grobid/grobid-home/models/fulltext/model.wapiti
INFO  [2021-10-19 15:23:41,123] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/segmentation/model.wapiti (size: 31133177)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/segmentation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/segmentation/model.wapiti
running thread: 23
INFO  [2021-10-19 15:23:46,156] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for reference-segmenter with architecture BidLSTM_CRF_FEATURES...
running thread: 23
INFO  [2021-10-19 15:23:46,158] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for figure with architecture BidLSTM_CRF_FEATURES...
running thread: 23
INFO  [2021-10-19 15:23:46,160] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for table with architecture BidLSTM_CRF_FEATURES...
INFO  [2021-10-19 15:23:46,162] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
ERROR [2021-10-19 15:24:09,635] org.grobid.core.jni.DeLFTModel: DeLFT model labelling via JEP failed
! jep.JepException: <class 'ValueError'>: Error when checking input: expected features_input to have shape (None, 22) but got array with shape (212, 21)
! at /usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.standardize_input_data(training_utils.py:138)
! at /usr/local/lib/python3.6/dist-packages/keras/engine/training._standardize_user_data(training.py:751)
! at /usr/local/lib/python3.6/dist-packages/keras/engine/training.predict_on_batch(training.py:1268)
! at /usr/local/lib/python3.6/dist-packages/delft/sequenceLabelling/tagger.tag(tagger.py:85)
! at /usr/local/lib/python3.6/dist-packages/delft/sequenceLabelling/wrapper.tag(wrapper.py:391)
! at <string>.<module>(<string>:1)
! at jep.Jep.getValue(Native Method)
! at jep.Jep.getValue(Jep.java:487)
! at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:127)
! at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:81)
! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
! at java.lang.Thread.run(Thread.java:748)
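
A note for anyone reproducing this: the cuInit failure near the top of the log is expected on a CPU-only machine, since TensorFlow simply falls back to CPU. The service itself can also be probed before sending documents; below is a minimal check with Python requests, assuming GROBID's standard isalive endpoint and the port mapping from the docker command above:

    import requests

    # the docker command above maps GROBID's service port 8070 to host port 8080
    resp = requests.get("http://localhost:8080/api/isalive", timeout=10)
    print(resp.status_code, resp.text)  # expect "200 true" once the models are loaded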
lfoppiano commented 2 years ago

Dear @anafandon, could you please share the configuration file?

anafandon commented 2 years ago

Hi @lfoppiano, and thanks a lot for the response. The only thing I changed in the YAML file is that I commented out the "engine: wapiti" lines and uncommented the "engine: delft" lines.

What are your thoughts?

Below is my yaml file:

# this is the configuration file for the GROBID instance

grobid:
  # where all the Grobid resources are stored (models, lexicon, native libraries, etc.), normally no need to change
  grobidHome: "grobid-home"

  # path relative to the grobid-home path (e.g. grobid-home/tmp)
  temp: "tmp"

  # normally nothing to change here, path relative to the grobid-home path (e.g. grobid-home/lib)
  nativelibrary: "lib"

  pdf:
    pdfalto:
      # path relative to the grobid-home path (e.g. grobid-home/pdfalto), you don't want to change this normally
      path: "pdfalto"
      # security for PDF parsing
      memoryLimitMb: 6096
      timeoutSec: 60

    # security relative to the PDF parsing result
    blocksMax: 100000
    tokensMax: 1000000

  consolidation:
    # define the bibliographical data consolidation service to be used, either "crossref" for CrossRef REST API or 
    # "glutton" for https://github.com/kermitt2/biblio-glutton
    #service: "crossref"
    service: "glutton"
    glutton:
      url: "https://cloud.science-miner.com/glutton"
      #url: "http://localhost:8080" 
    crossref:
      mailto: 
      # to use crossref web API, you need normally to use it politely and to indicate an email address here, e.g. 
      #mailto: "toto@titi.tutu"
      token:
      # to use Crossref metadata plus service (available by subscription)
      #token: "yourmysteriouscrossrefmetadataplusauthorizationtokentobeputhere"

  proxy:
    # proxy to be used when doing external call to the consolidation service
    host: 
    port: 

  # CORS configuration for the GROBID web API service
  corsAllowedOrigins: "*"
  corsAllowedMethods: "OPTIONS,GET,PUT,POST,DELETE,HEAD"
  corsAllowedHeaders: "X-Requested-With,Content-Type,Accept,Origin"

  # the actual implementation for language recognition to be used
  languageDetectorFactory: "org.grobid.core.lang.impl.CybozuLanguageDetectorFactory"

  # the actual implementation for optional sentence segmentation to be used (PragmaticSegmenter or OpenNLP)
  #sentenceDetectorFactory: "org.grobid.core.lang.impl.PragmaticSentenceDetectorFactory"
  sentenceDetectorFactory: "org.grobid.core.lang.impl.OpenNLPSentenceDetectorFactory"

  # maximum concurrency allowed to GROBID server for processing parallel requests - change it according to your CPU/GPU capacities
  # for a production server running only GROBID, set the value slightly above the available number of threads of the server
  # to get best performance and security
  concurrency: 10
  # when the pool is full, this is the maximum time (in seconds) a query will wait for a Grobid engine to become
  # available - normally never change it
  poolMaxWait: 1

  delft:
    # DeLFT global parameters
    # delft installation path if Deep Learning architectures are used to implement one of the sequence labeling models, 
    # embeddings are usually compiled as lmdb under delft/data (this parameter is ignored if only feature-engineered CRF models are used)
    install: "../delft"
    pythonVirtualEnv:

  wapiti:
    # Wapiti global parameters
    # number of threads for training the wapiti models (0 to use all available processors)
    nbThreads: 0

  models:
    # we configure here how each sequence labeling model should be implemented
    # for feature-engineered CRF, use "wapiti" and possible training parameters are window, epsilon and nbMaxIterations
    # for Deep Learning, use "delft" and select the target DL architecture (see the DeLFT library); the training 
    # parameters then depend on the selected DL architecture

    - name: "segmentation"
      # at this time, must always be CRF wapiti, the input sequence size is too large for a Deep Learning implementation
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0000001
        window: 50
        nbMaxIterations: 2000

    - name: "fulltext"
      # at this time, must always be CRF wapiti, the input sequence size is too large for a Deep Learning implementation
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0001
        window: 20
        nbMaxIterations: 1500

    - name: "header"
      #engine: "wapiti"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only  
        epsilon: 0.000001
        window: 30
        nbMaxIterations: 1500
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
        runtime:
          # parameters used at runtime/prediction
          max_sequence_length: 3000
          batch_size: 1

    - name: "reference-segmenter"
      #engine: "wapiti"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "name-header"
      #engine: "wapiti"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "name-citation"
      #engine: "wapiti"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "date"
      #engine: "wapiti"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "figure"
      #engine: "wapiti"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "table"
      #engine: "wapiti"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:  
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "affiliation-address"
      #engine: "wapiti"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "citation"
      #engine: "wapiti"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 50
        nbMaxIterations: 3000
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
        #architecture: "scibert"
        useELMo: false
        embeddings_name: "glove-840B"
        runtime:
          # parameters used at runtime/prediction
          max_sequence_length: 3000
          batch_size: 20
        training:
          # parameters used for training
          max_sequence_length: 3000  
          batch_size: 30

    - name: "patent-citation"
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0001
        window: 20

  # for **service only**: how to load the models, 
  # false -> models are loaded when needed (default), avoiding keeping unused models in memory, but significantly
  #          slowing down the service at first call
  # true -> all the models are loaded into memory at server startup; this slows the service start and unused models
  #         will take some memory, but the server is immediately warm and ready
  modelPreload: false

server:
    type: custom
    applicationConnectors:
    - type: http
      port: 8070
    adminConnectors:
    - type: http
      port: 8071
    registerDefaultExceptionMappers: false

logging:
  level: INFO
  loggers:
    org.apache.pdfbox.pdmodel.font.PDSimpleFont: "OFF"
  appenders:
    - type: console
      threshold: ALL
      timeZone: UTC
    - type: file
      currentLogFilename: logs/grobid-service.log
      threshold: ALL
      archive: true
      archivedLogFilenamePattern: logs/grobid-service-%d.log
      archivedFileCount: 5
      timeZone: UTC
lfoppiano commented 2 years ago

In brief, not all the models used in Grobid exist as Deep Learning models, and a few are not available in the architecture that is selected by default in the configuration file. We need to improve this part.

Meanwhile, could you please switch the following models back to wapiti:

and try again?

PS: for the sake of keeping the comments in order, I'm re-posting my answer here.

anafandon commented 2 years ago

@lfoppiano Sorry, I mistakenly tapped the "close with comment" button.

So, I did what you said but it had no effect; I get the same error. I played around a bit by switching back to wapiti the

* name: "header"

It worked, but I'd guess that was expected, since I was trying to extract the header: curl -v --form input=@./nogueira.pdf localhost:8080/api/processHeaderDocument

Another interesting issue that arose: "processFulltextDocument" stopped working completely.

ubuntu@stergiaras-react-dev:~/grobid_client_python/inpdfs$ curl -v --form input=@./nogueira.pdf localhost:8070/api/processFulltextDocument
*   Trying 127.0.0.1:8070...
* TCP_NODELAY set
* connect to 127.0.0.1 port 8070 failed: Connection refused
* Failed to connect to localhost port 8070: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 8070: Connection refused

And it starts working again only after I switch everything back to wapiti.
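
Two side observations on the curl calls above. First, the fulltext request targets localhost:8070, while the docker command at the top maps the service port 8070 to host port 8080; a request to host port 8070 would be refused regardless of the model configuration, unless the container was restarted with a different mapping. Second, for reference, here is the header request expressed in Python with requests (the file path is illustrative; the multipart field name "input" is the same one the curl calls use):

    import requests

    # same multipart field ("input") as the curl command above
    with open("nogueira.pdf", "rb") as f:
        resp = requests.post(
            "http://localhost:8080/api/processHeaderDocument",
            files={"input": f},
            timeout=60,
        )
    print(resp.status_code)   # 200 on success
    print(resp.text[:300])    # TEI/XML header on success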

lfoppiano commented 2 years ago

So, I did what you said but it had no effect; I get the same error. I played around a bit by switching back to wapiti the

* name: "header"

It worked, but I'd guess that was expected, since I was trying to extract the header: curl -v --form input=@./nogueira.pdf localhost:8080/api/processHeaderDocument

Mmm, that's correct. However, the header model should not pose any problem (I checked, and all the models and architectures are available)...

Anyway, considering also the problem with the fulltext extraction: could you restart the docker container with the new configuration (leaving the header as delft), run processFulltextDocument, and paste the log here again?

Two notes:

lfoppiano commented 2 years ago

For reference, I'm adding my configuration, which works with Deep Learning:

    - name: "segmentation"
      # at this time, must always be CRF wapiti, the input sequence size is too large for a Deep Learning implementation
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0000001
        window: 50
        nbMaxIterations: 2000

    - name: "fulltext"
      # at this time, must always be CRF wapiti, the input sequence size is too large for a Deep Learning implementation
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0001
        window: 20
        nbMaxIterations: 1500

    - name: "header"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only  
        epsilon: 0.000001
        window: 30
        nbMaxIterations: 1500
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
        runtime:
          # parameters used at runtime/prediction
          max_sequence_length: 3000
          batch_size: 1

    - name: "reference-segmenter"
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "name-header"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "name-citation"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "date"
      engine: "wapiti"
      #engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "figure"
      engine: "wapiti"
      #engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "table"
      engine: "wapiti"
      #engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:  
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "affiliation-address"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "citation"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 50
        nbMaxIterations: 3000
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
        #architecture: "scibert"
        useELMo: false
        embeddings_name: "glove-840B"
        runtime:
          # parameters used at runtime/prediction
          max_sequence_length: 3000
          batch_size: 20
        training:
          # parameters used for training
          max_sequence_length: 3000  
          batch_size: 30

    - name: "patent-citation"
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0001
        window: 20
anafandon commented 2 years ago

The configuration you provided still doesn't work for processHeaderDocument; it gives me the same error. The fulltext is working, though.

Everything else works like a charm with your configuration! :D:D:D

Thanks a lot for all the help @lfoppiano, and for your amazing work on grobid :)

kermitt2 commented 2 years ago

Just pointing to the documentation (for the sake of the documenting effort!):

Each model has its own configuration indicating:

- for Deep Learning models, which neural architecture to use, with choices normally among BidLSTM_CRF, 
BidLSTM_CRF_FEATURES, bert-base-en and scibert. The corresponding model/architecture combination 
needs to be available under grobid-home/models/. 

Except for the fulltext and segmentation models (whose input sequence size is too large for Deep Learning), all the Grobid models exist in a deep learning implementation, but not in every architecture.
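
A quick way to verify which model/architecture combinations are actually present is to list the model directories. A small sketch, assuming the container layout visible in the logs above:

    from pathlib import Path

    # model path as it appears in the container logs above; directory names
    # encode the model and architecture, e.g. "header-BidLSTM_CRF_FEATURES"
    models_dir = Path("/opt/grobid/grobid-home/models")
    for entry in sorted(models_dir.iterdir()):
        if entry.is_dir():
            print(entry.name)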

lfoppiano commented 2 years ago

Actually, I'm afraid there is a problem with the header model with DeLFT...

By switching it back to wapiti, or by using version 0.7.0, the problem disappears:

Oct 20 18:12:26 falcon docker[22147]: INFO  [2021-10-20 09:12:26,050] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
Oct 20 18:12:36 falcon docker[22147]: ERROR [2021-10-20 09:12:36,830] org.grobid.core.jni.DeLFTModel: DeLFT model labelling via JEP failed
Oct 20 18:12:36 falcon docker[22147]: ! jep.JepException: <class 'ValueError'>: Error when checking input: expected features_input to have shape (None, 22) but got array with shape (435, 21)
Oct 20 18:12:36 falcon docker[22147]: ! at /usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.standardize_input_data(training_utils.py:138)
Oct 20 18:12:36 falcon docker[22147]: ! at /usr/local/lib/python3.6/dist-packages/keras/engine/training._standardize_user_data(training.py:751)
Oct 20 18:12:36 falcon docker[22147]: ! at /usr/local/lib/python3.6/dist-packages/keras/engine/training.predict_on_batch(training.py:1268)
Oct 20 18:12:36 falcon docker[22147]: ! at /usr/local/lib/python3.6/dist-packages/delft/sequenceLabelling/tagger.tag(tagger.py:85)
Oct 20 18:12:36 falcon docker[22147]: ! at /usr/local/lib/python3.6/dist-packages/delft/sequenceLabelling/wrapper.tag(wrapper.py:391)
Oct 20 18:12:36 falcon docker[22147]: ! at <string>.<module>(<string>:1)
Oct 20 18:12:36 falcon docker[22147]: ! at jep.Jep.getValue(Native Method)
Oct 20 18:12:36 falcon docker[22147]: ! at jep.Jep.getValue(Jep.java:487)
Oct 20 18:12:36 falcon docker[22147]: ! at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:127)
Oct 20 18:12:36 falcon docker[22147]: ! at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:81)
Oct 20 18:12:36 falcon docker[22147]: ! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Oct 20 18:12:36 falcon docker[22147]: ! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Oct 20 18:12:36 falcon docker[22147]: ! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Oct 20 18:12:36 falcon docker[22147]: ! at java.lang.Thread.run(Thread.java:748)
Oct 20 18:12:40 falcon docker[22147]: 144.213.194.39 - - [20/Oct/2021:09:12:40 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 49415 "http://falcon.nims.go.jp/grobidl/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:93.0) Gecko/20100101 Firefox/93.0" 14572
lfoppiano commented 2 years ago

Except for the fulltext and segmentation models (whose input sequence size is too large for Deep Learning), all the Grobid models exist in a deep learning implementation, but not in every architecture.

I wrote "documentation", but I meant "documentation through configuration" 😅

kermitt2 commented 2 years ago

About the BidLSTM_CRF_FEATURES header model: it does indeed need to be updated in the current master and in the SNAPSHOT docker image. I removed a feature in July but did not update the model for this architecture... I only test everything when I make a stable release :/
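
In other words, the saved header model still declares 22 features per token while the current feature extractor emits only 21. A minimal sketch of the same class of failure in plain Keras, using the shapes from the logs above (the exact error wording depends on the Keras version):

    import numpy as np
    from tensorflow import keras

    # a model whose feature input was built for 22 features per token,
    # like the saved BidLSTM_CRF_FEATURES header model
    inp = keras.Input(shape=(None, 22), name="features_input")
    out = keras.layers.TimeDistributed(keras.layers.Dense(1))(inp)
    model = keras.Model(inp, out)

    # the runtime now produces only 21 features per token
    x = np.zeros((1, 212, 21), dtype="float32")
    model.predict_on_batch(x)  # ValueError: features_input expects last dim 22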