Closed — anafandon closed this issue 1 year ago
Dear @anafandon, could you please share the configuration file?
Hi @lfoppiano, and thanks a lot for the response. The only thing I changed in the yaml file is that I commented out the "engine wapiti" line and uncommented the "engine delft" line.
What are your thoughts?
Below is my yaml file:
```yaml
# this is the configuration file for the GROBID instance
grobid:
  # where all the Grobid resources are stored (models, lexicon, native libraries, etc.), normally no need to change
  grobidHome: "grobid-home"
  # path relative to the grobid-home path (e.g. grobid-home/tmp)
  temp: "tmp"
  # normally nothing to change here, path relative to the grobid-home path (e.g. grobid-home/lib)
  nativelibrary: "lib"
  pdf:
    pdfalto:
      # path relative to the grobid-home path (e.g. grobid-home/pdfalto), you don't want to change this normally
      path: "pdfalto"
      # security for PDF parsing
      memoryLimitMb: 6096
      timeoutSec: 60
    # security relative to the PDF parsing result
    blocksMax: 100000
    tokensMax: 1000000
  consolidation:
    # define the bibliographical data consolidation service to be used, either "crossref" for the CrossRef REST API or
    # "glutton" for https://github.com/kermitt2/biblio-glutton
    #service: "crossref"
    service: "glutton"
    glutton:
      url: "https://cloud.science-miner.com/glutton"
      #url: "http://localhost:8080"
    crossref:
      mailto:
      # to use the crossref web API, you normally need to use it politely and to indicate an email address here, e.g.
      #mailto: "toto@titi.tutu"
      token:
      # to use the Crossref metadata plus service (available by subscription)
      #token: "yourmysteriouscrossrefmetadataplusauthorizationtokentobeputhere"
  proxy:
    # proxy to be used when doing external calls to the consolidation service
    host:
    port:
  # CORS configuration for the GROBID web API service
  corsAllowedOrigins: "*"
  corsAllowedMethods: "OPTIONS,GET,PUT,POST,DELETE,HEAD"
  corsAllowedHeaders: "X-Requested-With,Content-Type,Accept,Origin"
  # the actual implementation for language recognition to be used
  languageDetectorFactory: "org.grobid.core.lang.impl.CybozuLanguageDetectorFactory"
  # the actual implementation for optional sentence segmentation to be used (PragmaticSegmenter or OpenNLP)
  #sentenceDetectorFactory: "org.grobid.core.lang.impl.PragmaticSentenceDetectorFactory"
  sentenceDetectorFactory: "org.grobid.core.lang.impl.OpenNLPSentenceDetectorFactory"
  # maximum concurrency allowed to the GROBID server for processing parallel requests - change it according to your CPU/GPU capacities
  # for a production server running only GROBID, set the value slightly above the available number of threads of the server
  # to get the best performance and security
  concurrency: 10
  # when the pool is full, for queries waiting for the availability of a Grobid engine, this is the maximum time to wait
  # to try to get an engine (in seconds) - normally never change it
  poolMaxWait: 1
  delft:
    # DeLFT global parameters
    # delft installation path if Deep Learning architectures are used to implement one of the sequence labeling models,
    # embeddings are usually compiled as lmdb under delft/data (this parameter is ignored if only feature-engineered CRF are used)
    install: "../delft"
    pythonVirtualEnv:
  wapiti:
    # Wapiti global parameters
    # number of threads for training the wapiti models (0 to use all available processors)
    nbThreads: 0
  models:
    # we configure here how each sequence labeling model should be implemented
    # for feature-engineered CRF, use "wapiti"; possible training parameters are window, epsilon and nbMaxIterations
    # for Deep Learning, use "delft" and select the target DL architecture (see the DeLFT library); the training
    # parameters then depend on this selected DL architecture
    - name: "segmentation"
      # at this time, must always be CRF wapiti, the input sequence size is too large for a Deep Learning implementation
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0000001
        window: 50
        nbMaxIterations: 2000
    - name: "fulltext"
      # at this time, must always be CRF wapiti, the input sequence size is too large for a Deep Learning implementation
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0001
        window: 20
        nbMaxIterations: 1500
    - name: "header"
      #engine: "wapiti"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.000001
        window: 30
        nbMaxIterations: 1500
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
        runtime:
          # parameters used at runtime/prediction
          max_sequence_length: 3000
          batch_size: 1
    - name: "reference-segmenter"
      #engine: "wapiti"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
    - name: "name-header"
      #engine: "wapiti"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
    - name: "name-citation"
      #engine: "wapiti"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
    - name: "date"
      #engine: "wapiti"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
    - name: "figure"
      #engine: "wapiti"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
    - name: "table"
      #engine: "wapiti"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
    - name: "affiliation-address"
      #engine: "wapiti"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
    - name: "citation"
      #engine: "wapiti"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 50
        nbMaxIterations: 3000
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
        #architecture: "scibert"
        useELMo: false
        embeddings_name: "glove-840B"
        runtime:
          # parameters used at runtime/prediction
          max_sequence_length: 3000
          batch_size: 20
        training:
          # parameters used for training
          max_sequence_length: 3000
          batch_size: 30
    - name: "patent-citation"
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0001
        window: 20
  # for **service only**: how to load the models,
  # false -> models are loaded when needed (default), avoiding putting useless models in memory, but significantly slowing
  # down the service at first call
  # true -> all the models are loaded into memory at server startup, which slows the start of the services, and models not
  # used will take some memory, but the server is immediately warm and ready
  modelPreload: false

server:
  type: custom
  applicationConnectors:
    - type: http
      port: 8070
  adminConnectors:
    - type: http
      port: 8071
  registerDefaultExceptionMappers: false

logging:
  level: INFO
  loggers:
    org.apache.pdfbox.pdmodel.font.PDSimpleFont: "OFF"
  appenders:
    - type: console
      threshold: ALL
      timeZone: UTC
    - type: file
      currentLogFilename: logs/grobid-service.log
      threshold: ALL
      archive: true
      archivedLogFilenamePattern: logs/grobid-service-%d.log
      archivedFileCount: 5
      timeZone: UTC
```
In brief, not all the models used in Grobid exist as deep learning implementations, and a few are not available in the architecture that is selected by default in the configuration file. We need to improve this part.
Meanwhile, could you please change the following models back to wapiti
and try again?
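For illustration, switching a model back to the CRF engine only means flipping its engine value; the "date" model is shown here purely as an example (apply the same change to each model concerned), with values taken from the configuration above:

```yaml
- name: "date"
  engine: "wapiti"
  #engine: "delft"
  delft:
    architecture: "BidLSTM_CRF_FEATURES"
```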
PS: for the sake of having the comments in order I'm re-posting my answer here.
@lfoppiano Sorry, I mistakenly tapped the "close with comment" button.
So, I did what you said, but it had no effect; I get the same error. I played around a bit by changing the header model back to wapiti.
It worked, but I'd guess that was expected, since I only tried to extract the header:
curl -v --form input=@./nogueira.pdf localhost:8080/api/processHeaderDocument
Another interesting issue that arose is that "processFulltextDocument" stopped working completely.
```
ubuntu@stergiaras-react-dev:~/grobid_client_python/inpdfs$ curl -v --form input=@./nogueira.pdf localhost:8070/api/processFulltextDocument
*   Trying 127.0.0.1:8070...
* TCP_NODELAY set
* connect to 127.0.0.1 port 8070 failed: Connection refused
* Failed to connect to localhost port 8070: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 8070: Connection refused
```
And it started working again only after I switched everything back to wapiti.
> So, I did what you said, but it had no effect; I get the same error. I played around a bit by changing back to wapiti the
> - name: "header"
> It worked, but I'd guess that was expected, since I only tried to extract the header:
> curl -v --form input=@./nogueira.pdf localhost:8080/api/processHeaderDocument
mmm, that's correct; however, the header model should not pose any problem (I checked, and all the models and architectures are available)...
Anyway, there is also the problem with the fulltext extraction to consider.
Could you restart the Docker container with the new configuration (leaving the header as delft), run processFulltextDocument,
and paste the log here again?
Two notes:
For reference, I add my configuration, which works with Deep Learning:
```yaml
- name: "segmentation"
  # at this time, must always be CRF wapiti, the input sequence size is too large for a Deep Learning implementation
  engine: "wapiti"
  wapiti:
    # wapiti training parameters, they will be used at training time only
    epsilon: 0.0000001
    window: 50
    nbMaxIterations: 2000
- name: "fulltext"
  # at this time, must always be CRF wapiti, the input sequence size is too large for a Deep Learning implementation
  engine: "wapiti"
  wapiti:
    # wapiti training parameters, they will be used at training time only
    epsilon: 0.0001
    window: 20
    nbMaxIterations: 1500
- name: "header"
  engine: "delft"
  wapiti:
    # wapiti training parameters, they will be used at training time only
    epsilon: 0.000001
    window: 30
    nbMaxIterations: 1500
  delft:
    # deep learning parameters
    architecture: "BidLSTM_CRF_FEATURES"
    runtime:
      # parameters used at runtime/prediction
      max_sequence_length: 3000
      batch_size: 1
- name: "reference-segmenter"
  engine: "wapiti"
  wapiti:
    # wapiti training parameters, they will be used at training time only
    epsilon: 0.00001
    window: 20
  delft:
    # deep learning parameters
    architecture: "BidLSTM_CRF_FEATURES"
- name: "name-header"
  engine: "delft"
  delft:
    # deep learning parameters
    architecture: "BidLSTM_CRF_FEATURES"
- name: "name-citation"
  engine: "delft"
  delft:
    # deep learning parameters
    architecture: "BidLSTM_CRF_FEATURES"
- name: "date"
  engine: "wapiti"
  #engine: "delft"
  delft:
    # deep learning parameters
    architecture: "BidLSTM_CRF_FEATURES"
- name: "figure"
  engine: "wapiti"
  #engine: "delft"
  wapiti:
    # wapiti training parameters, they will be used at training time only
    epsilon: 0.00001
    window: 20
  delft:
    # deep learning parameters
    architecture: "BidLSTM_CRF_FEATURES"
- name: "table"
  engine: "wapiti"
  #engine: "delft"
  wapiti:
    # wapiti training parameters, they will be used at training time only
    epsilon: 0.00001
    window: 20
  delft:
    # deep learning parameters
    architecture: "BidLSTM_CRF_FEATURES"
- name: "affiliation-address"
  engine: "delft"
  delft:
    # deep learning parameters
    architecture: "BidLSTM_CRF_FEATURES"
- name: "citation"
  engine: "delft"
  wapiti:
    # wapiti training parameters, they will be used at training time only
    epsilon: 0.00001
    window: 50
    nbMaxIterations: 3000
  delft:
    # deep learning parameters
    architecture: "BidLSTM_CRF_FEATURES"
    #architecture: "scibert"
    useELMo: false
    embeddings_name: "glove-840B"
    runtime:
      # parameters used at runtime/prediction
      max_sequence_length: 3000
      batch_size: 20
    training:
      # parameters used for training
      max_sequence_length: 3000
      batch_size: 30
- name: "patent-citation"
  engine: "wapiti"
  wapiti:
    # wapiti training parameters, they will be used at training time only
    epsilon: 0.0001
    window: 20
```
The configuration you provided still doesn't work for processHeaderDocument; it gives me the same error. The fulltext is working, though.
Everything works like a charm with your configuration! :D:D:D
Thanks a lot for all the help @lfoppiano, and for your amazing work on grobid :)
Just pointing to the documentation (for the sake of the documenting effort!):
> Each model has its own configuration indicating:
> - for Deep Learning models, which neural architecture to be used, with choices normally among BidLSTM_CRF,
> BidLSTM_CRF_FEATURES, bert-base-en and scibert. The corresponding model/architecture combination
> needs to be available under grobid-home/models/.
Except for the fulltext and segmentation models (too large a sequence input size), all the Grobid models exist in a deep learning implementation, but not in every architecture.
Actually, I'm afraid there is a problem with the header model with DeLFT...
By switching it back to wapiti, or using version 0.7.0, the problem disappears:
```
Oct 20 18:12:26 falcon docker[22147]: INFO [2021-10-20 09:12:26,050] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
Oct 20 18:12:36 falcon docker[22147]: ERROR [2021-10-20 09:12:36,830] org.grobid.core.jni.DeLFTModel: DeLFT model labelling via JEP failed
Oct 20 18:12:36 falcon docker[22147]: ! jep.JepException: <class 'ValueError'>: Error when checking input: expected features_input to have shape (None, 22) but got array with shape (435, 21)
Oct 20 18:12:36 falcon docker[22147]: ! at /usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.standardize_input_data(training_utils.py:138)
Oct 20 18:12:36 falcon docker[22147]: ! at /usr/local/lib/python3.6/dist-packages/keras/engine/training._standardize_user_data(training.py:751)
Oct 20 18:12:36 falcon docker[22147]: ! at /usr/local/lib/python3.6/dist-packages/keras/engine/training.predict_on_batch(training.py:1268)
Oct 20 18:12:36 falcon docker[22147]: ! at /usr/local/lib/python3.6/dist-packages/delft/sequenceLabelling/tagger.tag(tagger.py:85)
Oct 20 18:12:36 falcon docker[22147]: ! at /usr/local/lib/python3.6/dist-packages/delft/sequenceLabelling/wrapper.tag(wrapper.py:391)
Oct 20 18:12:36 falcon docker[22147]: ! at <string>.<module>(<string>:1)
Oct 20 18:12:36 falcon docker[22147]: ! at jep.Jep.getValue(Native Method)
Oct 20 18:12:36 falcon docker[22147]: ! at jep.Jep.getValue(Jep.java:487)
Oct 20 18:12:36 falcon docker[22147]: ! at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:127)
Oct 20 18:12:36 falcon docker[22147]: ! at org.grobid.core.jni.DeLFTModel$LabelTask.call(DeLFTModel.java:81)
Oct 20 18:12:36 falcon docker[22147]: ! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Oct 20 18:12:36 falcon docker[22147]: ! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Oct 20 18:12:36 falcon docker[22147]: ! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Oct 20 18:12:36 falcon docker[22147]: ! at java.lang.Thread.run(Thread.java:748)
Oct 20 18:12:40 falcon docker[22147]: 144.213.194.39 - - [20/Oct/2021:09:12:40 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 49415 "http://falcon.nims.go.jp/grobidl/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:93.0) Gecko/20100101 Firefox/93.0" 14572
```
> Except for the fulltext and segmentation models (too large a sequence input size), all the Grobid models exist in a deep learning implementation, but not in every architecture.
I wrote documentation, but I meant "documentation through configuration" 😅
About the BidLSTM_CRF_FEATURES header model, it indeed needs to be updated in the current master and in the SNAPSHOT docker image. I removed a feature in July but did not update the model with this architecture... I only test everything when I make a stable release :/
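For context, the JEP error in the log above is the typical symptom of such a feature/model mismatch: the saved Keras model declares an input of shape (None, 22), i.e. 22 features per token, while the updated feature extractor now produces only 21. A minimal sketch of the kind of validation that fails (illustrative only, not DeLFT's actual code; the feature counts are taken from the log):

```python
import numpy as np

def standardize_input(batch: np.ndarray, expected_features: int) -> np.ndarray:
    """Mimic Keras' input-shape validation: a trained model fixes the feature
    dimension, so any drift in the feature extractor breaks inference."""
    if batch.shape[-1] != expected_features:
        raise ValueError(
            f"expected features_input to have shape (None, {expected_features}) "
            f"but got array with shape {batch.shape}"
        )
    return batch

# a model trained with 22 features per token accepts a matching batch...
ok = standardize_input(np.zeros((435, 22)), expected_features=22)

# ...but rejects a batch from a feature extractor that dropped one feature
try:
    standardize_input(np.zeros((435, 21)), expected_features=22)
except ValueError as e:
    print(e)
```

This is why retraining (or re-exporting) the model after changing the feature set is required: the mismatch is only caught at prediction time.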
Hi, and congrats on your great work with Grobid. I managed to set up Grobid with the docker image you provide in this guide.
Everything works perfectly until I want to switch to the deep learning models. I set up the yaml file as you mentioned in the guide, and then I ran the following command WITHOUT the --gpus all flag, since my machine (Linux) doesn't have GPUs,
but the following error pops out: