kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Python Grobid client gives incorrect results while curl gives correct ones. #886

Closed anafandon closed 2 years ago

anafandon commented 2 years ago

Dear grobid team,

I hope you are good and healthy. I'll jump straight to the problem.

INFO

version_used: docker image grobid/grobid:0.7.0

PROBLEM

For several PDFs the python grobid client gives incorrect results (e.g. it returns the title "Understanding Energy Absorption Behaviors of Nanoporous Materials" instead of "Understanding the Behaviors of BERT in Ranking"), while when I test with the curl request curl -v --form input=@./understanding_bert.pdf localhost:8070/api/processHeaderDocument I get the proper title "Understanding the Behaviors of BERT in Ranking".
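
For reference, the same curl call can be reproduced from python with plain requests (a minimal sketch of my own, not the grobid client's code):

import requests

# POST the PDF to the header extraction endpoint, mirroring the curl call above;
# assumes a local GROBID server on port 8070
with open("understanding_bert.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8070/api/processHeaderDocument",
        files={"input": f},
    )
resp.raise_for_status()
print(resp.text)  # the TEI XML with the extracted header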

LOGS

Two things strike me as weird: (1) I am getting a warning for an invalid header cookie, and I don't know if this is causing the problem. (2) Whenever I use the python client and check the logs after a request, I can see that the title found is actually the proper one:

INFO  [2022-01-20 18:45:44,103] org.grobid.core.utilities.crossref.CrossrefRequestTask:  (,query.title=Understanding the Behaviors of BERT in Ranking,rows=1,query.author=Qiao): .. executing
WARN  [2022-01-20 18:45:44,597] org.apache.http.client.protocol.ResponseProcessCookies: Invalid cookie header: "set-cookie: AWSALB=OgNuTVhb9PNA2Q9y1EmDUOeuLFKPgRUCsBkKDMm1LewasrQYL638y23ysQmhLSlunemMEX9mDG46hOvcCIiSuHaRFKv04VCE4u5bk139B7Jqbg7pPGQ3+JN7afbQ; Expires=Thu, 27 Jan 2022 18:45:44 GMT; Path=/". Invalid 'expires' attribute: Thu, 27 Jan 2022 18:45:44 GMT

Yet the TEI file returned has the wrong title:

<teiHeader xml:lang="en">
        <fileDesc>
            <titleStmt>
                <title level="a" type="main">Understanding Energy Absorption Behaviors of Nanoporous Materials</title>
            </titleStmt>
            <publicationStmt>
                <publisher>Defense Technical Information Center</publisher>
                <availability status="unknown"><p>Copyright Defense Technical Information Center</p>
                </availability>
                <date type="published" when="2008-05-23">2008-05-23</date>

Whereas whenever I do a curl request, the logs look normal:

WARN  [2022-01-20 19:15:05,974] org.grobid.core.utilities.GrobidProperties: No configuration parameter defined for DeLFT engine for model fulltext
WARN  [2022-01-20 19:15:05,974] org.grobid.core.utilities.GrobidProperties: No configuration parameter defined for DeLFT engine for model segmentation
INFO  [2022-01-20 19:15:05,974] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/6
127.0.0.1 - - [20/Jan/2022:19:15:06 +0000] "POST /api/processHeaderDocument HTTP/1.1" 200 2895 "-" "curl/7.64.1" 351

and the TEI file returned has the correct title:

<fileDesc>
            <titleStmt>
                <title level="a" type="main">Understanding the Behaviors of BERT in Ranking</title>
            </titleStmt>
            <publicationStmt>
                <publisher/>
                <availability status="unknown"><licence/></availability>
            </publicationStmt>
            <sourceDesc>

YAML FILE

Also, in the yaml file I am using the consolidation service "crossref" (I mention this since I think it might have something to do with it):

grobid:
  # where all the grobid resources are stored (models, lexicon, native libraries, etc.), normally no need to change
  grobidHome: "grobid-home"

  # path relative to the grobid-home path (e.g. grobid-home/tmp)
  temp: "tmp"

  # normally nothing to change here, path relative to the grobid-home path (e.g. grobid-home/lib)
  nativelibrary: "lib"

  pdf:
    pdfalto:
      # path relative to the grobid-home path (e.g. grobid-home/pdfalto), you don't want to change this normally
      path: "pdfalto"
      # security for PDF parsing
      memoryLimitMb: 16096
      timeoutSec: 60

    # security relative to the PDF parsing result
    blocksMax: 100000
    tokensMax: 1000000

  consolidation:
    # define the bibliographical data consolidation service to be used, either "crossref" for CrossRef REST API or
    # "glutton" for https://github.com/kermitt2/biblio-glutton
    service: "crossref" #we use crossref otherwise after some hundreds of call will go in time out
    #service: "glutton"
    glutton:
      url: "https://cloud.science-miner.com/glutton"
      #url: "http://localhost:8080"
    crossref:
      mailto: "mycoolemail@superlolakos.com"
      # to use crossref web API, you need normally to use it politely and to indicate an email address here, e.g.
      #mailto: "toto@titi.tutu"
      token:
      # to use Crossref metadata plus service (available by subscription)
      #token: "yourmysteriouscrossrefmetadataplusauthorizationtokentobeputhere"

  proxy:
    # proxy to be used when doing external call to the consolidation service
    host:
    port:

  # CORS configuration for the GROBID web API service
  corsAllowedOrigins: "*"
  corsAllowedMethods: "OPTIONS,GET,PUT,POST,DELETE,HEAD"
  corsAllowedHeaders: "X-Requested-With,Content-Type,Accept,Origin"

  # the actual implementation for language recognition to be used
  languageDetectorFactory: "org.grobid.core.lang.impl.CybozuLanguageDetectorFactory"

  # the actual implementation for optional sentence segmentation to be used (PragmaticSegmenter or OpenNLP)
  #sentenceDetectorFactory: "org.grobid.core.lang.impl.PragmaticSentenceDetectorFactory"
  sentenceDetectorFactory: "org.grobid.core.lang.impl.OpenNLPSentenceDetectorFactory"

  # maximum concurrency allowed to GROBID server for processing parallel requests - change it according to your CPU/GPU capacities
  # for a production server running only GROBID, set the value slightly above the available number of threads of the server
  # to get best performance and security

  concurrency: 6
  # when the pool is full, for queries waiting for the availability of a grobid engine, this is the maximum time to wait to try
  # to get an engine (in seconds) - normally never change it
  poolMaxWait: 1

  delft:
    # DeLFT global parameters
    # delft installation path if Deep Learning architectures are used to implement one of the sequence labeling models,
    # embeddings are usually compiled as lmdb under delft/data (this parameter is ignored if only feature-engineered CRF models are used)
    install: "../delft"
    pythonVirtualEnv:

  wapiti:
    # Wapiti global parameters
    # number of threads for training the wapiti models (0 to use all available processors)
    nbThreads: 0

  models:
    # we configure here how each sequence labeling model should be implemented
    # for feature-engineered CRF, use "wapiti" and possible training parameters are window, epsilon and nbMaxIterations
    # for Deep Learning, use "delft" and select the target DL architecture (see DeLFT library), the training
    # parameters then depends on this selected DL architecture

    - name: "segmentation"
      # at this time, must always be CRF wapiti, the input sequence size is too large for a Deep Learning implementation
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0000001
        window: 50
        nbMaxIterations: 2000

    - name: "fulltext"
      # at this time, must always be CRF wapiti, the input sequence size is too large for a Deep Learning implementation
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0001
        window: 20
        nbMaxIterations: 1500

    - name: "header"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.000001
        window: 30
        nbMaxIterations: 1500
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
        runtime:
          # parameters used at runtime/prediction
          max_sequence_length: 3000
          batch_size: 1

    - name: "reference-segmenter"
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "name-header"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "name-citation"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "date"
      engine: "wapiti"
      #engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "figure"
      engine: "wapiti"
      #engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "table"
      engine: "wapiti"
      #engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 20
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "affiliation-address"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"

    - name: "citation"
      engine: "delft"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.00001
        window: 50
        nbMaxIterations: 3000
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
        #architecture: "scibert"
        useELMo: false
        embeddings_name: "glove-840B"
        runtime:
          # parameters used at runtime/prediction
          max_sequence_length: 3000
          batch_size: 20
        training:
          # parameters used for training
          max_sequence_length: 3000
          batch_size: 30

    - name: "patent-citation"
      engine: "wapiti"
      wapiti:
        # wapiti training parameters, they will be used at training time only
        epsilon: 0.0001
        window: 20
  # for **service only**: how to load the models,
  # false -> models are loaded when needed (default), avoiding keeping unused models in memory, but significantly slowing
  #          down the service at first call
  # true -> all the models are loaded into memory at server startup, which slows the start of the service, and unused models
  #         will take some memory, but the server is immediately warm and ready
  modelPreload: true

server:
    type: custom
    applicationConnectors:
    - type: http
      port: 8070
    adminConnectors:
    - type: http
      port: 8071
    registerDefaultExceptionMappers: false

logging:
  level: INFO
  loggers:
    org.apache.pdfbox.pdmodel.font.PDSimpleFont: "OFF"
  appenders:
    - type: console
      threshold: ALL
      timeZone: UTC
    - type: file
      currentLogFilename: logs/grobid-service.log
      threshold: ALL
      archive: true
      archivedLogFilenamePattern: logs/grobid-service-%d.log
      archivedFileCount: 5
      timeZone: UTC

QUESTION

What do you think is happening? :)

lfoppiano commented 2 years ago

@anafandon thank you for the interesting case.

I did not manage to test it myself; however, the cause seems to be related to the consolidation service. The grobid client is requesting header consolidation (could you please confirm that?), and grobid uses crossref, which in this case seems to return the wrong result.

In fact, if I use the title + first author last name on crossref I obtain the same wrong result as you are pointing out: https://search.crossref.org/?from_ui=&q=Understanding+the+Behaviors+of+BERT+in+Ranking+Qiao
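
The same lookup can also be reproduced directly against the Crossref REST API, using the query parameters visible in your log line above (a quick sketch with requests):

import requests

# Reproduce the consolidation lookup shown in the GROBID log
# against the public Crossref REST API
params = {
    "query.title": "Understanding the Behaviors of BERT in Ranking",
    "query.author": "Qiao",
    "rows": 1,
}
resp = requests.get("https://api.crossref.org/works", params=params)
resp.raise_for_status()
items = resp.json()["message"]["items"]
print(items[0].get("title"))  # the first-ranked match, possibly the wrong paper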

You could test by calling grobid via cURL using the consolidation service (ref): curl -v --form consolidateHeader=1 --form input=@./understanding_bert.pdf localhost:8070/api/processHeaderDocument and see whether you obtain the same wrong result.

In general, if you are processing large amounts of PDF documents, we recommend using biblio-glutton instead of Crossref: it offers slightly better results (ref) and fewer traffic limitations than the overloaded Crossref service.

anafandon commented 2 years ago

Hi @lfoppiano, thanks a lot for the instant response.

I checked the curl command you suggested, and now I get the wrong title via curl as well. And indeed, setting consolidateHeader to 0 in my python API calls gives me the correct result I want.
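
For anyone landing here later, with the python client this looks roughly like the following (a sketch; the consolidate_header parameter name follows the grobid_client_python README, so double-check it against the client version you have installed):

from grobid_client.grobid_client import GrobidClient

# Process a directory of PDFs with header consolidation disabled,
# equivalent to passing consolidateHeader=0 to the web API
client = GrobidClient(config_path="./config.json")
client.process(
    "processHeaderDocument",
    "./pdfs",                  # input directory with the PDFs
    output="./out",            # where the TEI results are written
    consolidate_header=False,  # do not ask Crossref to consolidate the header
)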

In fact, this is exactly what I needed; I never wanted to use CrossRef. I just needed grobid to give me the results without any consolidation at all, since I have my own system for that.

I consider my issue closed.

Thanks a lot, and again, congrats on all the hard work you've put into grobid so far!

Cheers

kermitt2 commented 2 years ago

Hi @anafandon

To complement Luca's answer:

Drawback of biblio-glutton: it is very heavy to install because it indexes the whole CrossRef metadata

anafandon commented 2 years ago

Thanks a lot for your response as well @kermitt2 :)

I actually read the documentation quite a few times, since it is well written. The problem is that if you are a newbie it takes time to "digest" the concept of consolidation and to set your expectations accordingly. The info is there, but your brain takes time to comprehend it!

Regarding biblio-glutton, I actually already have a huge elasticsearch cluster with millions of metadata records, so I can do the consolidation myself. Though, what would be helpful on biblio-glutton's github is if you could point me to the file where you actually perform the elasticsearch query for finding the proper reference. I tried to find it but couldn't, because I am mostly native to python.
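
To illustrate what I mean, this is purely my own sketch in python (not biblio-glutton's actual Java code; the index name "crossref" and the field names are hypothetical), using the elasticsearch python client's 8.x-style search call:

from elasticsearch import Elasticsearch

# Hypothetical title + first-author matching query of the kind a
# bibliographic consolidation service might run; index and field
# names are illustrative, not the ones biblio-glutton actually uses
es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="crossref",
    size=1,
    query={
        "bool": {
            "must": [
                {"match": {"title": "Understanding the Behaviors of BERT in Ranking"}}
            ],
            "should": [
                {"match": {"first_author": "Qiao"}}
            ],
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))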