kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

`processHeaderDocument` returns BibTeX by default instead of TEI #1093

Open michamos opened 8 months ago

michamos commented 8 months ago

Hi, I noticed that, at least since v0.7.3, GROBID returns BibTeX by default for /api/processHeaderDocument. This contradicts the documentation at https://grobid.readthedocs.io/en/latest/Grobid-service/#apiprocessheaderdocument, which states that the default is TEI XML and that BibTeX is only returned when the request carries an Accept: application/x-bibtex header.

Note that it's possible to get an XML response by using Accept: application/xml.
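A minimal Python sketch of that workaround, using only the standard library. The helper name `build_header_request` and the placeholder file name `paper.pdf` are assumptions made for illustration; the endpoint URL, the `input` form field and the `Accept` values are the ones mentioned in this report.

```python
import urllib.request
import uuid

# Demo endpoint used in the steps below; point this at your own GROBID instance.
GROBID_URL = "https://kermitt2-grobid.hf.space/api/processHeaderDocument"


def build_header_request(pdf_bytes, accept="application/xml", url=GROBID_URL):
    """Build a multipart POST that states the wanted response format explicitly.

    Hypothetical helper, not part of any GROBID client library.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # "input" is the form field name used by the curl command in this report.
        b'Content-Disposition: form-data; name="input"; filename="paper.pdf"\r\n',
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        # Never rely on the server-side default: always name the format.
        "Accept": accept,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    with open("Downloads/2212.12604v1.pdf", "rb") as f:
        request = build_header_request(f.read())
    with urllib.request.urlopen(request) as response:
        print(response.read().decode("utf-8")[:500])
```

Passing `accept="application/x-bibtex"` instead should request BibTeX, as described in the documentation linked above.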

Steps to reproduce

  1. Get a PDF (I used https://arxiv.org/pdf/2212.12604v1.pdf but anything will do)
  2. Make a request against the GROBID API. I used the HuggingFace demo API: curl https://kermitt2-grobid.hf.space/api/processHeaderDocument --form input=@Downloads/2212.12604v1.pdf
  3. See that the output contains BibTeX and not TEI XML:
    @misc{-1,
    author = {},
    title = {Search for new physics in the τ lepton plus missing transverse momentum final state in proton-proton collisions at √ s = 13 TeV The CMS Collaboration},
    date = {2022-12-23},
    year = {2022},
    month = {12},
    day = {23},
    eprint = {arXiv:2212.12604v1[hep-ex]},
    abstract = {A search for physics beyond the standard model (SM) in the final state with a hadronically decaying tau lepton and a neutrino is presented. This analysis is based on data recorded by the CMS experiment from proton-proton collisions at a center-ofmass energy of 13 TeV at the LHC, corresponding to a total integrated luminosity of 138 fb-1. The transverse mass spectrum is analyzed for the presence of new physics. No significant deviation from the SM prediction is observed. Limits are set on the production cross section of a W boson decaying into a tau lepton and a neutrino. Lower limits are set on the mass of the sequential SM-like heavy charged vector boson and the mass of a quantum black hole. Upper limits are placed on the couplings of a new boson to the SM fermions. Constraints are put on a nonuniversal gauge interaction model and an effective field theory model. For the first time, upper limits on the cross section of t-channel leptoquark (LQ) exchange are presented. These limits are translated into exclusion limits on the LQ mass and on its coupling in the t-channel. The sensitivity of this analysis extends into the parameter space of LQ models that attempt to explain the anomalies observed in B meson decays. The limits presented for the various interpretations are the most stringent to date. Additionally, a model-independent limit is provided.}
    }

Requested info

Linux amd64 through lfoppiano/grobid:0.7.3 Docker image & whatever huggingface is using

openjdk 17.0.2 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)

lfoppiano commented 7 months ago

Hi @michamos, long time no see 😄 It's nice that you're back working with GROBID. Thanks for opening the issue.

This looks like a problem with how Jakarta selects the default when Accept is not specified. Locally, when I send the same request you posted, I get TEI XML, but I think the result depends on the order in which the methods are loaded. There seems to be no clearly defined behaviour, although this looks strange.

One solution I saw is to add an additional filter to default the Accept to application/xml when undefined, but it seems a bit of a hack and might affect other endpoints.
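To make the idea concrete, here is a rough sketch of such a defaulting filter. It is written as Python WSGI middleware purely for illustration: GROBID itself is a Java service, and none of the names below (`DefaultAcceptMiddleware`, `echo_accept_app`) exist in its code base. The sketch assumes that a wildcard `*/*` should be treated like a missing header, since curl sends `Accept: */*` when no header is given on the command line.

```python
DEFAULT_ACCEPT = "application/xml"


class DefaultAcceptMiddleware:
    """Rewrite a missing or wildcard Accept header before routing happens.

    Illustrative only: a hypothetical stand-in for a request filter
    in the real Java service.
    """

    def __init__(self, app, default=DEFAULT_ACCEPT):
        self.app = app
        self.default = default

    def __call__(self, environ, start_response):
        accept = environ.get("HTTP_ACCEPT", "").strip()
        # No header at all, or a bare wildcard: nothing was really requested,
        # so pin the documented default instead of leaving it to the framework.
        if accept in ("", "*/*"):
            environ["HTTP_ACCEPT"] = self.default
        return self.app(environ, start_response)


def echo_accept_app(environ, start_response):
    """Toy application that reports the Accept header it finally received."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["HTTP_ACCEPT"].encode()]


app = DefaultAcceptMiddleware(echo_accept_app)
```

Because the middleware wraps the whole application, it changes the default for every endpoint, which is exactly the side effect mentioned above. Restricting it by path would narrow the impact at the cost of more special-casing.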

I will look into it in more detail.

michamos commented 7 months ago

Hi @lfoppiano, indeed :) We've been using GROBID in prod for INSPIRE for quite a while now. We use it to extract author and affiliation info from PDFs and to segment references for interactive search (so users can copy/paste references from a paper and it magically works). Unfortunately, our current resources are very limited, so we can't really contribute beyond submitting bug reports.

Thanks for looking into the issue!

lfoppiano commented 6 months ago

I dug into this and did not find a clean solution. I'm quite surprised that there is no way to define a default behavior. The behavior seems to vary unpredictably with the platform the service is running on.

Nevertheless, I updated the documentation to state that the Accept header is required.