kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Question about consolidation behavior #1140

Open oborin1 opened 1 month ago

oborin1 commented 1 month ago

A few weeks ago I moved from CrossRef API consolidation to biblio-glutton, with the CrossRef database loaded as described in its documentation, and noticed a difference in behavior. Previously, with the CrossRef API, consolidation returned the BibTeX data for the translated version of an article when given its transliterated original, which is the desired result. I suppose that the match was based on the author string and year.

@article{15,
  author = {Bersenev, I S and Bragin, V V and Evstyugin, S N and Petryshev, A Yu and Pigarev, S P and Pokolenko, A Yu},
  title = {Evolution of structure and metallurgical properties of iron ore pellets when fluxing with dolomite, JSC Mikhailovsky GOK named after A.V. Varichev},
  journal = {Steel in Translation},
  publisher = {Allerton Press},
  date = {2020-11},
  year = {2020},
  month = {11},
  pages = {788-794},
  volume = {50},
  number = {11},
  doi = {10.3103/s0967091220110054},
  raw = {16. I.S. Bersenev, V.V. Bragin, S.N. Evstyugin i dr. Evolyutsiya struktury i metallurgicheskikh svoistv zhelezorudnykh okatyshei AO «MGOK im. A.V. Varicheva» pri oflyusovanii dolomitom // Stal'. 2020. № 11. S. 11 – 17.}
}

Unfortunately, biblio-glutton now yields a different result:

@article{0,
  author = {Bersenev, I S and Bragin, V V and Evstyugin I Dr, S N},
  title = {Evolyutsiya struktury i metallurgicheskikh svoistv zhelezorudnykh okatyshei AO},
  journal = {MGOK im. A.V. Varicheva» pri oflyusovanii dolomitom // Stal},
  date = {2020},
  year = {2020},
  pages = {11--17},
  volume = {11},
  raw = {I.S. Bersenev, V.V. Bragin, S.N. Evstyugin i dr. Evolyutsiya struktury i metallurgicheskikh svoistv zhelezorudnykh okatyshei AO «MGOK im. A.V. Varicheva» pri oflyusovanii dolomitom // Stal'. 2020. № 11. S. 11 – 17}
}

How can I adjust the consolidation behavior of the biblio-glutton method?

If needed, my OS is Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-105-generic x86_64) and my java version is 17.0.6.

The consolidation with biblio-glutton is still possible with the data of the translated version:

@article{0,
  author = {Bersenev, I S and Bragin, V V and Evstyugin, S N and Petryshev, A Yu.  and Pigarev, S P and Pokolenko, A Yu. },
  title = {Evolution of Structure and Metallurgical Properties of Iron Ore Pellets When Fluxing with Dolomite, JSC Mikhailovsky GOK Named after A.V. Varichev},
  journal = {Steel in Translation},
  publisher = {Allerton Press},
  date = {2020-11},
  year = {2020},
  month = {11},
  pages = {788-794},
  volume = {50},
  number = {11},
  doi = {10.3103/s0967091220110054},
  raw = {Bersenev, I.S., Bragin, V.V., Evstyugin, S.N., Petryshev, A.Yu., Pigarev, S.P., and Pokolenko, A.Yu., Evolution of structure and metallurgical properties of iron ore pellets when fluxing with dolomite, JSC Mikhailovsky GOK named after A.V. Varichev, Steel in Translation, 2020, vol. 50, no. 11, pp. 788-794.}
}
lfoppiano commented 1 month ago

Hi @oborin1, thanks for your report.

I have a few questions.

Did you obtain the original version when you processed the file through GROBID, or did you call biblio-glutton directly?

If through GROBID, could you provide the logs?

I just tested with firstAuthor + title and biblio-glutton yields the correct result, so we need to find out which query was sent to biblio-glutton.

Here is my example (treat it only as a reference for the query, since the server might be down: it is an on-demand GC service):

http://34.28.170.80/glutton/service/lookup?firstAuthor=Bersenev&atitle=Evolution%20of%20structure%20and%20metallurgical%20properties%20of%20iron%20ore%20pellets%20when%20fluxing%20with%20dolomite%2C%20JSC%20Mikhailovsky%20GOK%20named%20after%20A.V.%20Varichev

{"URL":"http://dx.doi.org/10.3103/s0967091220110054","resource":{"primary":{"URL":"http://link.springer.com/10.3103/S0967091220110054"}},"member":"1627","score":0.0,"created":{"date-parts":[[2021,3,11]],"date-time":"2021-03-11T15:05:38Z","timestamp":1615475138000},"update-policy":"http://dx.doi.org/10.1007/springer_crossmark_policy","license":[{"start":{"date-parts":[[2020,11,1]],"date-time":"2020-11-01T00:00:00Z","timestamp":1604188800000},"content-version":"tdm","delay-in-days":0,"URL":"http://www.springer.com/tdm"},{"start":{"date-parts":[[2020,11,1]],"date-time":"2020-11-01T00:00:00Z","timestamp":1604188800000},"content-version":"vor","delay-in-days":0,"URL":"http://www.springer.com/tdm"}],"ISSN":["0967-0912","1935-0988"],"container-title":["Steel in Translation"],"issued":{"date-parts":[[2020,11]]},"issue":"11","prefix":"10.3103","reference-count":16,"author":[{"given":"I. S.","family":"Bersenev","sequence":"first","affiliation":[]},{"given":"V. V.","family":"Bragin","sequence":"additional","affiliation":[]},{"given":"S. N.","family":"Evstyugin","sequence":"additional","affiliation":[]},{"given":"A. Yu.","family":"Petryshev","sequence":"additional","affiliation":[]},{"given":"S. P.","family":"Pigarev","sequence":"additional","affiliation":[]},{"given":"A. Yu.","family":"Pokolenko","sequence":"additional","affiliation":[]}],"DOI":"10.3103/s0967091220110054","is-referenced-by-count":8,"published":{"date-parts":[[2020,11]]},"published-print":{"date-parts":[[2020,11]]},"alternative-id":["1284"],"published-online":{"date-parts":[[2021,3,11]]},"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"title":["Evolution of Structure and Metallurgical Properties of Iron Ore Pellets When Fluxing with Dolomite, JSC Mikhailovsky GOK Named after A.V. Varichev"],"link":[{"URL":"http://link.springer.com/content/pdf/10.3103/S0967091220110054.pdf","content-type":"application/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http://link.springer.com/article/10.3103/S0967091220110054/fulltext.html","content-type":"text/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http://link.springer.com/content/pdf/10.3103/S0967091220110054.pdf","content-type":"application/pdf","content-version":"vor","intended-application":"similarity-checking"}],"source":"Crossref","type":"journal-article","publisher":"Allerton Press","journal-issue":{"issue":"11","published-print":{"date-parts":[[2020,11]]}},"volume":"50","references-count":16,"issn-type":[{"value":"0967-0912","type":"print"},{"value":"1935-0988","type":"electronic"}],"assertion":[{"value":"12 October 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 March 2021","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"deposited":{"date-parts":[[2021,3,11]],"date-time":"2021-03-11T15:22:43Z","timestamp":1615476163000},"language":"en","page":"788-794","short-container-title":["Steel Transl."]}
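
For anyone who wants to reproduce the query above against their own instance, here is a minimal Python sketch that builds the same kind of lookup URL. The `firstAuthor` and `atitle` parameters and the `/service/lookup` path come from the example; the helper name `build_lookup_url` and the `http://localhost:8080` base URL are assumptions for illustration only.

```python
from urllib.parse import quote, urlencode


def build_lookup_url(base_url: str, first_author: str, article_title: str) -> str:
    """Build a biblio-glutton lookup URL for a first-author + title query.

    Hypothetical helper: only the parameter names and the path are taken
    from the example query in this thread.
    """
    params = {"firstAuthor": first_author, "atitle": article_title}
    # quote_via=quote encodes spaces as %20, matching the example URL.
    query = urlencode(params, quote_via=quote)
    return f"{base_url.rstrip('/')}/service/lookup?{query}"


url = build_lookup_url(
    "http://localhost:8080",  # assumed local biblio-glutton instance
    "Bersenev",
    "Evolution of structure and metallurgical properties of iron ore pellets "
    "when fluxing with dolomite, JSC Mikhailovsky GOK named after A.V. Varichev",
)
print(url)
```

Building the URL this way also takes care of percent-encoding commas and non-ASCII characters, which is easy to get wrong when pasting a reference into a shell command.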
oborin1 commented 1 month ago

Hi @lfoppiano,

thank you for your reply!

In all cases I started the GROBID server with the corresponding configuration changes and called the Python client to process the references. (While trying to get logs from my running GROBID container, I found that the default logging settings differ between grobid.yaml and grobid-full.yaml; is that intended?) The attached GROBID logs just say that a particular consolidation wasn't successful: grobid-service.log
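
For readers who want to reproduce this setup without the Python client, the sketch below builds the equivalent raw request to GROBID's `processCitation` endpoint. It assumes a GROBID server on the default `http://localhost:8070` and the documented form parameters `citations`, `consolidateCitations` and `includeRawCitations`; the helper name `build_citation_request` is made up for illustration, and this is not necessarily how the Python client issues its calls.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_citation_request(grobid_url: str, raw_citation: str, consolidate: int = 1) -> Request:
    """Prepare a POST to /api/processCitation for one raw reference string.

    Hypothetical helper; the URL and parameter values are assumptions
    about a default local GROBID installation.
    """
    form = urlencode(
        {
            "citations": raw_citation,
            "consolidateCitations": str(consolidate),  # 0 disables consolidation
            "includeRawCitations": "1",
        }
    ).encode("utf-8")
    return Request(
        f"{grobid_url.rstrip('/')}/api/processCitation",
        data=form,
        headers={"Accept": "application/x-bibtex"},  # ask for BibTeX output
        method="POST",
    )


request = build_citation_request(
    "http://localhost:8070",
    "I.S. Bersenev, V.V. Bragin, S.N. Evstyugin i dr. Evolyutsiya struktury i "
    "metallurgicheskikh svoistv zhelezorudnykh okatyshei AO «MGOK im. A.V. Varicheva» "
    "pri oflyusovanii dolomitom // Stal'. 2020. № 11. S. 11 – 17.",
)
# Sending it needs a running server, for example:
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Which consolidation service is used (CrossRef API or biblio-glutton) is decided by the server configuration, not by this request.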

If I directly call biblio-glutton with the curl request

curl "http://localhost:8080/service/lookup?biblio=I.S.+Bersenev,+V.V.+Bragin,+S.N.+Evstyugin+i+dr.+Evolyutsiya+struktury+i+metallurgicheskikh+svoistv+zhelezorudnykh+okatyshei+AO+«MGOK+im.+A.V.+Varicheva»+pri+oflyusovanii+dolomitom+//+Stal'.+2020.+№+11.+S.+11+–+17"

it returns {"message":"Best bibliographical record did not passed the post-validation"}

In cases with the raw string of the translated version (available in CrossRef database), it yields the correct result.
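
To tell the two outcomes apart when processing many references, something like the following Python sketch could be used. It relies only on the two response shapes seen in this thread (a record carrying a `DOI` field, or an object carrying a `message`); the function name `classify_lookup_response` is hypothetical, and real responses may contain other shapes not covered here.

```python
import json


def classify_lookup_response(body: str) -> tuple[bool, str]:
    """Return (matched, detail) for a biblio-glutton lookup response body.

    detail is the DOI when a record came back, otherwise the message text.
    """
    payload = json.loads(body)
    if "DOI" in payload:
        return True, payload["DOI"]
    return False, payload.get("message", "no record returned")


# Response body quoted above for the transliterated reference.
rejected = '{"message":"Best bibliographical record did not passed the post-validation"}'
# Heavily trimmed stand-in for a successful record (illustrative only).
accepted = '{"DOI":"10.3103/s0967091220110054","type":"journal-article"}'

print(classify_lookup_response(rejected))
print(classify_lookup_response(accepted))
```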

Maybe the post-validation is the key to the solution I seek?

lfoppiano commented 1 month ago

Ok maybe now I understand better. Are you processing a PDF document with the original bibliographic data?

Then I think CrossRef returns the translated version, while with biblio-glutton you don't get any consolidation because of the post-validation. The post-validation is a mechanism to avoid false positives: when the result from biblio-glutton and the input are too different, biblio-glutton prefers to abort the consolidation rather than return a wrong result.
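
To illustrate the idea only (this is not biblio-glutton's actual implementation, whose criteria and thresholds may differ), a post-validation step can be sketched in Python as a similarity check between the input and the best candidate. The `difflib` measure, the 0.7 threshold and the function names are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def passes_post_validation(query_title: str, candidate_title: str, threshold: float = 0.7) -> bool:
    """Accept the candidate only if it is close enough to the query.

    Hypothetical rule: the threshold and the choice of the title as the
    compared field are illustrative assumptions.
    """
    return similarity(query_title, candidate_title) >= threshold


# Title as parsed from the transliterated reference in this issue.
transliterated = "Evolyutsiya struktury i metallurgicheskikh svoistv zhelezorudnykh okatyshei AO"
# Title of the translated record stored in CrossRef.
translated = (
    "Evolution of Structure and Metallurgical Properties of Iron Ore Pellets "
    "When Fluxing with Dolomite, JSC Mikhailovsky GOK Named after A.V. Varichev"
)

print(similarity(transliterated, translated))
print(passes_post_validation(transliterated, translated))
print(passes_post_validation(translated.lower(), translated))
```

A transliterated title and its English translation share few characters in sequence, so a check of this kind rejects the pair even when the retrieved record is the one the user wants.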

oborin1 commented 1 month ago

@lfoppiano, thank you for your answers. Is LookupEngine.java the place to dig further? I am processing text files with reference strings, rather than PDFs. (My intention is to process reference lists for publication, so any mistakes introduced by PDFs are annoying.) Have I understood it correctly that GROBID has no post-validation mechanism when it receives the results from the CrossRef API?

lfoppiano commented 1 month ago

Is LookupEngine.java the place to dig further?

I think so; you can definitely track it down starting from the controller. Feel free to open a specific issue on the biblio-glutton repo.

I am processing text files with reference strings, rather than PDFs. (My intention is to process reference lists for publication, so any mistakes introduced by PDFs are annoying.)

OK, so do you call biblio-glutton directly, without GROBID?

Have I understood it correctly that GROBID has no post-validation mechanism when it receives the results from the CrossRef API?

Actually it's the other way around: GROBID is not responsible for the quality of the retrieval, so it assumes that whatever is returned is the best possible match. When we wrote biblio-glutton we decided not to answer rather than answer something completely wrong. So I would say it's a feature of biblio-glutton that I'm not sure CrossRef has 😉