kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.58k stars, 459 forks

Format PDF to detect references #1152

Open HuynhVuInnomize opened 3 months ago

HuynhVuInnomize commented 3 months ago

Can you explain how to edit a PDF's formatting, and what the correct format should be, so that references are detected?

In the attached image, GROBID detects the reference only in the first case (where the word "and" is on the same line as "Vlissides (1995)") and cannot detect the second and third cases. Thank you!

[Image: test_format]

lfoppiano commented 2 months ago

@HuynhVuInnomize it depends on the paper, its layout, and other statistical factors. The model responsible for this extraction is the fulltext model, which has only around 30 training examples. Adding a few more training examples that cover the problematic cases should help quickly. If you can share the examples, we can include them when we next work on the training data. However, if you are in a hurry, you can create and correct the training data on your own.
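
To find which cases are worth adding as training data, it can help to list the citation markers GROBID actually recognized in a document. The following is a minimal sketch, assuming the PDF has already been processed by GROBID's full-text service and the resulting TEI XML is available as a string. In that output, recognized in-text citations are wrapped in `<ref type="bibr">` elements. The function name `citation_callouts` and the sample TEI are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def citation_callouts(tei_xml: str) -> list[str]:
    """Return the text of every in-text citation marker in a TEI document.

    Hypothetical helper: it only reads <ref type="bibr"> elements and
    normalizes their whitespace; it does not call GROBID itself.
    """
    root = ET.fromstring(tei_xml)
    refs = root.iterfind(".//tei:ref[@type='bibr']", TEI_NS)
    return [" ".join("".join(ref.itertext()).split()) for ref in refs]


# Hypothetical sample, trimmed to the relevant elements. The first mention
# is tagged as a citation; the second one was missed and is plain text.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
As shown by <ref type="bibr" target="#b0">Johnson and
Vlissides (1995)</ref>, patterns help.
Johnson and Vlissides (1995) also note this.
</p></body></text></TEI>"""

print(citation_callouts(sample))
```

Comparing this list with the citations visible in the PDF shows which mentions were missed; those pages are the candidates to correct and contribute as training examples.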