kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

GROBID Inconsistent Reference Detection in Custom PDFs: Format Guidelines Needed #1154

Closed: JosVuHuynh closed this issue 3 weeks ago

JosVuHuynh commented 3 months ago

What format does a PDF file need so that GROBID can detect its references? I create the PDFs myself; sometimes detection works and sometimes it doesn’t. I’m not sure what the formatting rules are. Could you please let me know?

lfoppiano commented 3 months ago

By "detect references", do you mean detecting reference callouts (e.g. "In previous work [1] we showed that...") or detecting the references section of the article?

For the first case, there is generally not much training data in GROBID (Fulltext model), but it may be easier if you show me some examples of your generated documents.

JosVuHuynh commented 3 months ago

GwptVMUJQT.pdf T5D17Q7WMj.pdf besG09DFZb.pdf CsoUOcdybT.pdf

Could you review all of these files, @lfoppiano? GROBID does not detect the references when I run it on https://huggingface.co/spaces/kermitt2/grobid. Related issue: https://github.com/kermitt2/grobid/issues/1152

I would like to know the formatting rules I need to follow when creating a new article PDF so that GROBID can accurately detect citations.

lfoppiano commented 3 months ago

There are no "rules" for formatting a document so that GROBID recognises the references. It is more a matter of making the document look like a scientific article. At first glance, the format of these documents is rather far from the layout of a scientific article. For example, there is no header (at least a title and authors), and the page layout is horizontal (landscape).

Then, most importantly, the references don't match the text, so it is normal that GROBID does not extract them correctly.

I adjusted your document and, with some more consistency, the result now looks much better ;-) The body, though, does indeed look like an abstract: Untitled.pdf Untitled.pdf.tei.xml.zip
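
To check whether the callouts in the text line up with the bibliography in a TEI result such as the one attached, a small script along these lines might help. It is only a sketch: the function name `summarize_references` and the embedded `SAMPLE` document are made up for illustration, and it assumes that bibliography entries appear as `biblStruct` elements inside `listBibl` and that callouts appear as `ref` elements with `type="bibr"` whose `target` points at an entry's `xml:id`.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def summarize_references(tei_xml: str) -> dict:
    """Count bibliography entries, callouts, and callouts linked to an entry.

    Hypothetical helper: assumes entries are <biblStruct> elements under
    <listBibl> and callouts are <ref type="bibr" target="#id"> elements.
    """
    root = ET.fromstring(tei_xml)
    entries = root.findall(".//tei:listBibl/tei:biblStruct", TEI_NS)
    entry_ids = {e.get(XML_ID) for e in entries if e.get(XML_ID)}
    callouts = root.findall(".//tei:ref[@type='bibr']", TEI_NS)
    # A callout counts as linked only if its target names an existing entry.
    linked = [
        c for c in callouts
        if (c.get("target") or "").lstrip("#") in entry_ids
    ]
    return {
        "bibliography_entries": len(entries),
        "callouts": len(callouts),
        "linked_callouts": len(linked),
    }


# Made-up sample: two entries, two callouts, only one callout has a target.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>In previous work <ref type="bibr" target="#b0">[1]</ref> we showed
      that this holds <ref type="bibr">[7]</ref>.</p>
    </body>
    <back>
      <div type="references">
        <listBibl>
          <biblStruct xml:id="b0"/>
          <biblStruct xml:id="b1"/>
        </listBibl>
      </div>
    </back>
  </text>
</TEI>"""

if __name__ == "__main__":
    print(summarize_references(SAMPLE))
```

A large gap between `callouts` and `linked_callouts`, or zero `bibliography_entries`, would suggest that the citations in the body and the reference list are not consistent with each other, which is the situation described above.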