kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Split proceedings PDF in multiple articles #80

Open jacquerie opened 9 years ago

jacquerie commented 9 years ago

Hi, and thanks for this awesome library.

Our users have asked for a feature: use GROBID to split a proceedings PDF into its articles. Maybe the output could be a TEI document enriched with a tag that separates the different articles. Example:

<TEI>
  <teiArticle>
    <teiHeader>
      [...]
    </teiHeader>
    <text>
      [...]
    </text>
  </teiArticle>
</TEI>

Another idea would be simply returning a list of numbers, representing the pages at which new articles start.

What do you think about this feature?
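To make the "list of page numbers" idea concrete, here is a minimal Python sketch of how a client might turn such a list into per-article page ranges. The function name `page_ranges` and its parameters are hypothetical and not part of GROBID; the sketch also assumes every article starts at the top of a page.

```python
def page_ranges(start_pages, total_pages):
    """Turn 1-based article start pages into inclusive (first, last) page ranges.

    Hypothetical helper, not a GROBID API. Assumes each article begins
    at the top of a page.
    """
    starts = sorted(set(start_pages))
    if not starts:
        return []
    if starts[0] < 1 or starts[-1] > total_pages:
        raise ValueError("start pages must lie within 1..total_pages")
    # Each article runs up to the page before the next article starts;
    # the last article runs to the end of the volume.
    ends = [nxt - 1 for nxt in starts[1:]] + [total_pages]
    return list(zip(starts, ends))


if __name__ == "__main__":
    for first, last in page_ranges([1, 9, 17], total_pages=24):
        print(f"article: pages {first}-{last}")
```

The resulting ranges could then be passed to any PDF tool that extracts page intervals.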

kermitt2 commented 9 years ago

Thank you for the feedback!

This is indeed an additional service that has already been requested. GROBID has a model called ebook, which was introduced to do something a little more general: split an ebook into chapters. Segmenting proceedings into individual articles could be seen as a particular case of that task.

Unfortunately this work was just an experiment and is far from being usable... There are some issues. For instance, in some proceedings, articles can start in the middle of a page. Another is the table of contents, which should be parsed and used to validate the article segmentation. The process also needs to be fast, so it cannot work on tokens as small as words and should work on blocks instead.

In TEI, the conformant encoding for proceedings could be based on <teiCorpus>, containing its own <teiHeader> and a list of <TEI> elements (one per article).
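As an illustration of that nesting, here is a minimal Python sketch that builds a <teiCorpus> with one volume-level <teiHeader> and one <TEI> per article, using only the standard library. The helper names `build_corpus` and `_header` are hypothetical, this is not GROBID output, and the result is not schema-valid TEI: the TEI namespace and the other mandatory header parts are omitted to keep the structure visible.

```python
import xml.etree.ElementTree as ET


def _header(title):
    """Build a bare teiHeader holding only a title (hypothetical helper)."""
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    return header


def build_corpus(volume_title, article_titles):
    """Nest one TEI element per article under a single teiCorpus."""
    corpus = ET.Element("teiCorpus")
    # The corpus-level header describes the proceedings volume as a whole.
    corpus.append(_header(volume_title))
    for title in article_titles:
        tei = ET.SubElement(corpus, "TEI")
        # Each article carries its own header and its own text.
        tei.append(_header(title))
        ET.SubElement(tei, "text")
    return corpus


if __name__ == "__main__":
    corpus = build_corpus("Proceedings of X", ["Paper A", "Paper B"])
    print(ET.tostring(corpus, encoding="unicode"))
```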

datablend commented 8 years ago

Somewhat related to this issue: some older scientific PDFs contain articles that start halfway down the first page (preceded by the references of another article). GROBID does not handle these cases cleanly. Is this planned?

kermitt2 commented 8 years ago

Hello! This is a bit different, and indeed not really covered by GROBID at this stage. The segmentation model is in charge of the overall segmentation of an article into header, body, footnotes, bibliographical sections, annexes, etc. It's probably the best place to tackle this problem seriously. Currently the training data for the segmentation model does not cover these cases, so it may work, but really by chance ;)

This would be complementary to a general segmentation into "chapters/articles" that takes a complete proceedings volume or a journal issue as input.

Don't hesitate to open another specific issue for this case, to make it more visible and to keep track of the progress on this!