kermitt2 / grobid

Machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Error case for footnote on abstract #1048

Open lfoppiano opened 1 year ago

lfoppiano commented 1 year ago

Here is another document with segmentation issues: 3834-Article Text-6892-1-10-20190701.pdf

In particular, the first footnote at the end of the abstract is lost. The segmentation model classifies it as <header>:

Copyright   c   copyright   C   Co  Cop Copy    BLOCKSTART  PAGEIN  SAMEFONT    LOWERFONT   0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   0   7   no  0   2   0   1   0   0   1   I-<header>
⃝   2019,   ⃝   ⃝   ⃝   ⃝   ⃝   BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   0   0   7   ,   1   9   0   1   0   0   1   <header>
Intelligence    (www.aaai.org). intelligence    I   In  Int Inte    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   0   7   (..)..  6   8   0   1   0   0   1   <header>
1   https://stanfordmlgroup.github.io/competitions/chexpert 1   1   1   1   1   BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  ALLDIGIT    1   0   0   0   0   0   0   0   0   7   ://..// 7   10  0   1   0   0   1   <header>

Then the header model tags it as <copyright>:

1   1   1   1   1   1   1   1   1   1   BLOCKIN LINESTART   LINEINDENT  SAMEFONT    LOWERFONT   0   0   NOCAPS  ALLDIGIT    1   0   0   0   0   1   0   0   NOPUNCT 0   0   1   0   <copyright>
https   https   h   ht  htt http    s   ps  tps ttps    BLOCKIN LINEIN  LINEINDENT  SAMEFONT    HIGHERFONT  0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <copyright>
:   :   :   :   :   :   :   :   :   :   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   PUNCT   0   0   1   0   <copyright>
/   /   /   /   /   /   /   /   /   /   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <copyright>
/   /   /   /   /   /   /   /   /   /   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <copyright>
stanfordmlgroup stanfordmlgroup s   st  sta stan    p   up  oup roup    BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <copyright>
.   .   .   .   .   .   .   .   .   .   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   DOT 0   0   1   0   <copyright>
github  github  g   gi  git gith    b   ub  hub thub    BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <copyright>
.   .   .   .   .   .   .   .   .   .   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   DOT 0   0   1   0   <copyright>
io  io  i   io  io  io  o   io  io  io  BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   1   0   1   NOPUNCT 0   0   1   0   <copyright>
/   /   /   /   /   /   /   /   /   /   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <copyright>
competitions    competitions    c   co  com comp    s   ns  ons ions    BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   1   NOPUNCT 0   0   1   0   <copyright>
/   /   /   /   /   /   /   /   /   /   BLOCKIN LINEIN  LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <copyright>
chexpert    chexpert    c   ch  che chex    t   rt  ert pert    BLOCKEND    LINEEND LINEINDENT  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <copyright>
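Each row in these dumps is one layout token followed by its typographic/layout features, with the predicted label in the last column (an "I-" prefix marking the first token of a field). A minimal Python sketch for pulling out the (token, label) pairs for inspection — a hypothetical helper, not part of GROBID:

```python
# Hypothetical helper (not part of GROBID) for inspecting CRF feature
# dumps like the ones above: each whitespace-separated row starts with
# the raw token and ends with the predicted label.
def parse_feature_rows(dump: str) -> list[tuple[str, str]]:
    """Return (token, label) pairs from a GROBID-style feature dump."""
    pairs = []
    for line in dump.strip().splitlines():
        cols = line.split()
        if len(cols) < 2:
            continue
        token, label = cols[0], cols[-1]
        # An "I-" prefix marks the first token of a new labeled field.
        pairs.append((token, label.removeprefix("I-")))
    return pairs

sample = """\
1 1 1 BLOCKIN NOPUNCT I-<copyright>
https https h BLOCKIN NOPUNCT <copyright>"""
print(parse_feature_rows(sample))
# -> [('1', '<copyright>'), ('https', '<copyright>')]
```

This makes it easy to scan a long dump and spot where a footnote's tokens drifted into the wrong field, as happened here.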

Now the interesting thing is that on arXiv (https://arxiv.org/abs/1901.07031) the paper is stated as CC-BY; however, both on the paper itself and on the landing page (https://ojs.aaai.org/index.php/AAAI/article/view/3834) it says "copyright ..blablabla". So I assume we cannot safely use it for training data.

kermitt2 commented 1 year ago

In particular, the first footnote at the end of the abstract is lost. The segmentation model classifies it as <header>.

Ah, normally footnotes related to the header part would indeed be attached to the header by the segmentation model. From the guidelines:

However, any footnotes referenced from within the <body> should remain outside the header element, even if they are on the first page or surrounded by <front> fragments.

So if the footnote is called from within the body part, it is outside the header; otherwise it is in the header.
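The rule above amounts to a simple decision on where a first-page footnote belongs; a sketch of that decision (hypothetical function name, not GROBID's actual implementation):

```python
# Sketch of the guideline's placement rule (hypothetical helper, not
# GROBID's actual implementation): a footnote on the first page stays
# outside the header only if it is referenced from within the body.
def footnote_zone(referenced_from_body: bool) -> str:
    """Segmentation zone a first-page footnote should be attached to."""
    return "<body>" if referenced_from_body else "<header>"

# The footnote in the reported PDF is called from the abstract, i.e.
# from the header part, so per the guidelines it belongs in the header:
print(footnote_zone(referenced_from_body=False))  # -> <header>
```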

However, there is no mechanism implemented to attach such a footnote in the abstract, nor is there header training data currently covering this case, so it is misclassified as <copyright>.

Now the interesting thing is that on arXiv (https://arxiv.org/abs/1901.07031) the paper is stated as CC-BY; however, both on the paper itself and on the landing page (https://ojs.aaai.org/index.php/AAAI/article/view/3834) it says "copyright ..blablabla". So I assume we cannot safely use it for training data.

License info applies to a specific file, not to the article in general across all its versions.

ArXiv has the preprint under the authors' copyright - the authors choose the license, usually only the arXiv license, but here CC-BY. The publisher's version is a different document, under the publisher's copyright, and the publisher has chosen to retain its copyright to prevent redistribution.

For this reason, for training data, we have to select the right file version.

lfoppiano commented 1 year ago

ArXiv has the preprint under the authors' copyright - the authors choose the license, usually only the arXiv license, but here CC-BY. The publisher's version is a different document, under the publisher's copyright, and the publisher has chosen to retain its copyright to prevent redistribution.

For this reason, for training data, we have to select the right file version.

OK, if I understood correctly, we should expect the arXiv version to be different from the publisher version. In this case, however, the file is the same and there is a single version.

kermitt2 commented 1 year ago

OK, if I understood correctly, we should expect the arXiv version to be different from the publisher version. In this case, however, the file is the same and there is a single version.

Yes, we can expect a different version, but not necessarily; there are usually 3 versions for a published paper:

- the preprint (as submitted by the authors),
- the postprint (the accepted manuscript), and
- the publisher's version.

Depending on the state of the paper, the copyright holder can be different and the paper file can carry a different license.

Then, even with the publisher's version not in CC-BY, my understanding is that the publisher can allow a deposit by the authors on a preprint server, but the copyright and license normally remain (not further sharable outside the archive server). This is possibly the case for this paper. Or, more likely, the authors simply put the final paper on arXiv without following the publisher's license, because nobody cares (I'm not sure about the moderation on arXiv, but on HAL, for instance, the deposit would be removed).

On HAL, all preprints can be deposited (EU legal framework), as well as all postprints not under embargo, but the publisher's version cannot be deposited except with permission from the publisher, a sharable license selected by the publisher, or payment by the authors (gold open access).

From the publisher site for this conference proceedings:

Copyright to individual papers as well as the proceedings as a whole is fully owned by the Association for the Advancement of Artificial Intelligence. Permission is required for republication. Please consult the AAAI copyright form for details.

lfoppiano commented 1 year ago

Then, even with the publisher's version not in CC-BY, my understanding is that the publisher can allow a deposit by the authors on a preprint server, but the copyright and license normally remain (not further sharable outside the archive server). This is possibly the case for this paper. Or, more likely, the authors simply put the final paper on arXiv without following the publisher's license, because nobody cares (I'm not sure about the moderation on arXiv, but on HAL, for instance, the deposit would be removed).

OK, this makes sense, and I'm glad I understood correctly 😸 It seems unsafe to use this paper as training data 😭