Open lfoppiano opened 1 year ago
In particular, the first footnote at the end of the abstract is lost. The segmentation model classifies it as <header>.
Ahhh normally, footnotes related to the header part would indeed be attached to the header by the segmentation model, from the guidelines:
However, any footnotes referenced from within the <body> should remain outside the header element, even if they are on the first page or surrounded by <front> fragments.
So if the footnote is called from within the body part, it is outside the header, otherwise it is in the header.
However, no mechanism is implemented to attach such a footnote to the abstract, and the header training data does not currently cover this case, so it is misclassified as copyright.
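As a rough illustration of the guideline rule quoted above, here is a minimal sketch. It is not GROBID code: the `Footnote` class, the `called_from_body` flag and the zone labels are hypothetical names chosen for the example, and it assumes we already know where the footnote is called from.

```python
from dataclasses import dataclass


@dataclass
class Footnote:
    text: str
    # Hypothetical flag: True if the footnote marker is called from body text,
    # False if it is called from the header part (title, authors, abstract...).
    called_from_body: bool


def footnote_zone(note: Footnote) -> str:
    """Return the zone a first-page footnote should be annotated in.

    Guideline rule: a footnote called from within the body stays outside
    the header; otherwise it is attached to the header.
    """
    return "footnote" if note.called_from_body else "header"


# A footnote called at the end of the abstract belongs to the header.
abstract_note = Footnote(text="(footnote text)", called_from_body=False)
print(footnote_zone(abstract_note))
```

This only encodes the annotation rule; attaching the footnote to the abstract inside the header model would still need the missing mechanism and training data mentioned above.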
Now the interesting thing is that on arXiv (https://arxiv.org/abs/1901.07031) it is stated as CC-BY, however both the paper and the landing page (https://ojs.aaai.org/index.php/AAAI/article/view/3834) state "copyright ..blablabla". So I assume we cannot safely use it for training data.
License information applies to a file, not to an article in general with all its different versions.
arXiv has the preprint with the authors' copyright - the authors choose the license, usually only the arXiv license, but here CC-BY. The publisher version is a different document, with the publisher's copyright, and the publisher has chosen to retain its copyright to prevent redistribution.
For this reason, for training data, we have to select the right file version.
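A minimal sketch of that selection, under stated assumptions: the `PaperFile` record, the `OPEN_LICENSES` allow-list and the license strings are hypothetical and only illustrate that the check is made per file, not per article.

```python
from dataclasses import dataclass

# Assumed allow-list of redistributable licenses, for illustration only.
OPEN_LICENSES = {"CC-BY", "CC0"}


@dataclass
class PaperFile:
    source: str    # where the file was obtained, e.g. "arXiv" or "publisher"
    version: str   # e.g. "preprint", "postprint", "publisher's version"
    license: str   # license attached to this specific file


def usable_for_training(files: list[PaperFile]) -> list[PaperFile]:
    """Keep only the files whose own license allows redistribution."""
    return [f for f in files if f.license in OPEN_LICENSES]


files = [
    PaperFile("arXiv", "preprint", "CC-BY"),
    PaperFile("publisher", "publisher's version", "all rights reserved"),
]
selected = usable_for_training(files)
print([f.source for f in selected])
```

The sketch trusts the declared license of each file; it does not detect the case where a file's declared license contradicts a copyright notice printed in the file itself.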
OK, if I understood correctly, we should expect the arXiv version to be different from the publisher version. In this case the file is the same and there is a single version.
Yes, we can expect a different version, but not necessarily. There are usually 3 versions of a published paper:
preprint: version before peer review
postprint or Author's Accepted Manuscript (AAM): peer review done, complete, but not publisher formatted
publisher's version: like postprint but with publisher formatting and publisher's publication
Depending on the state of the paper, the copyright holder can be different and the paper file can have a different license.
Then, even when the publisher's version is not CC-BY, my understanding is that the publisher can allow the authors to deposit it on a preprint server, but the copyright and license normally remain (the file is not further sharable outside the archive server). This is possibly the case for this paper. Or, more likely, the authors simply put the final paper on arXiv without following the publisher license, because nobody cares (I am not sure about moderation on arXiv, but on HAL, for instance, the deposit would be removed).
On HAL, all preprints can be deposited (EU legal framework), as well as all postprints not under embargo. The publisher's version cannot be deposited, except with permission from the publisher or under a sharable license selected by the publisher or paid for by the authors (gold open access).
From the publisher site for this conference proceedings:
Copyright to individual papers as well as the proceedings as a whole is fully owned by the Association for the Advancement of Artificial Intelligence. Permission is required for republication. Please consult the AAAI copyright form for details.
OK, this makes sense and I'm glad I understood correctly 😸 It seems not safe to use this paper as training data 😭
Here is another document with segmentation issues: 3834-Article Text-6892-1-10-20190701.pdf
In particular, the first footnote at the end of the abstract is lost. The segmentation model classifies it as <header>. It is then tagged as <copyright>.
Now the interesting thing is that on arXiv (https://arxiv.org/abs/1901.07031) it is stated as CC-BY, however both the paper and the landing page (https://ojs.aaai.org/index.php/AAAI/article/view/3834) state "copyright ..blablabla". So I assume we cannot safely use it for training data.