pkp / ots

PKP XML Parsing Service
GNU General Public License v3.0

Lost data from grobid TEI XML references transformation #89

Open Vitaliy-1 opened 7 years ago

Vitaliy-1 commented 7 years ago

A typical parsed bibliography list item from the Open Typesetting Stack at http://pkp-udev.lib.sfu.ca/ in JATS format (authors omitted):

<ref id="R24">
<element-citation>
<article-title>
Randomized Controlled Trial of Family Therapy in Advanced Cancer Continued Into Bereavement
</article-title>
<source>Journal of Clinical Oncology</source>
<year>2016-apr</year>
<fpage>1921</fpage>
<lpage>1927</lpage>
</element-citation>
</ref>

And here is the same item from the Grobid transformation alone:

<biblStruct coords="7,103.10,142.06,449.34,10.80;7,103.10,155.86,449.44,10.80;7,103.10,167.17,449.72,13.30"  xml:id="b12">
                    <analytic>
                        <title level="a" type="main">Randomized Controlled Trial of Family Therapy in Advanced Cancer Continued Into Bereavement</title>
                    </analytic>
                    <monogr>
                        <title level="j">J Clin Oncol</title>
                        <imprint>
                            <biblScope unit="volume">1</biblScope>
                            <biblScope unit="issue">16</biblScope>
                            <biblScope unit="page" from="34" to="1621" />
                            <date type="published" when="2016" />
                        </imprint>
                    </monogr>
</biblStruct>

As you can see, the volume and issue information is missing from the resulting JATS XML. I suppose the Grobid module parses this data from the DOI or PubMed links that are attached to all of our bibliographic citation list items, and that it is lost somewhere at the TEI-to-JATS transformation stage. This issue affects all the articles I have processed with this online service so far (about 20). The page, journal title and year information also differs, so maybe the references come from another module. In that case, volume and issue could be taken from Grobid.
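To compare the two outputs field by field, the Grobid entry can be read with a few lines of Python. This is only a minimal sketch and not part of OTS: `local_name` and `tei_fields` are hypothetical helper names, and the fragment is the one above with the `coords` attribute dropped. Full Grobid output normally declares the TEI namespace, so the sketch matches tags by local name to work in both cases.

```python
import xml.etree.ElementTree as ET

# The Grobid TEI entry quoted above, without the coords attribute.
TEI_FRAGMENT = """
<biblStruct xml:id="b12">
  <analytic>
    <title level="a" type="main">Randomized Controlled Trial of Family Therapy in Advanced Cancer Continued Into Bereavement</title>
  </analytic>
  <monogr>
    <title level="j">J Clin Oncol</title>
    <imprint>
      <biblScope unit="volume">1</biblScope>
      <biblScope unit="issue">16</biblScope>
      <biblScope unit="page" from="34" to="1621" />
      <date type="published" when="2016" />
    </imprint>
  </monogr>
</biblStruct>
"""


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def tei_fields(bibl):
    """Collect volume, issue, page range and year from one biblStruct.

    Hypothetical helper: keys are named after the JATS elements they
    would map to (volume, issue, fpage, lpage, year).
    """
    fields = {}
    for el in bibl.iter():
        name = local_name(el.tag)
        if name == "biblScope":
            unit = el.get("unit")
            if unit in ("volume", "issue"):
                fields[unit] = (el.text or "").strip()
            elif unit == "page":
                fields["fpage"] = el.get("from")
                fields["lpage"] = el.get("to")
        elif name == "date" and el.get("type") == "published":
            fields["year"] = el.get("when")
    return fields


fields = tei_fields(ET.fromstring(TEI_FRAGMENT.strip()))
print(fields)
```

Running the same extraction over the JATS output would make the differences in pages and year easy to list per reference.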

Vitaliy-1 commented 7 years ago

Hmm, as I see from the Grobid TEI-to-JATS XSLT, Grobid is not used for reference rendering at all. That is a pity, since it may parse references better than the other modules :)

axfelix commented 7 years ago

That's correct, we don't use Grobid for reference parsing -- either Cermine or meTypeset is used to detect the reference section, which is then sent to CrossRef to match known-good data, and ParsCit is used to parse any references that didn't have a DOI and couldn't be looked up. ParsCit still outperforms all other local reference parsing solutions that we've tried.

Vitaliy-1 commented 7 years ago

Sometimes Cermine does not detect the reference section correctly. For example, in an article that I tested with Cermine only, the first 2 references were lost: they were parsed as article text. That was not the case with Grobid. I have not done many tests with the latter, so I cannot say for sure which is better. I am also planning to parse all our articles with the Open Typesetting Stack, and I can compare the resulting reference sections with their Grobid counterparts to see the difference, if that would help you in development, of course.

Nevertheless, it would be great to add volume and issue tags to the JATS during the transformation, because for now this is manual work for us. I think they are lost at the stage of rendering the CrossRef data (this is the case when the referenced article has a DOI or PMID).
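As an illustration of the requested output, here is a rough Python sketch that adds `<volume>` and `<issue>` to an existing `<element-citation>`. It is not how OTS builds its JATS: `add_volume_issue` is a hypothetical function, the values are supplied by the caller (whether they would come from CrossRef or from Grobid is left open), and placing them before `<fpage>` is just one reasonable ordering. The values "1" and "16" are copied from the Grobid snippet purely as sample input.

```python
import xml.etree.ElementTree as ET

# The JATS reference quoted at the top of this issue.
JATS_REF = """<ref id="R24">
<element-citation>
<article-title>Randomized Controlled Trial of Family Therapy in Advanced Cancer Continued Into Bereavement</article-title>
<source>Journal of Clinical Oncology</source>
<year>2016-apr</year>
<fpage>1921</fpage>
<lpage>1927</lpage>
</element-citation>
</ref>"""


def add_volume_issue(ref, volume=None, issue=None):
    """Insert <volume>/<issue> into the ref's element-citation.

    Hypothetical helper: existing elements are never overwritten, and
    empty values are skipped.
    """
    citation = ref.find("element-citation")
    if citation is None:
        return ref
    children = list(citation)
    # Insert before <fpage> when present, otherwise append at the end.
    pos = next(
        (i for i, child in enumerate(children) if child.tag == "fpage"),
        len(children),
    )
    for tag, value in (("volume", volume), ("issue", issue)):
        if value and citation.find(tag) is None:
            el = ET.Element(tag)
            el.text = value
            citation.insert(pos, el)
            pos += 1
    return ref


ref = add_volume_issue(ET.fromstring(JATS_REF), volume="1", issue="16")
print(ET.tostring(ref, encoding="unicode"))
```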