kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io

Raw Output #221

Open dominic-sps opened 6 years ago

dominic-sps commented 6 years ago

I am processing the reference below using the processRawReference option: Greaves M, Lawlor F. Angioedema: manifestations and management. J Am Acad Dermatol. 1991;25(1 Pt 2):155-161;

First I got the output below:

        <title level="j">J Am Acad Dermatol</title>
        <imprint>
            <biblScope unit="volume">25</biblScope>
            <biblScope unit="issue">1</biblScope>
            <biblScope unit="page" from="155" to="161" />
            <date type="published" when="1991" />
        </imprint>

Then I did some sample training and got the following output:

        <title level="j">J Am Acad Dermatol</title>
        <imprint>
            <biblScope unit="volume">25</biblScope>
            <biblScope unit="issue">1 2</biblScope>
            <biblScope unit="page" from="155" to="161" />
            <date type="published" when="1991" />
        </imprint>

Referring to the details above: I noted that some kind of post-processing is happening and that it removes content. Missing content is the biggest problem we currently face with GROBID. Is there any way to get the basic/raw text without any post-processing? I am not looking for proper XML output; I am looking for something like the hash output option of https://github.com/inukshuk/anystyle-parser.
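To make the request concrete, the kind of output I have in mind is roughly the following (purely illustrative, hand-written for this one reference; the keys are my own and not anystyle's or GROBID's actual field names):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class RawFieldsIllustration {
        public static void main(String[] args) {
            // Illustrative only: raw field values kept exactly as they appear in the input,
            // without normalisation, similar in spirit to anystyle-parser's hash output.
            Map<String, String> rawFields = new LinkedHashMap<>();
            rawFields.put("authors", "Greaves M, Lawlor F");
            rawFields.put("title", "Angioedema: manifestations and management");
            rawFields.put("journal", "J Am Acad Dermatol");
            rawFields.put("date", "1991");
            rawFields.put("volume", "25");
            rawFields.put("issue", "1 Pt 2");   // untouched, including the "Pt" part indicator
            rawFields.put("pages", "155-161");
            rawFields.forEach((k, v) -> System.out.println(k + ": " + v));
        }
    }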

kermitt2 commented 6 years ago

Hello Dominic,

There are some design choices behind the scenes: I didn't want to do something like parscit or anystyle-parser (the latter being, in terms of design, more or less a re-packaged parscit), which have a BibTeX-style representation of bibliographical information. BibTeX is "presentation-oriented", and this kind of raw field cannot be exploited easily with digital library tools, OpenURL, etc. I don't think this is the kind of representation we want for a bibliographic tool.

Contrary to the parscit training data format for citations, only the values of the fields are tagged in GROBID, not the syntactic sugar around them (parentheses, punctuation, etc.). So, for instance, there is no post-processing for the issue field: what you get is actually already the "raw" value. There is of course some post-processing for certain fields, but it is limited, because only the actually useful values are labelled and the rest is simply ignored and does not require post-processing.

So if you have a special use case where you need raw field output, GROBID might simply not be the right tool, as it prioritises compatibility with bibliographic services. I think this makes GROBID more powerful and clean, and it improves the CRF recognition accuracy. The representation choices in parscit and the CORA corpus were made a long time ago and are, I think, not the best: for instance, anystyle-parser and parscit do not even distinguish issue from volume, because issue was not an independent field in the CORA corpus, and one reason for this is that "raw fields" were labelled, leaving no room for unlabelled content.

Now, more concretely, to address your issue of possibly missing information: the first thing you can do is use the consolidate parameter. As the bibliographical data tagged by GROBID are compatible with bibliographical services, CrossRef will complete your citation as follows (processed with the current GROBID version):

<biblStruct >
    <analytic>
        <title level="a" type="main">Angioedema: Manifestations and management</title>
        <author>
            <persName
                xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">M</forename>
                <surname>Greaves</surname>
            </persName>
        </author>
        <author>
            <persName
                xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">F</forename>
                <surname>Lawlor</surname>
            </persName>
        </author>
    </analytic>
    <monogr>
        <title level="j">Journal of the American Academy of Dermatology</title>
        <title level="j" type="abbrev">Journal of the American Academy of Dermatology</title>
        <idno type="ISSN">01909622</idno>
        <imprint>
            <biblScope unit="volume">25</biblScope>
            <biblScope unit="issue">1</biblScope>
            <biblScope unit="page" from="155" to="161" />
            <date type="published" when="1991" />
        </imprint>
    </monogr>
    <idno type="doi">10.1016/0190-9622(91)70183-3</idno>
</biblStruct>

So you get much richer and more complete information than what was in the input string, and the issue field is normalised according to the CrossRef publisher metadata.
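For reference, a minimal sketch of requesting this consolidation against a locally running GROBID service (the default port 8070 is assumed, and the endpoint and form parameter names may differ slightly across GROBID versions, so check them against the service documentation for yours):

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class ConsolidateCitationExample {
        public static void main(String[] args) throws Exception {
            String citation = "Greaves M, Lawlor F. Angioedema: manifestations and management. "
                    + "J Am Acad Dermatol. 1991;25(1 Pt 2):155-161";
            // Form-encoded body; "citations" and "consolidateCitations" are the parameter names
            // as documented for the citation service - double-check for your version.
            String body = "citations=" + URLEncoder.encode(citation, StandardCharsets.UTF_8)
                    + "&consolidateCitations=1";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8070/api/processCitation"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // TEI <biblStruct> similar to the one above
        }
    }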

Second, if you really want the raw fields, you can get them with the Java API, in the class BiblioItem (see the fields prefixed by original*). But it won't help you here, because, as mentioned above, the parentheses and Pt in (1 Pt 2) are not labelled, so in this case the raw issue is 1 2 anyway, which is consistent with the design choice. Of course this could be improved by accepting Pt as a value for the issue field in the training data and/or adding some post-processing or a data model for issue that is aware of Pt as a part indicator, but this is complicated, and it would anyway be solved more easily and naturally with consolidate, or at the stage of resolving/matching the bibliographical reference against a bibliographical database (which always adds some flexible/partial/fuzzy matching for robustness).
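A minimal sketch of this through the Java API, assuming a configured grobid-home; the exact processRawReference signature and the getOriginal* accessor names are indicative guesses from the field names and may differ across versions:

    import org.grobid.core.data.BiblioItem;
    import org.grobid.core.engines.Engine;
    import org.grobid.core.factory.GrobidFactory;
    import org.grobid.core.main.LibraryLoader;
    import org.grobid.core.utilities.GrobidProperties;

    public class OriginalFieldsExample {
        public static void main(String[] args) {
            // Initialisation sketch; the exact setup (grobid-home resolution, native libraries)
            // depends on the GROBID version you build against.
            GrobidProperties.getInstance();
            LibraryLoader.load();
            Engine engine = GrobidFactory.getInstance().createEngine();

            String raw = "Greaves M, Lawlor F. Angioedema: manifestations and management. "
                    + "J Am Acad Dermatol. 1991;25(1 Pt 2):155-161";
            // Second argument toggles consolidation (0 = off here); the signature may vary by version.
            BiblioItem bib = engine.processRawReference(raw, 0);

            System.out.println(bib.getIssue());           // normalised value, e.g. "1 2"
            System.out.println(bib.getOriginalIssue());   // hypothetical accessor for the original* field
            System.out.println(bib.getOriginalVolume());  // hypothetical accessor for the original* field
        }
    }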

dominic-sps commented 6 years ago

> So if you have a special use case where you need raw field output, GROBID might simply not be the right tool, as it prioritises compatibility with bibliographic services. I think this makes GROBID more powerful and clean, and it improves the CRF recognition accuracy.

I agree, GROBID is a very powerful engine indeed. I am not sure how others are using GROBID in a live environment, but I have discussed this with many people in the publishing industry, and missing content is a big issue for everyone. So my use case is simple: if the GROBID engine is not able to identify a particular piece of content, it should just leave it as it is or put it in an XML comment. In my opinion this would make GROBID even more powerful. This is not specific to any particular section like the bibliography or the header. I am not a Java person, but if I get some pointers on how to do this, I can arrange to give it a try.
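If I understand the Java side correctly, something along these lines is what I have in mind as a stop-gap on the caller side, not a patch to GROBID itself (a rough sketch under the same assumptions as the Java API example above; BiblioItem.toTEI(...) is assumed to exist in roughly this form, so please correct me):

    import org.grobid.core.data.BiblioItem;
    import org.grobid.core.engines.Engine;
    import org.grobid.core.factory.GrobidFactory;
    import org.grobid.core.main.LibraryLoader;
    import org.grobid.core.utilities.GrobidProperties;

    public class KeepRawInputExample {

        // Append the untouched input string as an XML comment next to GROBID's TEI output,
        // so downstream consumers can always fall back on the original text.
        static String teiWithRawComment(Engine engine, String raw) {
            BiblioItem bib = engine.processRawReference(raw, 0);
            String tei = bib.toTEI(0); // assumed serialisation method; check the actual BiblioItem API
            String safeRaw = raw.replace("--", "- -"); // "--" is not allowed inside XML comments
            return "<!-- raw: " + safeRaw + " -->\n" + tei;
        }

        public static void main(String[] args) {
            GrobidProperties.getInstance();
            LibraryLoader.load();
            Engine engine = GrobidFactory.getInstance().createEngine();
            System.out.println(teiWithRawComment(engine,
                    "Greaves M, Lawlor F. Angioedema: manifestations and management. "
                    + "J Am Acad Dermatol. 1991;25(1 Pt 2):155-161"));
        }
    }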