kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

ptr type="web" not detected #866

Open rodyoukai opened 2 years ago

rodyoukai commented 2 years ago

Hi

I was training the citation model and everything is correctly detected except the URL. This is an example of my training data:

Azaola, Elena (2009). El comercio con el dolor y la esperanza. La extorsión telefónica en México. URVIO, Revista Latinoamericana de Estudios de Seguridad, (6), 115-122. ISSN: 1390-3691. https://www.redalyc.org/articulo.oa?id=552656559008

Trejo Nieto, Alejandra (2013). Las economías de las zonas metropolitanas de México en los albores del siglo xxi. Estudios Demográficos y Urbanos, 28(3), 545-591. ISSN: 0186-7210. https://www.redalyc.org/articulo.oa?id=31230011001

Maybe I am doing something wrong, but I can't spot it.
rodyoukai commented 2 years ago

The same happens with the ISSN.

kermitt2 commented 2 years ago

Hi @rodyoukai !

Thanks for the issue,

I don't see anything wrong in the training examples for URL and ISSN.

<bibl> <author>Azaola, Elena</author> (<date>2009</date>). <title level="a">El comercio con el dolor y la esperanza. La extorsión telefónica en México</title>. <title level="j">URVIO, Revista Latinoamericana de Estudios de Seguridad</title>, <biblScope unit="volume"></biblScope>(<biblScope unit="issue" type="issue">6</biblScope>), <biblScope unit="page">115-122</biblScope>. <idno type="ISSN"> ISSN: 1390-3691</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=552656559008</ptr> </bibl>

<bibl> <author>Trejo Nieto, Alejandra</author> (<date>2013</date>). <title level="a">Las economías de las zonas metropolitanas de México en los albores del siglo xxi</title>. <title level="j">Estudios Demográficos y Urbanos</title>, <biblScope unit="volume">28</biblScope>(<biblScope unit="issue" type="issue">3</biblScope>), <biblScope unit="page">545-591</biblScope>. <idno type="ISSN"> ISSN: 0186-7210</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=31230011001</ptr> </bibl>

When trying them with the current citation model, I have correct identification for web url and issn:

Azaola, Elena (2009). El comercio con el dolor y la esperanza. La extorsión telefónica en México. URVIO, Revista Latinoamericana de Estudios de Seguridad, (6), 115-122. ISSN: 1390-3691. https://www.redalyc.org/articulo.oa?id=552656559008
<biblStruct >
    <analytic>
        <title level="a" type="main">El comercio con el dolor y la esperanza</title>
        <author>
            <persName>
                <forename type="first">Elena</forename>
                <surname>Azaola</surname>
            </persName>
        </author>
        <ptr target="https://www.redalyc.org/articulo.oa?id=552656559008" />
    </analytic>
    <monogr>
        <title level="j">Revista Latinoamericana de Estudios de Seguridad</title>
        <idno type="ISSN">1390-3691</idno>
        <imprint>
            <biblScope unit="issue">6</biblScope>
            <biblScope unit="page" from="115" to="122" />
            <date type="published" when="2009" />
        </imprint>
    </monogr>
</biblStruct>
Trejo Nieto, Alejandra (2013). Las economías de las zonas metropolitanas de México en los albores del siglo xxi. Estudios Demográficos y Urbanos, 28(3), 545-591. ISSN: 0186-7210. https://www.redalyc.org/articulo.oa?id=31230011001
<biblStruct >
    <analytic>
        <title level="a" type="main">Las economías de las zonas metropolitanas de México en los albores del siglo xxi</title>
        <author>
            <persName>
                <forename type="first">Trejo</forename>
                <surname>Nieto</surname>
            </persName>
        </author>
        <author>
            <persName>
                <forename type="first">Alejandra</forename>
            </persName>
        </author>
        <ptr target="https://www.redalyc.org/articulo.oa?id=31230011001" />
    </analytic>
    <monogr>
        <title level="j">Estudios Demográficos y Urbanos</title>
        <idno type="ISSN">0186-7210</idno>
        <imprint>
            <biblScope unit="volume">28</biblScope>
            <biblScope unit="issue">3</biblScope>
            <biblScope unit="page" from="545" to="591" />
            <date type="published" when="2013" />
        </imprint>
    </monogr>
</biblStruct>

Are you sure that there are no XML parsing errors in your training files? Nothing suspicious when training? How many examples are you using for training?

rodyoukai commented 2 years ago

Hi @kermitt2

Thanks for your answer. I ran a few tests; let me tell you about them:

I call the api/processCitation endpoint, sending it a raw reference string and requesting application/x-bibtex.

I get this:

@article{-1, author = {Azaola, Elena}, title = {El comercio con el dolor y la esperanza. La extorsión telefónica en México}, journal = {URVIO, Revista Latinoamericana de Estudios de Seguridad}, date = {2009}, year = {2009}, pages = {115--122}, number = {6} }

But if I use application/xml I get this:


<biblStruct >
    <analytic>
        <title level="a" type="main">El comercio con el dolor y la esperanza. La extorsión telefónica en México</title>
        <author>
            <persName><forename type="first">Elena</forename><surname>Azaola</surname></persName>
        </author>
        <idno>1390-3691</idno>
        <ptr target="https://www.redalyc.org/articulo.oa?id=552656559008" />
    </analytic>
    <monogr>
        <title level="j">URVIO, Revista Latinoamericana de Estudios de Seguridad</title>
        <imprint>
            <biblScope unit="issue">6</biblScope>
            <biblScope unit="page" from="115" to="122" />
            <date type="published" when="2009">2009</date>
        </imprint>
    </monogr>
</biblStruct>

As you can see, in XML I get more fields, but the problem is that the **idno** tag does not have the **type="web"** attribute, the **ptr** tag has a **target** attribute instead of **type**, and the recovered URL is an attribute value instead of text between tags.

By the way, my training data does not contain errors or unusual characters...
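
For readers who want to reproduce the two requests described above, here is a minimal sketch in Python. It rests on several assumptions that are not stated in the thread: a GROBID service reachable at `http://localhost:8070`, the raw reference passed in a `citations` form field, and the output format selected through the HTTP `Accept` header. `build_citation_request` is an illustrative helper name, not part of GROBID.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a GROBID service running locally; adjust host and port as needed.
GROBID_URL = "http://localhost:8070/api/processCitation"


def build_citation_request(raw_reference, accept="application/xml"):
    """Build (but do not send) a POST request for one raw reference string."""
    # Assumption: the reference string goes in the `citations` form field.
    body = urlencode({"citations": raw_reference}).encode("utf-8")
    return Request(
        GROBID_URL,
        data=body,
        headers={
            # Assumption: the response format is chosen via the Accept header.
            "Accept": accept,
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


reference = (
    "Azaola, Elena (2009). El comercio con el dolor y la esperanza. "
    "La extorsión telefónica en México. URVIO, Revista Latinoamericana de "
    "Estudios de Seguridad, (6), 115-122. ISSN: 1390-3691. "
    "https://www.redalyc.org/articulo.oa?id=552656559008"
)

xml_request = build_citation_request(reference)
bibtex_request = build_citation_request(reference, accept="application/x-bibtex")
```

The requests are only constructed here; passing one of them to `urllib.request.urlopen` would send it to a running service.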
kermitt2 commented 2 years ago

Hello !

The encoding of the results follows the TEI, so URLs are encoded like this by definition:

<ptr target="https://www.redalyc.org/articulo.oa?id=552656559008" /> 

<ptr> has no type, and target URL is defined by the @target attribute. Why do you think it is a problem?

Maybe I can stress that the encoding of the training data is different from the encoding of the final processed result. Grobid parsing results are metadata, so normalized and independent from a particular order/presentation/serialization. It's the format expected by a catalogue for instance.

Training data follow the input (for instance noisy token sequences from a PDF) and thus are not normalized. As they follow exactly the input string, the encoding is "inline", identifying spans to be extracted, so content is never in an attribute (XML attributes must be normalized to avoid XML failures).

To generate pre-annotated training data in this format, you can use the batch method createTraining, which produces inline annotations on the exact input reference strings.
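
The difference between the two encodings can be illustrated with a short Python sketch using the standard library. The snippets are shortened from the examples in this thread and are assumed to carry no XML namespace, as in the output pasted above; a full TEI document would need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# Processed result: normalized metadata, the URL lives in the @target attribute.
result_xml = """
<biblStruct>
    <analytic>
        <ptr target="https://www.redalyc.org/articulo.oa?id=552656559008" />
    </analytic>
    <monogr>
        <idno type="ISSN">1390-3691</idno>
    </monogr>
</biblStruct>
"""

# Training data: inline mark-up wrapped around the exact input string.
training_xml = (
    '<bibl>(<biblScope unit="issue">6</biblScope>), '
    '<biblScope unit="page">115-122</biblScope>. '
    '<idno type="ISSN">ISSN: 1390-3691</idno>. '
    '<ptr type="web">https://www.redalyc.org/articulo.oa?id=552656559008</ptr>'
    "</bibl>"
)

result = ET.fromstring(result_xml)
url = result.find(".//ptr").get("target")            # URL read from an attribute
issn = result.find(".//idno[@type='ISSN']").text     # normalized identifier

bibl = ET.fromstring(training_xml)
# Dropping the tags gives back the raw reference text, character for character.
raw_reference = "".join(bibl.itertext())
```

In the result, the URL and the ISSN are clean values ready for a catalogue; in the training example, removing the mark-up must reproduce the input string exactly, which is why the `ISSN:` label and the URL appear as element text.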

rodyoukai commented 2 years ago

I understand. The ptr tag is clear to me now, and I appreciate the explanation of the difference between input and output data.

But the ISSN type on the idno tag is not working for me...

<biblStruct >
    <analytic>
        <title level="a" type="main">Las economías de las zonas metropolitanas de México en los albores del siglo xxi</title>
        <author>
            <persName>
                <forename type="first">Trejo</forename>
                <surname>Nieto</surname>
            </persName>
        </author>
        <author>
            <persName>
                <forename type="first">Alejandra</forename>
            </persName>
        </author>
        <ptr target="https://www.redalyc.org/articulo.oa?id=31230011001" />
    </analytic>
    <monogr>
        <title level="j">Estudios Demográficos y Urbanos</title>
        <idno type="ISSN">0186-7210</idno>
        <imprint>
            <biblScope unit="volume">28</biblScope>
            <biblScope unit="issue">3</biblScope>
            <biblScope unit="page" from="545" to="591" />
            <date type="published" when="2013" />
        </imprint>
    </monogr>
</biblStruct>

In your example (above) type parameter exists in idno tag...

kermitt2 commented 2 years ago

> In your example (above) type parameter exists in idno tag...

What is your input reference?

With the ISSN keyword (e.g. ISSN: 1390-3691.) it works normally, but without it (1390-3691.), the number is just recognized as a generic identifier. In general the ISSN is presented with the prefix (ISSN: 1390-3691.); I think all the cases in the current training data are like that.

rodyoukai commented 2 years ago

This is my query:

Azaola, Elena (2009). El comercio con el dolor y la esperanza. La extorsión telefónica en México. URVIO, Revista Latinoamericana de Estudios de Seguridad, (6), 115-122. ISSN: 1390-3691. https://www.redalyc.org/articulo.oa?id=552656559008

kermitt2 commented 2 years ago

With this input reference, the type ISSN appears with the current system (https://grobid.science-miner.com). In your training data, did you systematically include the ISSN prefix inside the <idno> field, for example:

... <idno type="ISSN">ISSN: 0186-7210</idno> ...

This is what is needed for the type of the identifier to be recognized.

rodyoukai commented 2 years ago

Yes, I do. This is an example:

<bibl>
<author>Vargas Reyes, Bryan, Ariza Santamaría, Rosembert</author> (<date>2020</date>). <title level="a">Liberación de la madre tierra: entre la legitimidad y los usos sociales de la ilegalidad</title>. <title level="j">Revista Estudios Socio-Jurídicos</title>, <biblScope unit="volume">22</biblScope>(<biblScope unit="issue" type="issue">1</biblScope>), <biblScope unit="page">203-232</biblScope>. ISSN: <idno type="issn">0124-0579</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=73362099007</ptr>
</bibl>
kermitt2 commented 2 years ago

In this example the ISSN: is outside the <idno> mark-up?

Should be:

<bibl>
<author>Vargas Reyes, Bryan, Ariza Santamaría, Rosembert</author> (<date>2020</date>). <title level="a">Liberación de la madre tierra: entre la legitimidad y los usos sociales de la ilegalidad</title>. <title level="j">Revista Estudios Socio-Jurídicos</title>, <biblScope unit="volume">22</biblScope>(<biblScope unit="issue" type="issue">1</biblScope>), <biblScope unit="page">203-232</biblScope>. <idno type="issn">ISSN: 0124-0579</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=73362099007</ptr>
</bibl>

https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/#identifiers
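
A check along these lines could help spot the same mistake across a whole set of training examples. This is only a sketch: `find_issn_prefix_outside_idno` is a hypothetical helper, it looks only at direct children of a single `<bibl>` element, and it assumes the examples are well-formed XML without namespaces.

```python
import xml.etree.ElementTree as ET


def find_issn_prefix_outside_idno(bibl_xml):
    """Return the text of each <idno> whose 'ISSN' label was left outside the tag."""
    bibl = ET.fromstring(bibl_xml)
    flagged = []
    # Text that sits between the previous tag (or the start of <bibl>) and the current child.
    preceding_text = bibl.text or ""
    for child in bibl:
        label = preceding_text.rstrip().upper()
        if child.tag == "idno" and label.endswith(("ISSN", "ISSN:")):
            flagged.append(child.text)
        preceding_text = child.tail or ""
    return flagged


# Shortened versions of the two annotations shown above.
wrong = (
    '<bibl><biblScope unit="page">203-232</biblScope>. '
    'ISSN: <idno type="issn">0124-0579</idno>. '
    '<ptr type="web">https://www.redalyc.org/articulo.oa?id=73362099007</ptr></bibl>'
)
fixed = (
    '<bibl><biblScope unit="page">203-232</biblScope>. '
    '<idno type="issn">ISSN: 0124-0579</idno>. '
    '<ptr type="web">https://www.redalyc.org/articulo.oa?id=73362099007</ptr></bibl>'
)

wrong_hits = find_issn_prefix_outside_idno(wrong)
fixed_hits = find_issn_prefix_outside_idno(fixed)
```

With the first annotation the helper reports the identifier `0124-0579`; with the corrected one it reports nothing.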

rodyoukai commented 2 years ago

I understand. Sorry, English is not my native language and sometimes I misunderstand things like this. I will retrain the model and check. Thanks for your time and patience.