kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Training Citation error #301

Open NapsterSL opened 6 years ago

NapsterSL commented 6 years ago

Hi Team,

Subject: Error when training the citation model.

Jones, R. L. & Turner, P. (2006) Teaching coaches to coach holistically: the case for a Problem-Based Learning (PBL) approach, Physical Education and Sport Pedagogy, 11, 2: 181–202.

Error found

Issue no. (2): not identified

Journal title (Physical Education and Sport Pedagogy): not identified

After training with the following XML added to the file grobid-trainer\resources\dataset\citation\corpus.citation.xml, both the issue number and the journal title were identified successfully.

Jones, R. L. & Turner, P.(2006) Teaching coaches to coach holistically: the case for a Problem-Based Learning (PBL) approach, Physical Education and Sport Pedagogy, 11, 2: 181-202.

But now, if we change the volume or issue number to a different value (say, '2' to '3'), the issue entity is not identified.

working : Jones, R. L. & Turner, P. (2006) Teaching coaches to coach holistically: the case for a Problem-Based Learning (PBL) approach, Physical Education and Sport Pedagogy, 11, 2: 181–202.

Not working : Jones, R. L. & Turner, P. (2006) Teaching coaches to coach holistically: the case for a Problem-Based Learning (PBL) approach, Physical Education and Sport Pedagogy, 11, 3: 181–202.

After just changing the issue number from '2' to '3', the issue entity is no longer identified.

Thanks

kermitt2 commented 6 years ago

Hello @NapsterSL !

If you add as training data exactly this:

Jones, R. L. &amp; Turner, P.(2006) <title level="a">Teaching coaches to coach holistically: the case for a Problem-Based Learning (PBL) approach</title>, <title level="j">Physical Education and Sport Pedagogy</title>, 11, 2: 181-202.

then you're telling the CRF that 2 is not an issue, 11 is not a volume, and so on. Leaving these text chunks without tags indicates that they are not to be labelled.
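To make this concrete, here is a minimal Python sketch that lists which chunks of an annotated entry carry a tag and which do not. It is only an illustration, not GROBID's actual training code: the `chunk_labels` helper, the `<bibl>` wrapper around the sample, and the `"other"` label name for untagged text are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def chunk_labels(bibl_xml):
    """Return (text, label) pairs for one annotated <bibl> element.

    Text outside any child element gets the placeholder label "other"
    (an assumed name, used here only for illustration).
    """
    root = ET.fromstring(bibl_xml)
    pairs = []
    # Text before the first tagged element is untagged.
    if root.text and root.text.strip():
        pairs.append((root.text.strip(), "other"))
    for child in root:
        label = child.tag
        kind = child.get("level") or child.get("type")
        if kind:
            label += ":" + kind
        pairs.append(("".join(child.itertext()).strip(), label))
        # Text between this element and the next one is untagged as well.
        if child.tail and child.tail.strip():
            pairs.append((child.tail.strip(), "other"))
    return pairs


# The training line quoted above, wrapped in <bibl> so that it parses.
sample = (
    '<bibl>Jones, R. L. &amp; Turner, P.(2006) '
    '<title level="a">Teaching coaches to coach holistically: the case for '
    'a Problem-Based Learning (PBL) approach</title>, '
    '<title level="j">Physical Education and Sport Pedagogy</title>, '
    '11, 2: 181-202.</bibl>'
)

for text, label in chunk_labels(sample):
    print(f"{label:10} {text}")
```

Only the two titles come out with a label; the authors, the date, and the trailing `11, 2: 181-202.` all fall into the untagged bucket, which is what the model would then learn from this entry.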

If you want to improve the citation model for your case, you could:

Hope this helps!

NapsterSL commented 6 years ago

Many thanks for the reply.

Sorry, we pasted the training XML in the query above, but some tags did not show up in the preview. Please note the following scenario.

We actually trained with the sample below:

<bibl><author>Jones, R. L. &amp; Turner, P</author>.(<date>2006</date>) <title level="a">Teaching coaches to coach holistically: the case for a Problem-Based Learning (PBL) approach</title>, <title level="j">Physical Education and Sport Pedagogy</title>, <biblScope type="vol">11</biblScope>, <biblScope type="issue">2</biblScope>: <biblScope >

And the result was correct with the following sample :

Working : Jones, R. L. & Turner, P. (2006) Teaching coaches to coach holistically: the case for a Problem-Based Learning (PBL) approach, Physical Education and Sport Pedagogy, 11, 2: 181–202.

Not working : Jones, R. L. & Turner, P. (2006) Teaching coaches to coach holistically: the case for a Problem-Based Learning (PBL) approach, Physical Education and Sport Pedagogy, 11, 3: 181–202.

Changing the issue number from '2' to '3' does not work. The issue number is identified only when it is '2' or '1'; for any other value it is not identified.

Please note that the journal title is working as expected. As suggested, we will try adding the journal title to the gazetteer file under grobid-home/lexicon/journals/journals.txt.

Thanks

kermitt2 commented 6 years ago

I would try to add more examples of this citation pattern (with different issue numbers).

This is apparently an unusual "issue" pattern for GROBID. There are now 6816 annotated examples in the training data, so a single new example carries very little weight: more than one example of the pattern is necessary for the model to generalize.
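As a rough sketch of how such variations could be produced from an entry that is already annotated, the Python snippet below copies the entry and swaps only the content of the issue element. The `issue_variants` helper is hypothetical, and the template is a shortened fragment for illustration, not a complete training entry. Real, distinct citations that follow the same pattern would probably be better training material than synthetic copies.

```python
import re

# Matches the issue element as it is written in the training sample above.
ISSUE_RE = re.compile(r'(<biblScope type="issue">)(.*?)(</biblScope>)')


def issue_variants(annotated_bibl, issues):
    """Return copies of annotated_bibl, one per value in issues,
    with only the content of the first issue element replaced."""
    if not ISSUE_RE.search(annotated_bibl):
        raise ValueError('no <biblScope type="issue"> element found')
    return [
        ISSUE_RE.sub(
            lambda m, value=str(issue): m.group(1) + value + m.group(3),
            annotated_bibl,
            count=1,
        )
        for issue in issues
    ]


# Shortened fragment, for illustration only.
template = (
    '<bibl><title level="j">Physical Education and Sport Pedagogy</title>, '
    '<biblScope type="vol">11</biblScope>, '
    '<biblScope type="issue">2</biblScope></bibl>'
)

for variant in issue_variants(template, [3, 4, 7, 12]):
    print(variant)
```

Each printed entry would still need to be checked by hand before being added to the corpus file.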

NapsterSL commented 6 years ago

Thank you so much for the reply. We will try to add some more patterns and check the same.