Retraining: Introduce new element tag for incremental training in GROBID

YangaPri commented 6 years ago

Hi,

I have a question about incremental training for bibliographic references in GROBID.

Is it possible to add new tag names inside a reference, beyond the tag names that already exist?

Eg.

DOI: <idno type="doi">10.1016/j.artmed.2010.12.004</idno>, PMID: <idno type="pmid">21232927</idno>

Here I added a PMID inside the reference, tagged in the same way as the DOI.

Can we add new element tags like this to references in the GROBID training data set?
After training on that data, will the model correctly identify the new elements under my newly introduced tag name?
And will the resulting tei.xml use the same tag name that I trained the model with?

Thank you in advance.

kermitt2 commented 6 years ago

Hello!

Unfortunately no, it's not that automated.

Right now the identifiers are all tagged internally with the same label (idno), and the identifier type in the training data is not yet exploited. However, we tried to add it for the most common identifiers (so <idno type="doi">, <idno type="arxiv">, ...) with the idea that we could add a specific classifier for this in the future if needed, or use it for evaluation.

The actual recognition of the identifier type is made with a regex (I am a bit ashamed because I like to say that I want to build a tool without a single manual rule), because identifiers all follow well-documented patterns and it works well.

Right now, only DOI and arXiv are recognized, but I could add PMID. In another project I used this kind of regex to cover the various cases:

pubmedPattern         : new RegExp('http.*\\/\\/.*ncbi\\.nlm\\.nih\\.gov.*\\/pubmed.*(\\/|=)([0-9]{4,12})', 'i'),
regexPMIDPattern      : new RegExp('(PubMed\\s?(ID\\s?:?|:)|PM\\s?ID)[\\s:\\/]?\\s*([0-9]{4,12})', 'gi'),
regexPrefixPMIDPattern: new RegExp('((PubMed\\s?(ID)?:?)|(PM\\s?ID))[\\s:\\/]*$', 'i'),
regexSuffixPMIDPattern: new RegExp('^\\s*[:\\/]?\\s*([0-9]{4,12})', 'i'),

This also assumes there are enough examples with a PMID in the training data for it to be recognized as an identifier (this might already be the case).
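For anyone who wants to try these patterns outside JavaScript, here is a minimal Python sketch. It ports two of the expressions quoted above (the URL pattern and the inline PMID pattern) as they are; the function name `find_pmid` is made up for illustration and is not part of GROBID.

```python
import re

# Python ports of pubmedPattern and regexPMIDPattern quoted above.
# re.IGNORECASE stands in for the JavaScript 'i' flag.
PUBMED_URL = re.compile(
    r"http.*//.*ncbi\.nlm\.nih\.gov.*/pubmed.*(/|=)([0-9]{4,12})",
    re.IGNORECASE,
)
PMID_INLINE = re.compile(
    r"(PubMed\s?(ID\s?:?|:)|PM\s?ID)[\s:/]?\s*([0-9]{4,12})",
    re.IGNORECASE,
)


def find_pmid(text):
    """Return the first PMID-looking number in text, or None.

    Hypothetical helper: it only checks the surface pattern and does
    not validate the number against PubMed.
    """
    match = PMID_INLINE.search(text)
    if match:
        return match.group(3)  # third group holds the digits
    match = PUBMED_URL.search(text)
    if match:
        return match.group(2)  # second group holds the digits
    return None


print(find_pmid("PMID: 21232927"))
print(find_pmid("https://www.ncbi.nlm.nih.gov/pubmed/21232927"))
```

Note that a bare number without a prefix or a PubMed URL is not matched, which fits the caveat about false positives.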

karatekaneen commented 5 years ago

I'm also looking to extract PMIDs. I haven't gotten around to digging into the code yet, but if you could give me an idea of where to start looking, I'll make a PR if I pull it off. Anyway, thanks for a great tool!

kermitt2 commented 5 years ago

As mentioned, the actual recognition of the identifier type is made with a regex on the generic extracted publication identifier, see: https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java#L1829

So the easy way to get an extracted PMID is to add the above-mentioned PMID regex, although it would certainly generate a lot of false positives without validation.

From this, you can use biblio-glutton to consolidate/correct/validate/complete the extracted metadata with the PMID/PMC ID in addition to the DOI.

The alternative is of course to classify/label the various types of identifiers with machine learning, but it would require a lot of labeling effort and I am not sure it would be more accurate.

de-code commented 4 years ago

I have a question relating to the idno tag. I noticed the generated training data never contains the type attribute but that might explain it. I am not sure whether the prefix (e.g. doi: or PMID:) should be included in the idno element. It seems to be a bit inconsistent. If it is not included and the type is not used, then it might be difficult to identify some of the ids?

kermitt2 commented 4 years ago

There's only one tag for all identifiers, to avoid increasing the number of labels when it's not necessary. The type of the identifier is then recognized by regex as post-processing. So it's good to keep the prefix in the labeled chunk in the training data, but it's not crucial because even without prefix, the identifiers are not really ambiguous.
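To make the "one label, then type by regex" idea concrete, here is a small Python sketch. The patterns and the function name `type_idno` are assumptions written for this example; they are not the expressions used in GROBID's BiblioItem.java.

```python
import re

# Illustrative patterns only (assumptions, not GROBID's actual regexes).
DOI = re.compile(r"10\.\d{4,9}/\S+")
ARXIV = re.compile(r"arxiv\s*:\s*(\S+)", re.IGNORECASE)
PMCID = re.compile(r"PMC\d+")
PMID = re.compile(
    r"(?:PubMed\s?(?:ID\s?:?|:)|PM\s?ID)[\s:/]?\s*([0-9]{4,12})",
    re.IGNORECASE,
)


def type_idno(chunk):
    """Guess (type, value) for the text of a generic idno chunk.

    Returns (None, chunk) when no pattern applies, i.e. the identifier
    stays untyped.
    """
    chunk = chunk.strip()
    match = DOI.search(chunk)
    if match:
        return "DOI", match.group(0)
    match = ARXIV.search(chunk)
    if match:
        return "arXiv", match.group(1)
    match = PMCID.search(chunk)
    if match:
        return "PMCID", match.group(0)
    match = PMID.search(chunk)
    if match:
        return "PMID", match.group(1)
    return None, chunk


print(type_idno("doi:10.1016/j.artmed.2010.12.004"))
print(type_idno("PMID: 21232927"))
print(type_idno("21232927"))
```

The last call shows why keeping the prefix in the labeled chunk helps: in this sketch a bare number carries no signal and is left untyped.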

de-code commented 4 years ago

There's only one tag for all identifiers, to avoid increasing the number of labels when it's not necessary. The type of the identifier is then recognized by regex as post-processing. So it's good to keep the prefix in the labeled chunk in the training data, but it's not crucial because even without prefix, the identifiers are not really ambiguous.

How would you identify PMIDs if the prefix is not included? I guess they are all numeric and currently no longer than 8 digits (but apparently 1 is also a valid PMID). What do you think of allowing the type identifier access to the previous text rather than including it in the idno element? Then the training data would match the final output more closely and one could change which tokens to include. Although it would also be good if the model was just able to output the type as well.

kermitt2 commented 4 years ago

In GROBID, PMID requires a prefix to be in the labeled field, otherwise the extracted identifier is not typed. In practice, I think there is always a PMID prefix around when a PMID is present in a bibliographical citation, because otherwise it would be ambiguous for a human reader, so it's probably not a real problem.

Sometimes the prefix is part of the "official" identifier pattern and sometimes it is not (for example, the prefix "PMC" is normally part of a PMC ID by definition), so it would be hard to generalize a systematic rule "with or without prefix" even for the final output.

I would say that keeping the prefix in the labeled chunk when it is present (the doi: or https://doi.org/ prefixes, for instance, are not always present), as is the case in the current training data, is the simple and safe general guideline for the annotators.

Yes, we could use the text around the identifier to try to type it as a fallback if looking at what is in the labeled chunk fails, but do we have a lot of incorrectly typed identifiers currently? This extra process might introduce false positives. What kind of data/data source could be used to test and evaluate this kind of approach?
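As a rough illustration of such a fallback, the Python sketch below ports the regexPrefixPMIDPattern quoted earlier in the thread and applies it to the text just before a digits-only chunk. The function name and the digits-only check are assumptions for this example; GROBID does not do this today.

```python
import re

# Python port of regexPrefixPMIDPattern quoted earlier in this thread:
# a PMID prefix sitting at the very end of the preceding text.
PMID_PREFIX_AT_END = re.compile(
    r"((PubMed\s?(ID)?:?)|(PM\s?ID))[\s:/]*$", re.IGNORECASE
)
# Assumption: a fallback candidate is a chunk made of 4 to 12 digits.
DIGITS_ONLY = re.compile(r"^[0-9]{4,12}$")


def type_with_context(preceding_text, chunk):
    """Return "PMID" if a digits-only chunk directly follows a PMID prefix.

    Hypothetical fallback: it looks at the text before the labeled
    chunk instead of at the chunk itself, and returns None otherwise.
    """
    chunk = chunk.strip()
    if DIGITS_ONLY.match(chunk) and PMID_PREFIX_AT_END.search(preceding_text):
        return "PMID"
    return None


print(type_with_context("e0200721. PubMed PMID: ", "30052644"))
print(type_with_context("2014;", "158849"))
```

Requiring the prefix to sit immediately before the chunk keeps the fallback conservative, but it would still need an evaluation set to measure the false positives it introduces.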

de-code commented 4 years ago

Sometimes prefix are part of the "official" identifier pattern or not (for example for a PMC identifier but the prefix "PMC" is normally part of a PMC ID by definition), so it would be hard to generalize a systematic rule "with or without prefix" even for the final output.

I guess I am thinking that we could train the model to learn what the identifier itself should be. e.g. just the digits for the PubMed ID, but include the PMC for PubMed Central.

I am also looking at it from the point of view of training data generation. The target XML, which is my training data source, will generally contain just the identifier itself, without the prefix. To generate the training data I need to assume what the prefix might be. I did that and it seems to be working okay, but I didn't look at every example.
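A minimal Python sketch of that prefix assumption could look like the following. The list of prefixes and the function name `span_with_prefix` are hypothetical and only cover the forms seen in this thread; this is not the actual auto-annotation code.

```python
import re

# Assumed prefixes that may directly precede an identifier in the
# reference text (hypothetical list, limited to forms in this thread).
PREFIX_BEFORE = re.compile(
    r"(?:doi\s*:"
    r"|https?://(?:dx\.)?doi\.org/"
    r"|(?:PubMed\s*)?PMID\s*:"
    r"|(?:PubMed\s+Central\s*)?PMCID\s*:)\s*$",
    re.IGNORECASE,
)


def span_with_prefix(text, identifier):
    """Return the identifier as found in text, extended left over a prefix.

    Returns None when the identifier does not occur in the text.
    """
    start = text.find(identifier)
    if start < 0:
        return None
    end = start + len(identifier)
    prefix = PREFIX_BEFORE.search(text[:start])
    if prefix:
        start = prefix.start()
    return text[start:end]


reference = (
    "PLoS One. 2018;13(7):e0200721. doi: 10.1371/journal.pone.0200721. "
    "PubMed PMID: 30052644; PubMed Central PMCID: PMC6063408."
)
print(span_with_prefix(reference, "10.1371/journal.pone.0200721"))
print(span_with_prefix(reference, "30052644"))
```

The span returned here is what would be wrapped in the idno element when the guideline is "keep the prefix when it is present".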

Yes we could use the text around the identifier to try to type it as fallback if looking at what is in the labeled chunk fails, but do we have a lot of incorrectly typed identifiers currently? This extra process might introduce false positive. What kind of data/data source could be used to test and evaluate this kind of approach?

I am using the bioRxiv data as the source. The evaluation would be end-to-end, I guess. Let's say there is still room for improvement. DOI is mostly okay (~0.75) but frequently contains extra text. PMID is a bit worse (~0.65). PMCID is quite bad (<0.1), although PMCIDs are often not annotated well. For some reason, some author templates seem to generate the PMC prefix twice (e.g. PMCPMC1234567), but I don't think that's the major contributor. Hence I am re-training the model.

kermitt2 commented 4 years ago

I quickly looked at the bioRxiv data, the test set with 2000 documents:

there are 2193 PMIDs in the references, mostly without prefix, but not always: 175 cases include the prefix
there are a few cases like this that jump out, but they are hard to quantify:

<pub-id pub-id-type="doi">pmid:23840310</pub-id>

807 PMC identifiers, they all include the prefix, but 640 (79%) have the incorrect pattern in the XML:

<pub-id pub-id-type="pmcid">PMCPMC4093851</pub-id>

So this can't be typed by grobid. When they are normal, they seem relatively well recognized by Grobid.
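Both anomalies can be detected mechanically. The Python sketch below shows one possible clean-up; the helper names are made up for this example and it is not code from GROBID or from the bioRxiv tooling.

```python
import re


def normalize_pmcid(value):
    """Collapse a repeated 'PMC' prefix: 'PMCPMC4093851' -> 'PMC4093851'."""
    # One or more leading 'PMC' that are followed by a digit become one.
    return re.sub(r"^(?:PMC)+(?=\d)", "PMC", value.strip())


def looks_mistyped(pub_id_type, value):
    """Flag a pub-id whose text starts with another identifier's prefix.

    Example: pub-id-type="doi" with content "pmid:23840310".
    """
    match = re.match(r"\s*(doi|pmid|pmcid)\s*:", value, re.IGNORECASE)
    return bool(match) and match.group(1).lower() != pub_id_type.lower()


print(normalize_pmcid("PMCPMC4093851"))
print(looks_mistyped("doi", "pmid:23840310"))
```

Whether a duplicated prefix should be repaired in the gold XML or only at evaluation time is a separate decision; the sketch only shows that the pattern is easy to spot.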

What's a bit crazy is that in the same document we can have both PMCPMC and correct PMC pattern:

[Screenshot from 2020-09-08 11-50-17]

18,894 references with a DOI, and they all appear to exclude the DOI prefix

We could add more training data with identifiers in the Grobid citation model (now we have only 32 instances with PMID and 2 instances with PMCID, out of more than 8000 examples).
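Counts like these can be obtained with a few lines of Python. The sketch below tallies idno elements by their type attribute in a small inline sample; the `listBibl` wrapper, the sample content, and the function name are assumptions for illustration, and a real training file would be read from disk instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up sample in the style of the citation training data.
SAMPLE = """<listBibl>
  <bibl>Cell. 2014;158: 849-860. doi: <idno type="DOI">10.1016/j.cell.2014.05.050</idno></bibl>
  <bibl>PLoS One. 2018. <idno type="DOI">10.1371/journal.pone.0200721</idno> <idno type="PMID">30052644</idno> <idno type="PMC">PMC6063408</idno></bibl>
  <bibl>An untyped identifier: <idno>21232927</idno></bibl>
</listBibl>"""


def count_idno_types(xml_text):
    """Count idno elements per type attribute ('untyped' when absent)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # Ignore a namespace prefix such as '{http://www.tei-c.org/ns/1.0}'.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "idno":
            counts[element.get("type", "untyped")] += 1
    return counts


print(count_idno_types(SAMPLE))
```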

The problem with DOI recognition is that it's sometimes hard for the model to know where to stop the DOI field, because there are often spaces inserted around the . of the DOI string. It's the same for URLs: extra text is sometimes added at the end of the extracted URL because the string itself is "dirty". Not sure how to improve that.
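One partial mitigation is to tidy the extracted string afterwards. The Python sketch below is a heuristic under a strong assumption: the chunk contains only the DOI, so whitespace around "." and "/" can be dropped safely. It does not help the model decide where the field ends, and the function name is made up for this example.

```python
import re


def tidy_doi(raw):
    """Remove whitespace around '.' and '/' and strip a trailing period.

    Heuristic sketch: only safe when the chunk holds nothing but a DOI,
    since it would also glue unrelated trailing text onto the DOI.
    """
    doi = re.sub(r"\s*([./])\s*", r"\1", raw.strip())
    return doi.rstrip(".")


print(tidy_doi("10.1371 /journal .pone.0200721"))
print(tidy_doi("10.1093/hmg/ddu328."))
```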

kermitt2 commented 4 years ago

@de-code Coming back to this, I added the DOI, PMID, and PMC ID fields in the bibliographic reference results for the bioRxiv test set in the end-to-end evaluation. I obtained scores similar to the ones you indicated.

= Ratcliff/Obershelp Matching = (Minimum Ratcliff/Obershelp similarity at 0.95)

===== Field-level results =====

label                accuracy     precision    recall       f1           support

doi                  99.46        80.43        70.82        75.32        1381   
pmcid                99.81        33.33        2.61         4.84         115    
pmid                 99.55        41.42        70.71        52.24        198
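To give a feel for what a 0.95 similarity threshold tolerates, here is a Python sketch using difflib.SequenceMatcher from the standard library. Its algorithm is closely related to Ratcliff/Obershelp but is not guaranteed to give the same values as the implementation used in the GROBID evaluation, so treat the result as an approximation; the function name is made up.

```python
from difflib import SequenceMatcher


def similar_enough(predicted, expected, threshold=0.95):
    """True when the similarity ratio reaches the threshold.

    difflib's ratio is 2 * M / T, where M is the number of matched
    characters and T the total length of both strings.
    """
    return SequenceMatcher(None, predicted, expected).ratio() >= threshold


# A trailing period alone still passes at 0.95 ...
print(similar_enough("10.1093/hmg/ddu328.", "10.1093/hmg/ddu328"))
# ... but a "doi :" prefix left inside the field does not.
print(similar_enough("doi :10.1016/j.ygcen.2014.12.012",
                     "10.1016/j.ygcen.2014.12.012"))
```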

However, looking at the error cases, and skipping the PMC ID for the moment (given that 79% have an invalid pattern, they won't be recognized by grobid anyway, and with only 2 examples in the training data for the moment, it's not significant enough):

Most of the DOI errors do not appear to be due to extra text. They are due to the fact that Grobid will add DOIs when they are indicated only as PDF annotations (URL links), and these are not encoded in the JATS document. For example, Grobid finds 24 DOIs in 156182v1.pdf (they are correct), but they are not encoded in the gold XML, so they are counted as errors; in 145334v1.pdf, there are almost 150 DOIs via GROBID and 0 in gold.
For PMID, it seems that some XML files do not encode PMIDs when present in the text, for example 353227v1.pdf (30 PMIDs correctly found by Grobid, 0 encoded in the XML), 380675v1.pd (82 PMIDs by Grobid, 0 in XML). So they are all counted wrong, although they are correct. That would be an issue if this JATS mark-up is used for training a model.

Given the low number of identifiers in general (see the support above), this gives a wrong picture of the precision when bioRxiv is used for eval. The recall is indeed not wonderful but I think it's a matter of training data.
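The effect on precision is easy to see with a little arithmetic. The Python sketch below first checks that the F1 column of the table is the harmonic mean of precision and recall, then shows how correct predictions that are missing from the gold XML drag precision down. The counts in the second part (70 matched, 30 unmatched) are hypothetical; only the "30 PMIDs found, 0 in gold" situation comes from the thread.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision(true_positives, false_positives):
    """Share of predictions that are counted as correct."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


# Recompute the F1 column from the precision and recall columns.
for label, p, r in [("doi", 80.43, 70.82),
                    ("pmcid", 33.33, 2.61),
                    ("pmid", 41.42, 70.71)]:
    print(label, round(f1(p, r), 2))

# Hypothetical counts: 70 predictions match the gold XML, and 30 more
# are correct but absent from gold, so they are scored as errors.
print(precision(70, 30))   # as scored against incomplete gold data
print(precision(100, 0))   # if the gold data encoded all identifiers
```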

So the next step anyway: add more examples to the training data for PMID and PMC ID :)

de-code commented 4 years ago

Thank you for getting back on that.

Most of the DOI errors do not appear to be the extra text, it's due to the fact that Grobid will add DOI when they are indicated only as PDF annotations (url link). However, they are not encoded in the JATS document. For example, Grobid finds 24 DOI in 156182v1.pdf (they are correct), but it's not encoded in the gold xml so it will be counted as error, or in 145334v1.pdf, almost 150 DOI via GROBID, 0 in gold.

Yes, I was going to normalise that for the next step. Also for the PMC IDs, which should be fairly straightforward to fix. The PMIDs might be a bit more challenging without making too many assumptions.

I am not yet sure how exactly to do that. I could do that as part of the training data generation and evaluation. I guess the right way would be to generate "fixed" bioRxiv XML and share that. It's just a bit more effort.

For PMID, it seems that some XML files do not encode PMID when present in the text, for example 353227v1.pdf (30 PMID correctly found by Grobid, 0 encoded in the XML), 380675v1.pd (82 PMID by Grobid, 0 in XML). So they are all counted wrong, although they are correct. That would be an issue if this JATS mark-up are used for training a model.

Okay, I also found an example in the validation set, 277335v1 where the first PMID was just copied everywhere. That could be slightly more challenging to fix automatically.

Given the low number of identifiers in general (see the support above), this gives a wrong picture of the precision when bioRxiv is used for eval. The recall is indeed not wonderful but I think it's a matter of training data.

So next step anyway, add more example in the training data for PMID and PMC ID :)

If it's any help, I try to dump new generated training data to: https://github.com/elifesciences/sciencebeam-datasets/releases/tag/v0.0.1

So you could grab some of them from there perhaps. But I will add updated examples and it may not annotate all of the fields that I am not evaluating.

kermitt2 commented 4 years ago

@de-code follow-up... thank you for the pre-annotated references! I added 100 examples from the train set dump with DOI and/or PMID and PMC ID. I kept a few cases with the PMCPMC* pattern. After re-training, the results are now:

label                accuracy     precision    recall       f1           support

doi                  99.34        78.03        78.17        78.1         16893  
pmcid                99.94        65.06        58.61        61.67        807    
pmid                 99.85        66.81        67.22        67.02        2093

About the support, I forgot that I used only 10% above to speed up the eval; here we have the full eval set. I am not sure that these numbers can really go up with the current evaluation data: there are simply too many errors or missing items in the current JATS. DOI, for instance, is not really improved.

About the pre-annotated references, I used the grobid model ones. They speed up the manual annotation of course, but as they are, they still require a lot of time to correct. They all had at least one error, usually several. The other problem is that the space characters are really messed up, and it takes time to fix that (is it using Grobid training data generation?).

de-code commented 4 years ago

About the pre-annotated references, I used the grobid model ones. They speed-up the manual annotations of course, but as they are, they still require a lot of time to be corrected. They were all having at least one error, usually several.

What kind of errors? Are they correct in the corresponding JATS XML? I could see whether the auto-annotation can be improved or obvious error cases could be filtered out.

The other problem is that the space characters are really messed up, and it takes time to fix that (is it using Grobid training data generation ?).

Do you have an example? They should be all generated from GROBID. The auto-annotation shouldn't change any spaces, but it does read and write out the XML with an opportunity for bugs to creep in. I do have the files as they came out of GROBID which I could dig out.

kermitt2 commented 4 years ago

What kind of errors? Are they correct in the corresponding JATS XML? I could see whether the auto-annotation can be improved or obvious error cases could be filtered out.

Maybe I misunderstood something when using the file. I used the file 2020-08-26-biorxiv-10k-references-citation-train-1890-auto-v0.0.9-grobid-corpus-tei.zip and I looked only at the references with DOI or PMID. Most common errors are missing volume, for example:

<bibl><author>Davey Smith G, Hemani G.</author> <title level="a">Mendelian randomization: genetic anchors for causal inference inepidemiological studies</title> .<title level="j">Hum Mol Genet .</title><date>2014</date> ;23:<biblScope unit="page">R89-98</biblScope>.doi :<idno type="DOI">10.1093/hmg/ddu328.</idno></bibl>

missing date:

<bibl><author>NishioM , SugiyamaO , YakamiM , UenoS , KuboT , KurodaT , etal .</author> <title level="a">Computer- aideddiagnosisof lung nodule classificationbetween benign nodule , primary lungcancer , and metastaticlung cancer at different image size usingdeep convolutional neural network with transfer learning</title> . <title level="j">PLoS One</title> . 2018; <biblScope unit="volume">13</biblScope>( <biblScope unit="issue">7</biblScope>):<biblScope unit="page">e0200721</biblScope>.doi:<idno type="DOI">10.1371 /journal .pone.0200721</idno>.PubMedPMID:<idno type="PMID">30052644</idno>; PubMed CentralPMCID :<idno type="PMC">PMCPMC6063408</idno> .</bibl>

<bibl><author>Deng W, Rupon JW, Krivega I, Breda L, Motta I, Jahn KS, et al.</author> <title level="a">Reactivation of developmentally silenced globin genes by forced chromatin looping</title>. <title level="j">Cell</title>. 2014;158: <biblScope unit="page">849-860</biblScope>.doi :<idno type="DOI">10.1016/j.cell.2014.05.050</idno></bibl>

From the point of view of the guidelines, I added the identifier prefix in the labeled part. As more of a detail, and as illustrated by the first example, we always have the final period inside the DOI field.

There are some missing authors for one of the reference "templates" (apparently there are just a few templates for all the bioRxiv references? do they all use the same submission system?):

<bibl>BRAZ, J., SOLORZANO, C., WANG, X. &amp; BASBAUM, A. I. <date>2014</date>. <title level="a">Transmitting Pain and Itch Messages: A ContemporaryView of the Spinal Cord Circuits that Generate Gate Control</title> .<title level="j">Neuron</title> ,<biblScope unit="volume">82</biblScope> ,<biblScope unit="page">522 -536</biblScope>.doi: <idno type="DOI">S0896-6273(14)00023-3[pii ];10.1016/j.neuron.2014.01.018[doi ]</idno></bibl>

Usually things go bad when a PII is around, as above.

Then there are all the long tail problems, with missing "publication place", missing "consortium" specific annotation, confusion between volume/issue, or as below error for the journal/publisher - when we sum all of them, it's relatively frequent and everything must be carefully reviewed:

<bibl><author>Wong T-T, Zohar Y.</author> <title level="a">Production of reproductively sterile fish by a non-transgenic gene silencingtechnology .Sci Rep</title> .<title level="j">Nature Publishing Group</title> ;2015 ;1 -<biblScope unit="page">6</biblScope>.<idno type="DOI">doi :10.1016/j.ygcen.2014.12.012</idno></bibl>

Looking at a few corresponding JATS files, they look good (as far as this particular JATS flavor can be good, and ignoring the encoding of the DOI in the second one):

<ref id="c12"><mixed-citation publication-type="journal"><string-name><surname>Davey Smith</surname> <given-names>G</given-names></string-name>, <string-name><surname>Hemani</surname> <given-names>G.</given-names></string-name> <article-title>Mendelian randomization: genetic anchors for causal inference in epidemiological studies</article-title>. <source>Hum Mol Genet.</source> <year>2014</year>;<volume>23</volume>:<fpage>R89</fpage>&#x2013;<lpage>98</lpage>. doi:<pub-id pub-id-type="doi">10.1093/hmg/ddu328</pub-id>.</mixed-citation></ref>

<ref id="c3"><mixed-citation publication-type="journal"><string-name><surname>Braz</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Solorzano</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>X.</given-names></string-name> &#x0026; <string-name><surname>Basbaum</surname>, <given-names>A. I.</given-names></string-name> <year>2014</year>. <article-title>Transmitting Pain and Itch Messages: A Contemporary View of the Spinal Cord Circuits that Generate Gate Control</article-title>. <source>Neuron</source>, <volume>82</volume>, <fpage>522</fpage>-<lpage>536</lpage>. doi:<pub-id pub-id-type="doi">S0896-6273(14)00023-3 [pii];10.1016/j.neuron.2014.01.018 [doi]</pub-id></mixed-citation></ref>

<ref id="c32"><mixed-citation publication-type="journal"><string-name><surname>Nishio</surname> <given-names>M</given-names></string-name>, <string-name><surname>Sugiyama</surname> <given-names>O</given-names></string-name>, <string-name><surname>Yakami</surname> <given-names>M</given-names></string-name>, <string-name><surname>Ueno</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kubo</surname> <given-names>T</given-names></string-name>, <string-name><surname>Kuroda</surname> <given-names>T</given-names></string-name>, <etal>et al.</etal> <article-title>Computer-aided diagnosis of lung nodule classification between benign nodule, primary lung cancer, and metastatic lung cancer at different image size using deep convolutional neural network with transfer learning</article-title>. <source>PLoS One</source>. <year>2018</year>;<volume>13</volume>(<issue>7</issue>):<fpage>e0200721</fpage>. doi: <pub-id pub-id-type="doi">10.1371/journal.pone.0200721</pub-id>. PubMed PMID: <pub-id pub-id-type="pmid">30052644</pub-id>; PubMed Central PMCID: <pub-id pub-id-type="pmcid">PMCPMC6063408</pub-id>.</mixed-citation></ref>

Do you have an example? They should be all generated from GROBID. The auto-annotation shouldn't change any spaces, but it does read and write out the XML with an opportunity for bugs to creep in. I do have the files as they came out of GROBID which I could dig out.

All the above examples have issues with space preservation, and it's fairly systematic across all of them. It can be really destructive for abbreviations... so the problem is surely related to the very old trainingExtraction for the citation model in Grobid, which would need to be reviewed/rewritten (I haven't touched the training data for the citation model in years, so maybe I just forgot it was so bad, sorry :D ).

In practice, I manually fixed the space characters for all 100 new examples, which was quite time consuming of course. The new examples are here: https://github.com/kermitt2/grobid/blob/master/grobid-trainer/resources/dataset/citation/corpus/bioRxiv-ids.training.references.tei.xml

de-code commented 4 years ago

What kind of errors? Are they correct in the corresponding JATS XML? I could see whether the auto-annotation can be improved or obvious error cases could be filtered out.

Maybe I misunderstood something when using the file. I used the file 2020-08-26-biorxiv-10k-references-citation-train-1890-auto-v0.0.9-grobid-corpus-tei.zip

Actually there is now also 2020-09-07-biorxiv-10k-references-citation-train-1890-auto-v0.0.9.1-idno-prefix-grobid-corpus-tei.zip. The only difference is probably that it is meant to include the prefix for the idno elements. e.g. doi:.

Since it is doing the annotation automatically, I could try and fix potential bugs and run it again (or run it again after a change to GROBID might have changed things like the spacing). This is as long as the JATS XML is good enough or obviously wrong cases could be filtered out from the training data.

and I looked only at the references with DOI or PMID. Most common errors are missing volume, for example:

<bibl><author>Davey Smith G, Hemani G.</author> <title level="a">Mendelian randomization: genetic anchors for causal inference inepidemiological studies</title> .<title level="j">Hum Mol Genet .</title><date>2014</date> ;23:<biblScope unit="page">R89-98</biblScope>.doi :<idno type="DOI">10.1093/hmg/ddu328.</idno></bibl>

Just for my own reference, found this in 298687v1 with the following JATS XML:

<ref id="c12"><label>12.</label><mixed-citation publication-type="journal"><string-name><surname>Davey Smith</surname> <given-names>G</given-names></string-name>, <string-name><surname>Hemani</surname> <given-names>G.</given-names></string-name> <article-title>Mendelian randomization: genetic anchors for causal inference in epidemiological studies</article-title>. <source>Hum Mol Genet.</source> <year>2014</year>;<volume>23</volume>:<fpage>R89</fpage>&#x2013;<lpage>98</lpage>. doi:<pub-id pub-id-type="doi">10.1093/hmg/ddu328</pub-id>.</mixed-citation></ref>

The JATS XML seems to have the volume annotated. I will need to check why it hasn't been applied.

EDIT: This is because it wasn't using ; as a separator and for very short sequences the alignment requires token level alignment (for longer sequences it's character level).

missing date:


<bibl><author>NishioM , SugiyamaO , YakamiM , UenoS , KuboT , KurodaT , etal .</author> <title level="a">Computer- aideddiagnosisof lung nodule classificationbetween benign nodule , primary lungcancer , and metastaticlung cancer at different image size usingdeep convolutional neural network with transfer learning</title> . <title level="j">PLoS One</title> . 2018; <biblScope unit="volume">13</biblScope>( <biblScope unit="issue">7</biblScope>):<biblScope unit="page">e0200721</biblScope>.doi:<idno type="DOI">10.1371 /journal .pone.0200721</idno>.PubMedPMID:<idno type="PMID">30052644</idno>; PubMed CentralPMCID :<idno type="PMC">PMCPMC6063408</idno> .</bibl>

Found that in 448159v1 with the following JATS XML:

<ref id="c32"><label>32.</label><mixed-citation publication-type="journal"><string-name><surname>Nishio</surname> <given-names>M</given-names></string-name>, <string-name><surname>Sugiyama</surname> <given-names>O</given-names></string-name>, <string-name><surname>Yakami</surname> <given-names>M</given-names></string-name>, <string-name><surname>Ueno</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kubo</surname> <given-names>T</given-names></string-name>, <string-name><surname>Kuroda</surname> <given-names>T</given-names></string-name>, <etal>et al.</etal> <article-title>Computer-aided diagnosis of lung nodule classification between benign nodule, primary lung cancer, and metastatic lung cancer at different image size using deep convolutional neural network with transfer learning</article-title>. <source>PLoS One</source>. <year>2018</year>;<volume>13</volume>(<issue>7</issue>):<fpage>e0200721</fpage>. doi: <pub-id pub-id-type="doi">10.1371/journal.pone.0200721</pub-id>. PubMed PMID: <pub-id pub-id-type="pmid">30052644</pub-id>; PubMed Central PMCID: <pub-id pub-id-type="pmcid">PMCPMC6063408</pub-id>.</mixed-citation></ref>

<bibl><author>Deng W, Rupon JW, Krivega I, Breda L, Motta I, Jahn KS, et al.</author> <title level="a">Reactivation of developmentally silenced globin genes by forced chromatin looping</title>. <title level="j">Cell</title>. 2014;158: <biblScope unit="page">849-860</biblScope>.doi :<idno type="DOI">10.1016/j.cell.2014.05.050</idno></bibl>

Found that in 372664v1 with the following JATS XML:

<ref id="c39"><label>39.</label><mixed-citation publication-type="journal"><string-name><surname>Deng</surname> <given-names>W</given-names></string-name>, <string-name><surname>Rupon</surname> <given-names>JW</given-names></string-name>, <string-name><surname>Krivega</surname> <given-names>I</given-names></string-name>, <string-name><surname>Breda</surname> <given-names>L</given-names></string-name>, <string-name><surname>Motta</surname> <given-names>I</given-names></string-name>, <string-name><surname>Jahn</surname> <given-names>KS</given-names></string-name>, <etal>et al.</etal> <article-title>Reactivation of developmentally silenced globin genes by forced chromatin looping</article-title>. <source>Cell</source>. <year>2014</year>;<volume>158</volume>: <fpage>849</fpage>&#x2013;<lpage>860</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cell.2014.05.050</pub-id></mixed-citation></ref>

From the point of view of the guidelines, I added the identifier prefix in the labeled part.

Okay, yes, I had generated new data with the prefix.

More a detail, as illustrated by the first example, we always have the final period inside the DOI field.

That seems to be another "bug" in the auto-annotation.

There are some missing authors for one of the reference "template" (apparently there are just a few templates for all the bioRxiv references? they all use the same submission system?):

<bibl>BRAZ, J., SOLORZANO, C., WANG, X. &amp; BASBAUM, A. I. <date>2014</date>. <title level="a">Transmitting Pain and Itch Messages: A ContemporaryView of the Spinal Cord Circuits that Generate Gate Control</title> .<title level="j">Neuron</title> ,<biblScope unit="volume">82</biblScope> ,<biblScope unit="page">522 -536</biblScope>.doi: <idno type="DOI">S0896-6273(14)00023-3[pii ];10.1016/j.neuron.2014.01.018[doi ]</idno></bibl>

Found that in 344945v1 with the following JATS XML:

<ref id="c3"><mixed-citation publication-type="journal"><string-name><surname>Braz</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Solorzano</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>X.</given-names></string-name> &#x0026; <string-name><surname>Basbaum</surname>, <given-names>A. I.</given-names></string-name> <year>2014</year>. <article-title>Transmitting Pain and Itch Messages: A Contemporary View of the Spinal Cord Circuits that Generate Gate Control</article-title>. <source>Neuron</source>, <volume>82</volume>, <fpage>522</fpage>-<lpage>536</lpage>. doi:<pub-id pub-id-type="doi">S0896-6273(14)00023-3 [pii];10.1016/j.neuron.2014.01.018 [doi]</pub-id></mixed-citation></ref>

EDIT: this was because it wasn't matching case-insensitive at that level

Usually things go bad when a PII is around, as above.

Then there are all the long tail problems, with missing "publication place", missing "consortium" specific annotation, confusion between volume/issue, or as below error for the journal/publisher - when we sum all of them, it's relatively frequent and everything must be carefully reviewed:
<bibl><author>Wong T-T, Zohar Y.</author> <title level="a">Production of reproductively sterile fish by a non-transgenic gene silencingtechnology .Sci Rep</title> .<title level="j">Nature Publishing Group</title> ;2015 ;1 -<biblScope unit="page">6</biblScope>.<idno type="DOI">doi :10.1016/j.ygcen.2014.12.012</idno></bibl>

Found that in 429449v1 with the following JATS XML:

<ref id="c56"><label>56.</label><mixed-citation publication-type="journal"><string-name><surname>Wong</surname> <given-names>T-T</given-names></string-name>, <string-name><surname>Zohar</surname> <given-names>Y</given-names></string-name>. <article-title>Production of reproductively sterile fish by a non-transgenic gene silencing technology. Sci Rep</article-title>. <source>Nature Publishing Group</source>; <year>2015</year>; <fpage>1</fpage>&#x2013;<lpage>6</lpage>. <pub-id pub-id-type="doi">doi:10.1016/j.ygcen.2014.12.012</pub-id></mixed-citation></ref>

Do you have an example? They should be all generated from GROBID. The auto-annotation shouldn't change any spaces, but it does read and write out the XML with an opportunity for bugs to creep in. I do have the files as they came out of GROBID which I could dig out.

All the above examples have issues with space preservation, but it's a bit systematic in all of them anyway. It can be really destructive for abbreviations... so the problem is surely related to the very old trainingExtraction for the citation model in Grobid, which would need to be reviewed/rewritten (I haven't touched the training data for the citation model in years, so maybe I just forgot it was so bad, sorry :D ).

I thought it's best to keep the text as is. But you seem to suggest that what GROBID will actually pass to the model later on will have fewer spacing issues?

In any case I will try to fix the issues with the auto-annotation you raised (the ones that can be fixed).

In practice, I manually fixed the space characters for all 100 new examples, which was quite time consuming of course. The new examples are here: https://github.com/kermitt2/grobid/blob/master/grobid-trainer/resources/dataset/citation/corpus/bioRxiv-ids.training.references.tei.xml

That is great, thank you.

de-code commented 4 years ago

Do you have an idea how much effort it might be to fix the spacing issue in the training data generation? And is that something only you are likely able to do?

de-code commented 4 years ago

Regarding the inclusion of the idno "label": as you mentioned, the PII is messing things up.

In your XML you have annotated it like this:

doi:<idno type="PII">S0304-3959(13)00330-8</idno>[pii]; <idno type="DOI">10.1016/j.pain.2013.06.022</idno>[doi]

Which obviously makes sense because the PII is not a DOI. But at the same time it means that here you are not including doi:. in the idno field. Would that not be an argument for not including the label in general?

kermitt2 commented 4 years ago

Do you have an idea how much effort it might be to fix the spacing issue in the training data generation? And is that something only you are likely able to do?

It's fixed. Rather quick and dirty but it looks good now (I just copied what was done elsewhere).

I also added the "serials" field (for things like "Lecture Notes in Computer Science"), which was incorrectly merged with journals before (old quick and dirty). The model needs to be updated for that field so I launched a retraining.

But at the same time it means that here you are not including doi:. in the idno field. Would that not be an argument for not including the label in general?

Well it's really an error from the bioRxiv reference formatter, I don't include it because we can't :) In 99.9% of the cases we can add the prefix when it is present.

kermitt2 commented 4 years ago

Looking at one of the examples following the space preservation fix (and with the added training data):

Before:

<bibl><author>NishioM , SugiyamaO , YakamiM , UenoS , KuboT , KurodaT , etal .</author> <title level="a">Computer- aideddiagnosisof lung nodule classificationbetween benign nodule , primary lungcancer , and metastaticlung cancer at different image size usingdeep convolutional neural network with transfer learning</title> . <title level="j">PLoS One</title> . 2018; <biblScope unit="volume">13</biblScope>( <biblScope unit="issue">7</biblScope>):<biblScope unit="page">e0200721</biblScope>.doi:<idno type="DOI">10.1371 /journal .pone.0200721</idno>.PubMedPMID:<idno type="PMID">30052644</idno>; PubMed CentralPMCID :<idno type="PMC">PMCPMC6063408</idno> .</bibl>

Now:

<bibl><author>Nishio M, Sugiyama O, Yakami M, Ueno S, Kubo T, Kuroda T, et al.</author> <title level="a">Computer-aided diagnosis of lung nodule classification between benign nodule, primary lung cancer, and metastatic lung cancer at different image size using deep convolutional neural network with transfer learning</title>. <title level="j">PLoS One</title>. <date>2018</date>;<biblScope unit="volume">13</biblScope>(<biblScope unit="issue">7</biblScope>):<biblScope unit="page">e0200721</biblScope>. <idno>doi: 10.1371/journal.pone.0200721</idno>. PubMed <idno>PMID: 30052644</idno>; PubMed Central <idno>PMCID: PMCPMC6063408</idno>.</bibl>
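A quick way to check space preservation on such examples is to strip the tags and compare the result with the raw reference string. The following is only a minimal sketch: `annotated_text` and `preserves_text` are hypothetical helper names, not part of GROBID, and the two `<bibl>` strings are shortened from the Before/Now example above.

```python
import xml.etree.ElementTree as ET


def annotated_text(bibl_xml: str) -> str:
    """Return the text of an annotated reference with all tags stripped."""
    root = ET.fromstring(bibl_xml)
    # itertext() yields element text and the tails of nested elements
    # in document order, so every space outside a tag is kept.
    return "".join(root.itertext())


def preserves_text(bibl_xml: str, raw_reference: str) -> bool:
    """True if the annotation changed no character, spaces included."""
    return annotated_text(bibl_xml) == raw_reference


# Raw reference string as it appears in the document (shortened).
raw = "PLoS One. 2018;13(7):e0200721. doi: 10.1371/journal.pone.0200721."

# Shortened from the "Now" example.
good = (
    '<bibl><title level="j">PLoS One</title>. <date>2018</date>;'
    '<biblScope unit="volume">13</biblScope>(<biblScope unit="issue">7</biblScope>):'
    '<biblScope unit="page">e0200721</biblScope>. '
    '<idno>doi: 10.1371/journal.pone.0200721</idno>.</bibl>'
)

# Shortened from the "Before" example, with the spacing damage left in.
bad = (
    '<bibl><title level="j">PLoS One</title> . 2018; '
    '<biblScope unit="volume">13</biblScope>( <biblScope unit="issue">7</biblScope>):'
    '<biblScope unit="page">e0200721</biblScope>.doi:'
    '<idno type="DOI">10.1371 /journal .pone.0200721</idno>.</bibl>'
)

good_ok = preserves_text(good, raw)
bad_ok = preserves_text(bad, raw)
```

With these inputs the "Now" form round-trips to the raw string and the "Before" form does not, which is the kind of check that could be run over a whole training file.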

de-code commented 4 years ago

Do you have an idea how much effort it might be to fix the spacing issue in the training data generation? And is that something only you are likely able to do?

It's fixed. Rather quick and dirty but it looks good now (I just copied what was done elsewhere).

That is quick. Thank you.

I also added the "serials" field (for things like "Lecture Notes in Computer Science"), which was incorrectly merged with journals before (old quick and dirty). The model needs to be updated for that field so I launched a retraining.

I am assuming the old models would still work but just not output the serials tag?

But at the same time it means that here you are not including doi:. in the idno field. Would that not be an argument for not including the label in general?

Well it's really an error from the bioRxiv reference formatter, I don't include it because we can't :) In 99.9% of the cases we can add the prefix when it is present.

I am not sure whether bioRxiv is at fault here. Is that not what the author submitted, in whatever template they chose to use? (which they may have adopted from a journal they want to submit to later or had submitted to before)

I found similar ones on PubMed but don't have the link at hand. But here is one from a journal: https://link.springer.com/chapter/10.1007%2F978-3-642-27340-7_8 using:

doi: JVI.02530-07 [pii]

The argument for not including the label would be that the annotation then matches the actual identifier. Whereas with the label I understand it is really to help the GROBID extraction itself based on how it works now.

Maybe in that case it should include the label in square brackets, i.e.:

doi:<idno type="PII">S0304-3959(13)00330-8[pii]</idno>; <idno type="DOI">10.1016/j.pain.2013.06.022[doi]</idno>

Otherwise it would be difficult to identify PII since it doesn't seem to follow a journal independent pattern. (Although for us, the pii is rather secondary but might be useful for cases that don't have any other identifier)

kermitt2 commented 4 years ago

I am assuming the old models would still work but just not output the serials tag?

yes they are outputted as "journal" by the old one - but I've just updated the new model.

I am not sure whether bioRxiv is at fault here. Is that not what the author submitted, in whatever template they chose to use? (which they may have adopted from a journal they want to submit to later or had submitted to before)

yes you're right! it would be interesting to trace back the template/submission software/publisher/journals that is/are driving this problem, but it seems used a lot by the bioRxiv users.

Maybe in that case it should include the label in square brackets, i.e.:

doi:<idno type="PII">S0304-3959(13)00330-8[pii]</idno>; <idno type="DOI">10.1016/j.pain.2013.06.022[doi]</idno>

I thought about that when annotating, but I would need to change the regex patterns, and the problem is that 10.1016/j.pain.2013.06.022[doi] itself is actually a valid DOI (the syntax is so relaxed in the last part!), so that would be a really ad hoc capture of the identifier type. In that case I considered that the DOI is good enough to be recognized alone, and it will be correctly identified. So I would not consider a "postfix", to limit the complexity, as it does not appear useful.
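To illustrate why the trailing [doi] is awkward, here is a small sketch with a deliberately loose DOI pattern. The pattern and the `find_doi` helper are assumptions for illustration only, not the regex used in GROBID; it simply treats everything up to the next whitespace as part of the DOI.

```python
import re

# Illustrative loose DOI pattern (an assumption, not GROBID's regex):
# "10.", a registrant code, "/", then any run of non-space characters.
LOOSE_DOI = re.compile(r"10\.\d{4,9}/\S+")


def find_doi(text: str):
    """Return the first DOI-like substring of text, or None."""
    match = LOOSE_DOI.search(text)
    return match.group(0) if match else None


# The bracketed label glued to the identifier is swallowed by the pattern.
glued = find_doi("doi:S0304-3959(13)00330-8[pii]; 10.1016/j.pain.2013.06.022[doi]")

# With a space before the label, the DOI is captured on its own.
spaced = find_doi("doi:S0304-3959(13)00330-8 [pii]; 10.1016/j.pain.2013.06.022 [doi]")
```

Since the suffix syntax of a DOI is so permissive, telling "022[doi]" apart from "022" would need an extra rule specific to this citation style, which is the ad hoc step mentioned above.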

it would be difficult to identify PII since it doesn't seem to follow a journal independent pattern

normally PII follows a constrained pattern (see https://en.wikipedia.org/wiki/Publisher_Item_Identifier). As defined, they look easy to catch without prefix. However, much of the [pii] stuff in bioRxiv articles is not PII, for example:

doi:bts635 [pii] 10.1093/bioinformatics/bts635.

07-PLBI-RA-0103 [pii] 10.1371/journal.pbio.0050237

None of these are PII.
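A rough pattern check along these lines could look as follows. This is a sketch under assumptions: the layout is taken from the Wikipedia page linked above (17 characters once punctuation is dropped), the item number is assumed to be alphanumeric, the check character is not verified, and `looks_like_pii` is a hypothetical helper name.

```python
import re

# Assumed layout once punctuation is removed (17 characters in total):
#   "S" + 8-character ISSN + 2-digit year, or "B" + 10-character ISBN,
#   followed by a 5-character item number and a check character (0-9 or X).
PII_PATTERN = re.compile(r"(S[0-9]{7}[0-9X][0-9]{2}|B[0-9]{9}[0-9X])[0-9A-Z]{5}[0-9X]")


def looks_like_pii(value: str) -> bool:
    """Shape check only: the check character itself is not validated."""
    compact = re.sub(r"[^0-9A-Za-z]", "", value).upper()
    return PII_PATTERN.fullmatch(compact) is not None


# Values taken from the examples in this thread.
results = {
    value: looks_like_pii(value)
    for value in [
        "S0304-3959(13)00330-8",
        "S0896-6273(14)00023-3",
        "bts635",
        "07-PLBI-RA-0103",
        "JVI.02530-07",
    ]
}
```

The two Elsevier-style values pass the shape check, while the three strings labelled [pii] in the bioRxiv and Springer examples do not.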

(but similar crap in PMID and DOI from time to time, maybe there are some special functions to introduce errors in preprint to help to justify the cost of the final published version ;)

kermitt2 commented 4 years ago

(but similar crap in PMID and DOI from time to time, maybe there are some special functions to introduce errors in preprint to help to justify the cost of the final published version ;)

I found similar ones on PubMed but don't have the link at hand. But here is one from a journal: https://link.springer.com/chapter/10.1007%2F978-3-642-27340-7_8 using:

doi: JVI.02530-07 [pii]

omg, so even in published stuff :D

de-code commented 4 years ago

normally PII follows a constrained pattern, see https://en.wikipedia.org/wiki/Publisher_Item_Identifier

I see. I didn't realise it was that. As I am working on automatically cleaning up the XML using some rules, I could remove the pii annotation in that case and only accept those largely matching the expected pattern.

I am also cleaning up the DOI, PMID and PMCID. I am also adding corresponding annotations where they haven't already been annotated and are easily identifiable.

There should then be a revised set of XML files that could then be used for training and evaluation.

(I don't expect it to be perfect but better than what we currently have, which can be further improved)

de-code commented 4 years ago

Hi @kermitt2

I have uploaded a partially fixed XML version of the bioRxiv dataset:

That is purely based on the XML itself, e.g. it validates the identifiers. If a pmcid doesn't have the PMC prefix followed by some digits then it's not a PMCID and the tag is removed. If the pub-id element has a matching value within it, then that value is used. If there is no pmcid pub-id then it tries to find one using simple regular expressions, as it prefers the already existing tags where possible. For DOIs it is a bit more complicated. In any case, it is still not perfect but should give a better picture of the performance.
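The PMCID rule described above can be sketched roughly like this. It is not the actual clean-up code: `resolve_pmcid` is a hypothetical name, and the pattern is simply "PMC followed by digits" as stated in the description.

```python
import re
from typing import Optional

# "PMC prefix with some digits", as described above.
PMCID = re.compile(r"PMC[0-9]+")


def resolve_pmcid(tagged_value: Optional[str], reference_text: str) -> Optional[str]:
    """Pick a PMCID for one reference, preferring an existing pub-id tag.

    tagged_value is the content of <pub-id pub-id-type="pmcid">, or None
    when the reference has no such element.
    """
    if tagged_value is not None:
        match = PMCID.search(tagged_value)
        # A tag with no "PMC<digits>" inside is treated as wrong and dropped.
        return match.group(0) if match else None
    # No pmcid tag at all: fall back to a simple search in the raw text.
    match = PMCID.search(reference_text)
    return match.group(0) if match else None


# The doubled prefix from the PLoS One example earlier in the thread.
from_tag = resolve_pmcid("PMCPMC6063408", "")
# A tagged value without the prefix is rejected.
rejected = resolve_pmcid("6063408", "PubMed Central PMCID: 6063408.")
# No tag present, identifier recovered from the text.
from_text = resolve_pmcid(None, "PubMed Central PMCID: PMC6063408.")
```

Note that searching inside the tagged value, rather than requiring an exact match, is what lets the "PMCPMC6063408" case be repaired instead of discarded.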

kermitt2 / grobid

Retraining: Introduce new element tag for incremental training in GROBID #283