kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Training Issues with ArXiv papers #777

Open m485liuw opened 3 years ago

m485liuw commented 3 years ago

Hi, we just finished a small-scale data annotation effort for the header model with arXiv papers (30 PDFs that the header model had annotated incorrectly) and retrained the header model with the original data plus the newly annotated files. Unfortunately, the accuracy stays the same on our generated test set (1,000 arXiv papers). In fact, we found the accuracy barely increases (~88% for abstract, 95% for header) beyond 300 training examples. We are wondering: is it even possible to increase the accuracy by training with more data? If so, do you have any suggestions for designing the dataset (i.e., any rule for selecting examples)?
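
For illustration, a field-level accuracy check along these lines can be scripted against a running GROBID service. This is only a rough sketch: `/api/processHeaderDocument` is the standard header endpoint, but the hypothetical `ground_truth.csv` layout (columns `pdf`, `title`, `abstract`), the normalization and the exact-match scoring below are simplifications to adapt to your own test set.

```python
# Rough sketch: field-level accuracy for title/abstract over a test set.
# Assumes a running GROBID service and a hypothetical ground_truth.csv with
# columns pdf,title,abstract -- adapt paths, matching and normalization.
import csv
import re
import requests
from xml.etree import ElementTree as ET

GROBID = "http://localhost:8070/api/processHeaderDocument"
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def normalize(text):
    return re.sub(r"\s+", " ", text or "").strip().lower()

def extract_header(pdf_path):
    # Send the PDF to the header service and parse the returned TEI
    with open(pdf_path, "rb") as f:
        r = requests.post(GROBID, files={"input": f}, timeout=60)
    r.raise_for_status()
    root = ET.fromstring(r.text)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    abstract = root.find(".//tei:abstract", TEI_NS)
    get = lambda el: normalize("".join(el.itertext())) if el is not None else ""
    return get(title), get(abstract)

rows = list(csv.DictReader(open("ground_truth.csv")))
hits = {"title": 0, "abstract": 0}
for row in rows:
    title, abstract = extract_header(row["pdf"])
    hits["title"] += title == normalize(row["title"])
    hits["abstract"] += abstract == normalize(row["abstract"])

for field, n in hits.items():
    print(f"{field}: {n / len(rows):.1%} exact match over {len(rows)} PDFs")
```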

Also, we found that most of the header model errors come from: 1. footer/header text appearing in the abstract, 2. the title not being identified, 3. keywords/CCS concepts appearing in the abstract, 4. the upstream model (segmentation model) being wrong. After annotating papers with the first three issues and retraining, some of the papers with these issues got corrected, whereas some new PDFs (originally correct) are now annotated incorrectly with these same issues. How do you explain this? Shouldn't these issues be fixed after retraining (or, at least, shouldn't no new errors of this type appear)?

Also, as mentioned, some errors come from the segmentation model. Do you think it would help to retrain the segmentation model as well?

Looking forward to your reply and to any plans for improving the header model.

kermitt2 commented 3 years ago

Hello @m485liuw !

Thanks a lot for the issue and the detailed analysis of errors.

I think it's a bit difficult to conclude anything from 30 examples added to the existing 592, but it's possible that with the existing set of features (which is limited due to the small size of the training data), and for a relatively homogeneous collection like arXiv, we don't improve beyond 300 examples. When evaluating against PubMed Central, scores were still improving as I was reaching 600 examples, so I have not faced this issue yet.

Your approach is the one I have been following: adding examples that the current model gets wrong. This is how I got the best learning curve so far.

One important aspect, I think, is that the current training data is "small size" but "high quality". Every label was checked several times, and we pay a lot of attention to keeping the labelling very consistent (in terms of what exactly is labeled, which words are excluded from the labeled chunks, etc.). Introducing examples with a slightly different, inconsistent or incomplete labeling approach can hurt the learning a lot (make it less efficient) and introduce prediction errors compared to the previous version. Working with such small training sets makes every small labeling error extremely impactful. On the other hand, this approach makes it possible to get very good accuracy with limited training data and to add new examples with visible effect.
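
As an illustration of what such a consistency check can look like in practice, here is a rough lint over header training files. The element names (`docTitle/titlePart`, `div type="abstract"`) follow the header annotation format, but the specific checks are only examples, and the queries assume the training TEI declares no XML namespace; adapt them to your files.

```python
# Rough lint over GROBID header training files (*.training.header.tei.xml).
# Element names follow the header annotation format (docTitle/titlePart,
# div type="abstract"); the checks themselves are only examples, and the
# queries assume the training TEI declares no XML namespace.
import glob
import re
from xml.etree import ElementTree as ET

def text_of(elem):
    return re.sub(r"\s+", " ", "".join(elem.itertext())).strip()

for path in glob.glob("xml/*.training.header.tei.xml"):
    root = ET.parse(path).getroot()
    titles = root.findall(".//docTitle/titlePart")
    abstracts = root.findall(".//div[@type='abstract']")

    if len(titles) != 1:
        print(f"{path}: expected exactly 1 labeled title, found {len(titles)}")
    if not abstracts:
        print(f"{path}: no labeled abstract")

    for t in titles:
        txt = text_of(t)
        # e.g. a footnote marker accidentally kept inside the labeled title
        if txt and txt[-1] in "*†‡0123456789":
            print(f"{path}: suspicious trailing character in title: {txt[-15:]!r}")
```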

If you'd like to contribute training data to GROBID and you're not under time pressure, I would be very happy to review the 30 examples and check their overall consistency.

There are different ways to improve the accuracy:

- In general, if the error is due to a segmentation error from the segmentation model, it indeed has to be fixed first by retraining the segmentation model; this would be the way to fix "1. footer/header appears in the abstract".
- "2. Title not identified" might come from a reading-order issue, which is something I am still working on in pdfalto. As this is still not stable, I have excluded examples with strong reading-order issues from the training data.
- "3. Keywords/CCS in the abstract" is a relatively common error that I also observe with the current model, and so far I have assumed that more training data would help with it.

m485liuw commented 3 years ago

Thanks a lot for the detailed reply! Is the "footer/header appears in abstract" error really coming from the segmentation model? We thought the footer/header on the first page should be included in the header part, so the segmentation model would be correct there. Here are some of our annotations; it would be really helpful if you could check them for any inconsistency. xml.zip

kermitt2 commented 3 years ago

Is "foot/header appears in abstract" error coming from segmentation model? We thought foot/header in the first page should be included in the header part so segmentation model was correct?

Yes, you're correct, sorry, my fault! The header/footer on the first page should be included in the header part, because it usually contains metadata that is important to extract with the header!

Thanks for the annotations, I will try to have a look soon and give you feedback.

kermitt2 commented 3 years ago

Here is the review. I made corrections in every document; some of them had significant problems in the authors/affiliation/email sequences.

xml-reviewed.zip

m485liuw commented 3 years ago

Hi, thanks for your detailed annotation. We are interested in the title and abstract only, so we only annotated those parts. Do you know whether it affects the training if we only annotate parts of the header?

kermitt2 commented 3 years ago

My experience is that errors and incomplete annotations significantly affect the other annotations. By introducing more labels, we globally improve the accuracy, because we enrich the representation of the contexts, make the learning more efficient and reduce sources of ambiguity. For instance, in NER, to improve the recognition of dates, we typically also annotate other numerical entities (like currencies, quantities, reference markers, ...).

In our case, for instance, by explicitly labeling keywords and meeting places/venues, we normally improve the accuracy of identifying the title and abstract for the same amount of training data.

kermitt2 commented 3 years ago

For info, I've added the 12 corrected header XML files to the training data, together with 12 bioRxiv headers, and got an improvement in header f-score results for both the PMC and bioRxiv evaluation sets.

In your initial set of 12 headers, for instance, I corrected 3 titles (xml/2104.06550v1.training.header.tei.xml, 2104.06800v1.training.header.tei.xml and xml/2104.10542v1.training.header.tei.xml). That would mean your additional training data has a 75% precision for the title field, considering the Grobid annotation guidelines. I think it's not possible to improve a model that already provides above 90% title-field accuracy with annotations of lower quality than the current performance.

m485liuw commented 3 years ago

Hi Patrice, did you get an improvement for both abstract and title? By the way, is there anywhere I could find how you built your evaluation set?

kermitt2 commented 3 years ago

Hello !

By the way, is there anywhere I could find how you built your evaluation set?

About the evaluation sets: they are described at https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/. The PMC and bioRxiv samples are holdout sets (kept out of the training data) and stable over time. It's an end-to-end evaluation, so it starts from the PDF and compares the final TEI XML results with the JATS files.

The PMC sample set is from Alexandru Constantin's Ph.D. work (PDFX), and the bioRxiv set was created by Daniel Ecer.
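
To give an idea of what comparing the final TEI XML with the JATS files means in practice, here is a minimal sketch for a single field (the title). This is not the actual evaluation code, which covers many more fields and several matching strategies, and the file names are only placeholders.

```python
# Minimal illustration of the comparison: the title extracted by GROBID (TEI)
# vs. the title in the publisher JATS file. The real evaluation covers many
# fields and several matching strategies; file names here are placeholders.
import re
from xml.etree import ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def norm(s):
    return re.sub(r"[^a-z0-9]+", " ", (s or "").lower()).strip()

def tei_title(tei_path):
    root = ET.parse(tei_path).getroot()
    el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    return "".join(el.itertext()) if el is not None else ""

def jats_title(jats_path):
    root = ET.parse(jats_path).getroot()
    el = root.find(".//article-meta/title-group/article-title")
    return "".join(el.itertext()) if el is not None else ""

print(norm(tei_title("paper.grobid.tei.xml")) == norm(jats_title("paper.nxml")))
```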

Did you get an improvement for both abstract and title?

Yes, but not a lot for PMC.

PMC set, before: https://github.com/kermitt2/grobid/blob/master/grobid-trainer/doc/PMC_sample_1943.results.grobid-0.7.0-SNAPSHOT-Glutton-DeLFT-WAPITI-MIXED-BidLSTM-CRF-FEATURES-CITATION-09.06.2021
after: https://github.com/kermitt2/grobid/blob/master/grobid-trainer/doc/PMC_sample_1943.results.grobid-0.7.0-SNAPSHOT-Glutton-WAPITI-29.06.2021

bioRxiv set, before: https://github.com/kermitt2/grobid/blob/master/grobid-trainer/doc/bioRxiv_test_2000.results.grobid-0.6.2-SNAPSHOT-Glutton-DeLFT-WAPITI-MIXED-SciBERT-01.11.2020
after: https://github.com/kermitt2/grobid/blob/master/grobid-trainer/doc/bioRxiv_test_2000.results.grobid-0.7-0-SNAPSHOT-Glutton-DeLFT-WAPITI-MIXED-BidLSTM-CRF-FEATURES-HEADER_CITATIONS-29.06.2021

Last weekend I made an additional "training data" effort (a few dozen originally failing examples for segmentation and header), and the results are again better. So we should be able to keep improving the current header model (independently of improving its design/implementation) just by augmenting the training data like this.

m485liuw commented 3 years ago

Hi Patrice, I saw you mentioned in the end-to-end evaluation that you used Pub2TEI to generate the ground truth. Do you also use that method for training, i.e. to find what GROBID annotates wrong?

kermitt2 commented 3 years ago

Hi !

I think it's a good idea, but I am not even at that point... I simply downloaded some random PMC, bioRxiv and arXiv PDF files, processed them, and selected a few dozen with an empty title, empty authors and/or an empty abstract (which is a good sign that something went very wrong :).
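
For illustration, a rough sketch of that selection step: it assumes a local GROBID service and copies any PDF for which the current header model returns an empty title, authors or abstract into a hypothetical to_annotate/ folder. The directory names and the exact emptiness checks are placeholders.

```python
# Rough sketch of the selection step: keep PDFs for which the current header
# model returns an empty title, authors or abstract. Assumes a local GROBID
# service; directory names are placeholders.
import glob
import os
import shutil
import requests
from xml.etree import ElementTree as ET

GROBID = "http://localhost:8070/api/processHeaderDocument"
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def field_empty(root, xpath):
    el = root.find(xpath, TEI_NS)
    return el is None or not "".join(el.itertext()).strip()

os.makedirs("to_annotate", exist_ok=True)
for pdf in glob.glob("random_sample/*.pdf"):
    with open(pdf, "rb") as f:
        r = requests.post(GROBID, files={"input": f}, timeout=60)
    r.raise_for_status()
    root = ET.fromstring(r.text)
    checks = [".//tei:titleStmt/tei:title", ".//tei:author", ".//tei:abstract"]
    if any(field_empty(root, xp) for xp in checks):
        shutil.copy(pdf, "to_annotate/")  # candidate for new training data
```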

I added 65 header files and 62 segmentation files in last week's effort, so roughly a 10% increase of the training data. For the header model, this led to around +1.0 F-score on average on the PMC sample set, and around +4 on bioRxiv (there were around 30 files from bioRxiv, because it was not really represented in the training data so far).

Except for reference parsing, I didn't get this kind of improvement from working on deep learning models for 3 years :D

m485liuw commented 3 years ago

Thanks! Have you uploaded the new models to GitHub?

kermitt2 commented 3 years ago

yes