kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Questions about data annotation in GROBID #1067

Open SirLYC opened 11 months ago

SirLYC commented 11 months ago

Hello kermitt2,

I'm currently working on annotating data for the GROBID project and have a few questions regarding the annotation process. I would appreciate it if you could provide some guidance on the following issues:

  1. The General Principles section of the documentation says that the text flow should not be changed. Does this mean that the order and content of the text in the pre-annotated data may not be altered or removed at all? Or does it mean that the content of each XML text node may not be modified, while the order of the nodes can be adjusted freely? For example, in the pre-annotated references.referenceSegmenter.tei.xml, because of the order in which the text is stored in the PDF, the extracted text flow contains the text of each reference first, followed by the reference numbers, with some main-text content mixed in between. In this case, am I allowed to:

    • Remove non-reference content
    • Move the text flow of the reference number to its corresponding reference entry
  2. After I finish correcting segmentation.xml and move on to fulltext.xml, I find that some content that ends up in fulltext.xml does not belong to the body, and that some body content was recognized as front during the segmentation stage. In this case, should I remove the content that does not belong to the body and add back the missing body content? Additionally, am I allowed to reorder text tokens if the extracted body order does not match the human reading order (while still leaving the text child nodes themselves unchanged)?

I look forward to your response, and thank you for your assistance!

Best regards!

kimn1944 commented 11 months ago

I would like to second @SirLYC's point 2. In my own testing, a piece of text marked as <body> in segmentation.xml incorrectly shows up in header.xml instead of fulltext.xml. I also wonder whether we can retrain this behavior by moving that element from header.xml to fulltext.xml.

lfoppiano commented 10 months ago

Dear @SirLYC, sorry for the late answer.

  1. The text flow should not be modified, which means that you can only add, move or remove the XML tags that define each entity. If the text is out of order because of the conversion from PDF, it should not be corrected; otherwise the model will learn from input that it is unlikely to see in practice.

  2. In general I prefer to work transversely, that is, model by model rather than document by document, because certain annotations can be complex and it is more efficient to correct them by type. So I would first correct all segmentation files (ignoring the ones that are already corrected), then, when finished, re-train the segmentation model, re-generate the training data, and move to the next model in the cascade, e.g. full text or header.
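
As an illustration of point 1, here is a minimal sketch of how one might check that a corrected training file still has the same text flow as the generated one. This is not part of GROBID: the helper names `text_flow` and `same_text_flow` and the tiny XML snippets are hypothetical, and the check assumes that only tags (and the whitespace around them) may differ between the two versions.

```python
import re
import xml.etree.ElementTree as ET


def text_flow(xml_string: str) -> str:
    """Return the text of an XML document in document order, with tags and whitespace removed."""
    root = ET.fromstring(xml_string)
    text = "".join(root.itertext())
    # Whitespace is dropped so that moving a tag boundary does not count as a change.
    return re.sub(r"\s+", "", text)


def same_text_flow(original_xml: str, corrected_xml: str) -> bool:
    """True if both documents contain the same characters in the same order."""
    return text_flow(original_xml) == text_flow(corrected_xml)


# Hypothetical, heavily simplified snippets (not real GROBID output).
original = "<listBibl><bibl>Smith J. A study.<lb/> [1]</bibl></listBibl>"
# Only a tag boundary was moved: the text flow is untouched.
retagged = "<listBibl><bibl>Smith J. A study.<lb/></bibl> [1]</listBibl>"
# The reference number was moved in front of its entry: the text flow changed.
reordered = "<listBibl><bibl><label>[1]</label> Smith J. A study.<lb/></bibl></listBibl>"

print(same_text_flow(original, retagged))
print(same_text_flow(original, reordered))
```

For real training files one would read each file from disk (for example with `ET.parse(path).getroot()`) and compare the generated version against the corrected one before using it for re-training.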