Splitting up TEI XML files for TagWorks

howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.


Splitting up TEI XML files for TagWorks #580

Open jameshowison opened 5 years ago

jameshowison commented 5 years ago

We're exploring using TagWorks to scale up tagging. They have a sensible user interface and handle recruitment via crowdsourcing. Their interface lets the user reveal additional context before and after the sentence. It can also ask about certainty and let the user highlight parts of the sentence, such as the software name and version.

I'm looking at the TEI XML output from grobid. Very cool stuff, and I love the biblio recognition! My thinking is to use sentences from the <body> as codeable units. Any thoughts on how to break up the <body>?
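
To make the idea of a sentence as the codeable unit with revealable context concrete, here is a minimal sketch. It assumes a paragraph's plain text is already available. The regex splitter is deliberately naive (it will break on abbreviations such as "e.g."), and the `sentence`/`before`/`after` keys are hypothetical names, not TagWorks' actual input format.

```python
import re

# Naive boundary: sentence-ending punctuation, whitespace, then an uppercase letter.
# This is an assumption for illustration; a real pipeline would use a proper
# sentence segmenter.
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(paragraph):
    """Split paragraph text into sentences with the naive rule above."""
    return [s.strip() for s in _SENT_BOUNDARY.split(paragraph) if s.strip()]


def sentence_units(paragraph, window=1):
    """Return one unit per sentence, with up to `window` sentences of context
    on each side that an interface could reveal on request."""
    sentences = split_sentences(paragraph)
    units = []
    for i, sentence in enumerate(sentences):
        units.append({
            "sentence": sentence,
            "before": sentences[max(0, i - window):i],
            "after": sentences[i + 1:i + 1 + window],
        })
    return units


units = sentence_units("We used SPSS 22. Data were cleaned first. Results follow.")
```

Keeping the context inside each unit means every sentence can be shown on its own without re-reading the source file.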

kermitt2 commented 5 years ago

For the machine learning sequence labelling, I use the loose notion of a "paragraph" as input (it's not sentence-based; I got better results by extending to a complete paragraph). More concretely, I use the following TEI sections:

title: <title level="a">
abstract: <abstract>
keywords: <keywords>
paragraph: <p>
item: <item> (if any; normally always under <p> when generated by grobid)
figure/table caption: <figDesc>

I also process the content of the annex (if any), which in TEI is not under <body> but under <back>.
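
As a rough illustration of pulling those sections out of a TEI file, here is a minimal sketch using only the Python standard library. It is not grobid's own code. The function name `extract_units`, the labels, and the choice to skip `<listBibl>` (so that titles of cited works are not treated as units) are assumptions made for this example. Nested matches are emitted only once, at the outermost matching element, so an `<item>` inside a `<p>` is part of that paragraph.

```python
import xml.etree.ElementTree as ET

# (label, TEI tag, required attributes) for the sections listed above.
UNIT_SPECS = [
    ("title", "title", {"level": "a"}),
    ("abstract", "abstract", {}),
    ("keywords", "keywords", {}),
    ("paragraph", "p", {}),
    ("item", "item", {}),
    ("caption", "figDesc", {}),
]

# Assumption: skip the bibliography so cited titles are not extracted.
SKIP_TAGS = {"listBibl"}


def _local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def _walk(elem, units):
    name = _local(elem.tag)
    if name in SKIP_TAGS:
        return
    for label, tag, attrs in UNIT_SPECS:
        if name == tag and all(elem.get(k) == v for k, v in attrs.items()):
            # Flatten inline markup and normalise whitespace.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                units.append((label, text))
            return  # do not descend: nested matches belong to this unit
    for child in elem:
        _walk(child, units)


def extract_units(tei_xml):
    """Return (label, text) pairs in document order, covering <body> and <back>."""
    units = []
    _walk(ET.fromstring(tei_xml), units)
    return units


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <titleStmt><title level="a">A study of tools</title></titleStmt>
    <abstract><p>We used R.</p></abstract>
  </teiHeader>
  <text>
    <body>
      <div><head>Methods</head><p>Data were analysed in <ref>SPSS</ref> 22.</p></div>
      <figure><figDesc>Figure 1. Pipeline.</figDesc></figure>
    </body>
    <back>
      <div type="annex"><p>Extra details.</p></div>
      <listBibl><biblStruct><analytic><title level="a">Cited paper</title></analytic></biblStruct></listBibl>
    </back>
  </text>
</TEI>"""

units = extract_units(SAMPLE)
```

Because the walk covers the whole tree, annex paragraphs under `<back>` are picked up alongside those under `<body>`.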

kermitt2 commented 5 years ago

I don't actually know TagWorks, but it looks promising!

I have a small list of such tools, which I'm sharing here for reference:

https://aws.amazon.com/sagemaker/groundtruth/?nc1=h_ls (lets you recruit annotators through Amazon Mechanical Turk)
https://www.tagtog.net/
https://github.com/varal7/ieturk (simple UI for Amazon Mechanical Turk)