Open jameshowison opened 5 years ago
For the machine learning sequence labelling, I am using the loose notion of a "paragraph" as input (it's not sentence-based; I got better results by extending to a complete paragraph). More concretely, I use the following TEI sections:
<title level="a">
<abstract>
<keywords>
<p>
<item> (if any, but normally always under <p> when generated by grobid)
<figDesc>
I also process the content of the annex (if any), which in TEI is not under <body> but under <back>.
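For anyone who wants to try this on their own grobid output, here is a minimal sketch of collecting those paragraph-level units with Python's standard library. It is not the actual pipeline code: the function names (`collect_units`, `paragraph_units`) are made up for illustration, and the choice to fall back to a whole `<abstract>` or `<keywords>` only when it contains no `<p>` is my assumption to avoid emitting the same text twice. It walks the whole document, so it covers both `<body>` and `<back>`, and it will also pick up any `<title level="a">` inside bibliography entries unless you filter those out.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace


def q(tag):
    """Return the namespace-qualified form of a TEI tag name."""
    return f"{{{TEI_NS}}}{tag}"


# Elements taken whole as one unit (nested <item> stays inside its <p>).
UNIT_TAGS = {q("p"), q("item"), q("figDesc")}
# Elements taken whole only when they contain no <p> (assumption).
FALLBACK_TAGS = {q("abstract"), q("keywords")}


def text_of(el):
    """Flatten an element to plain text with normalised whitespace."""
    return " ".join("".join(el.itertext()).split())


def collect_units(el, out):
    """Depth-first walk; emit (tag, text) and stop descending at each unit."""
    is_title = el.tag == q("title") and el.get("level") == "a"
    is_fallback = el.tag in FALLBACK_TAGS and el.find(f".//{q('p')}") is None
    if el.tag in UNIT_TAGS or is_title or is_fallback:
        text = text_of(el)
        if text:
            out.append((el.tag.split("}")[1], text))
        return
    for child in el:
        collect_units(child, out)


def paragraph_units(tei_xml):
    """Parse a TEI string and return its paragraph-level units in document order."""
    out = []
    collect_units(ET.fromstring(tei_xml), out)
    return out
```

Each unit comes back as a `(tag, text)` pair in document order, so the tag can be kept as a feature or used to drop sections you don't want.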
I don't actually know TagWorks, but it looks promising!
I have a small list of such tools, so I share it here for reference:
We're exploring TagWorks for scaling up tagging; they have a sensible user interface and handle recruitment via crowdsourcing, etc. Their interface allows the user to reveal additional context before and after the sentence (as well as asking about certainty and highlighting parts of the sentence, such as software name, version, etc.).
I'm looking at the TEI XML output from grobid, very cool stuff, I love the biblio recognition! My thinking is to have sentences from the <body> as codeable units. Any thoughts on how to break up the <body>?
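To make the "sentence as codeable unit" idea concrete, here is a rough sketch of splitting one paragraph's text into sentences while keeping the neighbouring sentences as context a coder could reveal. The splitter is a deliberately naive regex, not anything grobid or TagWorks provides, and it will break wrongly on abbreviations such as "et al. 2019"; a proper sentence segmenter would be the safer choice. `CodeableUnit` and `sentence_units` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Naive boundary: '.', '!' or '?' followed by whitespace and then
# an uppercase letter or a digit.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


@dataclass
class CodeableUnit:
    sentence: str  # the sentence shown to the coder
    before: str    # previous sentence in the paragraph, "" if none
    after: str     # next sentence in the paragraph, "" if none


def sentence_units(paragraph):
    """Split a paragraph into sentences, each with one sentence of context."""
    sentences = [s for s in _BOUNDARY.split(paragraph.strip()) if s]
    units = []
    for i, sentence in enumerate(sentences):
        before = sentences[i - 1] if i > 0 else ""
        after = sentences[i + 1] if i + 1 < len(sentences) else ""
        units.append(CodeableUnit(sentence, before, after))
    return units
```

Context here never crosses a paragraph boundary, which keeps the paragraph as the outer unit and the sentence as the thing being coded.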