I found a missing <s> tag. Our new extractor script should use any of <s>, <p>, <doc> and <file> (and the respective closing tags) as a trigger for a sentence boundary.
Glue tags <g/>, which indicate that there is no space between neighbouring tokens, are not used.
No occurrences of < or > outside tags.
The number of <p> equals the number of <s>, i.e. <p> tags carry no extra information here.
Our extractor treats any occurrence of an opening or closing tag as a sentence boundary, which makes it as robust to these inconsistencies as possible without using the content itself as a boundary indicator.
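A minimal sketch of this boundary rule, not the actual extractor script. It assumes a one-item-per-line layout (each token or tag on its own line, optionally with tab-separated annotation columns after the token), which the notes above do not confirm. The names `BOUNDARY_TAG` and `iter_sentences` are hypothetical.

```python
import re
from typing import Iterable, Iterator, List

# Assumption: only these four elements, opening or closing, mark a boundary.
# Attributes such as <doc id="..."> are tolerated.
BOUNDARY_TAG = re.compile(r"^</?(s|p|doc|file)(\s[^>]*)?>$")


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield sentences as token lists, splitting at every boundary tag."""
    sentence: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if BOUNDARY_TAG.match(line):
            # Any boundary tag closes the sentence collected so far, so a
            # missing <s> or </s> cannot merge two sentences.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Token line: keep the first column only (assumed to be the word form).
        sentence.append(line.split("\t")[0])
    if sentence:
        yield sentence  # flush a sentence left open at end of input


if __name__ == "__main__":
    sample = ["<doc>", "<p>", "<s>", "Hello", "world", "</s>",
              "Second", "sentence", "</p>", "<p>", "<s>", "Third", "</doc>"]
    for tokens in iter_sentences(sample):
        print(" ".join(tokens))
```

In the sample, the second sentence has no opening `<s>` and the third has no closing `</s>`, yet each still comes out as its own sentence because every tag of the four kinds flushes the buffer.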
Issue #4 also reports that some </p> and </s> are missing.
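Observations of this kind (equal counts of `<p>` and `<s>`, missing closing tags) could be checked with a small tag audit. The sketch below is an illustration under the same unconfirmed assumption that each tag sits on its own line. `count_tags` and `unbalanced` are hypothetical helper names.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumption: tags of interest are <s>, <p>, <doc>, <file> and their closers.
TAG = re.compile(r"^<(/?)(s|p|doc|file)(\s[^>]*)?>$")


def count_tags(lines: Iterable[str]) -> Counter:
    """Count opening and closing tags, keyed like '<s>' and '</s>'."""
    counts: Counter = Counter()
    for raw in lines:
        m = TAG.match(raw.strip())
        if m:
            prefix = "</" if m.group(1) else "<"
            counts[prefix + m.group(2) + ">"] += 1
    return counts


def unbalanced(counts: Counter) -> Dict[str, Tuple[int, int]]:
    """Return {name: (n_open, n_close)} for elements whose counts differ."""
    result: Dict[str, Tuple[int, int]] = {}
    for name in ("s", "p", "doc", "file"):
        n_open = counts["<" + name + ">"]
        n_close = counts["</" + name + ">"]
        if n_open != n_close:
            result[name] = (n_open, n_close)
    return result


if __name__ == "__main__":
    sample = ["<doc>", "<p>", "<s>", "a", "</s>", "<s>", "b", "</p>", "</doc>"]
    counts = count_tags(sample)
    print(dict(counts))
    print(unbalanced(counts))
```

Comparing `counts["<p>"]` with `counts["<s>"]` would reproduce the check that `<p>` adds no information beyond `<s>`.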