AngledLuffa opened 1 year ago
Are you saying that it should be tokenized as `it`, `'s`?
yes, those should be `it`, `'s` and `i`, `'m`, etc
lmk if you need or want some assistance scripting changes like that
I think I've got this, thanks! Will let you know if I need help though
How about cases where a noun is followed by `'s`? Are these annotated properly?
Example:

```
The O
opposition O
's O
poor O
election O
results O
```
Here's what I'm seeing when inspecting some processed data:
```
national O
truth O
- O
telling O
process O
would O
have O
on O
Australia B-Location
, O
it O
's O
remarkable O
. O
" O
One O
of O
the O
things O
that O
we O
're O
thinking O
about O
I O
'm O
a O
non O
- O
conformist O
politician O
. O
I O
'm O
a O
revolutionary O
, O
' O
' O
Bouteflika B-Person
told O
The B-Organization
Associated I-Organization
Press I-Organization
```
Can't find the cases you're talking about. Was that perhaps only for the raw annotated data?
the possessive `'s` and the contraction `'m` are correct
when i was going through the data myself, i'd occasionally fix them when i came across such errors
```
cd processed_annotated
grep "^s O$" * | less   # that's a tab character between s and O
af_afrol_16.txt.tsv:s O
af_afrol_18.txt.tsv:s O
af_allaf_15.txt.tsv:s O
af_allaf_24.txt.tsv:s O
af_allaf_24.txt.tsv:s O
af_ips_10.txt.tsv:s O
af_ips_10.txt.tsv:s O
af_ips_10.txt.tsv:s O
```

etc etc
i'm fairly certain most of those can be cleaned up via a script... just look for `s` on a line by itself, especially after a `'` or a curly apostrophe, check that the labels are the same, and combine the rows
again, i can take that on ... maybe i should just go ahead and do that
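The merge described above could be sketched roughly as follows, assuming two-column tab-separated files (token, label); the helper name and the suffix list are illustrative, not the script actually used:

```python
# Sketch: merge a split possessive/contraction back into one token in
# CoNLL-style two-column (token <TAB> label) files. Hypothetical helper,
# not the repo's actual cleanup script.
import sys

APOSTROPHES = {"'", "\u2019", "`"}            # straight, curly, backtick
SUFFIXES = {"s", "m", "d", "t", "re", "ve", "ll"}

def merge_split_clitics(lines):
    """Combine an apostrophe row with the following suffix row
    (' + s -> 's) when both rows carry the same label."""
    out = []
    i = 0
    while i < len(lines):
        cur = lines[i].rstrip("\n")
        if "\t" in cur and i + 1 < len(lines):
            tok, label = cur.split("\t", 1)
            nxt = lines[i + 1].rstrip("\n")
            if "\t" in nxt:
                ntok, nlabel = nxt.split("\t", 1)
                if tok in APOSTROPHES and ntok in SUFFIXES and label == nlabel:
                    out.append(f"{tok}{ntok}\t{label}\n")
                    i += 2
                    continue
        out.append(lines[i])
        i += 1
    return out

if __name__ == "__main__":
    # Rewrite each file in place.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            lines = f.readlines()
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(merge_split_clitics(lines))
```

Skipping the merge when the two labels differ keeps the script from silently corrupting any row that sits on an entity boundary; those few cases would still need a manual look.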
If you could, that would be great. If you have time, of course.
I'm about half done with checking incorrect `'`, but am uncovering a whole bunch of other random tokenization errors in the process:

- `’` (the fancy apostrophe) on a line by itself, followed by `s`, `d`, `t`, etc
- `Ms .` and `Jr .`
- `' '` and backticks or curly apostrophes
- `. . .` instead of as a single token
- `46-41` or other scores / votes
- `U . S .`

And in one file, `cuba_diariodecuba_5.txt.tsv`, `César` got cut off many times. I suspect there will be other words like that which need to be cleaned up.
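Several of the patterns above (split titles and abbreviations) could be surfaced for manual review with a small scanner along these lines; the function name and pattern set are illustrative and deliberately incomplete:

```python
# Sketch: flag likely split abbreviations (e.g. "Ms ." or "U . S .") in a
# list of (token, label) rows so they can be reviewed by hand. The title
# set and the lone-capital heuristic are assumptions, not the actual
# cleanup rules used in the repo.
import re

TITLES = {"Ms", "Mr", "Mrs", "Jr", "Sr", "Dr"}

def flag_suspects(rows):
    """Return indices where a split abbreviation likely starts: a title
    or a lone capital letter whose next row is a bare period."""
    hits = []
    for i in range(len(rows) - 1):
        tok, nxt = rows[i][0], rows[i + 1][0]
        if nxt == "." and (tok in TITLES or re.fullmatch(r"[A-Z]", tok)):
            hits.append(i)
    return hits
```

This only flags candidates rather than merging them, since a lone capital before a period can also be a legitimate sentence-final initial; each hit still needs a human decision.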
alright, i have taken on the `’` tokenizations and the `'` tokenizations
the others are still TODO
US, titles, and ellipses are now cleared up. Would still like to look for decade+s
did the decades as well
maybe still need to look for `' '` on two separate lines
`it's` gets tokenized into three tokens: `it`, `'`, `s`. that should be fixed
same with `'m`, `'t`, etc