stanfordnlp / en-worldwide-newswire

An English NER dataset built from foreign newswire

tokenization issues: ' followed by s, m, t, etc #1

Open AngledLuffa opened 1 year ago

AngledLuffa commented 1 year ago

"it's" gets tokenized into three tokens: it, ', s

that should be fixed

same with 'm 't etc

SecroLoL commented 3 months ago

Are you saying that it should be tokenized as it, 's?

AngledLuffa commented 3 months ago

yes, those should be it, 's and i, 'm, etc

AngledLuffa commented 3 months ago

lmk if you need or want some assistance scripting changes like that

SecroLoL commented 3 months ago

I think I've got this, thanks! Will let you know if I need help though

SecroLoL commented 2 months ago

How about cases where a noun is followed by 's? Are these annotated properly? Example:

The O
opposition  O
's  O
poor    O
election    O
results O

SecroLoL commented 2 months ago

Here's what I'm seeing when inspecting some processed data:

national    O
truth   O
-   O
telling O
process O
would   O
have    O
on  O
Australia   B-Location
,   O
it  O
's  O
remarkable  O
.   O

"   O
One O
of  O
the O
things  O
that    O
we  O
're O
thinking    O
about   O
I   O
'm  O
a   O
non O
-   O
conformist  O
politician  O
.   O
I   O
'm  O
a   O
revolutionary   O
,   O
'   O
'   O
Bouteflika  B-Person
told    O
The B-Organization
Associated  I-Organization
Press   I-Organization

Can't find the cases you're talking about. Was that perhaps only for the raw annotated data?

AngledLuffa commented 2 months ago

the possessive 's and the contraction 'm are correct

when i was going through the data myself, i'd occasionally fix them when i came across such errors

cd processed_annotated
grep -P "^s\tO$" * | less        # \t is the tab character between s and O (-P needs GNU grep)

af_afrol_16.txt.tsv:s   O
af_afrol_18.txt.tsv:s   O
af_allaf_15.txt.tsv:s   O
af_allaf_24.txt.tsv:s   O
af_allaf_24.txt.tsv:s   O
af_ips_10.txt.tsv:s     O
af_ips_10.txt.tsv:s     O
af_ips_10.txt.tsv:s     O
etc etc

AngledLuffa commented 2 months ago

i'm fairly certain most of those can be cleaned up via a script...

just look for s on a line by itself, especially after a ' or a curly apostrophe, check that the labels are the same, combine the rows

again, i can take that on ... maybe i should just go ahead and do that
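A minimal sketch of the cleanup described above, not the script actually used for the dataset. It assumes each row has already been read into a `(token, label)` pair, and the function name and the `CLITICS` list are hypothetical. An apostrophe row followed by a clitic row is combined only when both rows carry the same label.

```python
# Hypothetical helper: merge an apostrophe row with the clitic row after it.
APOSTROPHES = {"'", "\u2019"}  # straight and curly apostrophe
CLITICS = {"s", "m", "t", "d", "re", "ve", "ll"}  # assumed list, extend as needed


def merge_split_clitics(rows):
    """Return a new list of (token, label) rows with "'" + "s" style
    pairs combined into one row when their labels are identical."""
    merged = []
    i = 0
    while i < len(rows):
        token, label = rows[i]
        has_next = i + 1 < len(rows)
        if (
            token in APOSTROPHES
            and has_next
            and rows[i + 1][0] in CLITICS
            and rows[i + 1][1] == label
        ):
            # Combine the two rows, keeping the shared label.
            merged.append((token + rows[i + 1][0], label))
            i += 2
        else:
            merged.append((token, label))
            i += 1
    return merged


rows = [("it", "O"), ("'", "O"), ("s", "O"), ("remarkable", "O")]
print(merge_split_clitics(rows))
```

Rows whose labels differ are left alone, so they can be reviewed by hand.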

SecroLoL commented 2 months ago

If you could, that would be great. If you have time, of course.

AngledLuffa commented 2 months ago

I'm about half done with checking incorrect ', but am uncovering a whole bunch of other random tokenization errors in the process.

’ (the fancy apostrophe) on a line by itself, followed by s, d, t, etc

Ms .

Jr .

' ' and backticks or curly apostrophes

. . . instead of as a single token

46-41 or other scores / votes

U . S .

and in one file, cuba_diariodecuba_5.txt.tsv, César got cut off many times. I suspect there will be other words like that which need to be cleaned up
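Several of the items listed above (Ms ., Jr ., . . ., U . S .) are fixed token sequences that should be one token. The following is a rough sketch of how such sequences could be merged, assuming rows are `(token, label)` pairs. The pattern list, the function name, and the choice to keep the first row's label are all assumptions, not the method used in this repository.

```python
# Hypothetical patterns; longer ones come first so "U . S ." wins over shorter matches.
MERGE_PATTERNS = [
    ("U", ".", "S", "."),
    (".", ".", "."),
    ("Ms", "."),
    ("Jr", "."),
]


def merge_patterns(rows, patterns=MERGE_PATTERNS):
    """Return a new list of (token, label) rows where each listed token
    sequence is joined into a single token with the first row's label."""
    merged = []
    i = 0
    while i < len(rows):
        for pattern in patterns:
            window = tuple(tok for tok, _ in rows[i:i + len(pattern)])
            if window == pattern:
                # Assumption: the merged token keeps the label of its first row.
                merged.append(("".join(pattern), rows[i][1]))
                i += len(pattern)
                break
        else:
            # No pattern matched at this position; keep the row unchanged.
            merged.append(rows[i])
            i += 1
    return merged


rows = [("U", "B-Location"), (".", "I-Location"), ("S", "I-Location"),
        (".", "I-Location"), ("said", "O"), (".", "O"), (".", "O"), (".", "O")]
print(merge_patterns(rows))
```

A sequence such as Jr . at the very end of a sentence is ambiguous, because the period may also be the sentence-final one, so matches like that are worth checking manually.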

AngledLuffa commented 2 months ago

alright, i have taken on the ’ tokenizations and the ' tokenizations

the others are still TODO

AngledLuffa commented 2 months ago

US, titles, and ellipses are now cleared up. Would still like to look for decade+s

AngledLuffa commented 2 months ago

did the decades as well

maybe still need to look for ' ' on two separate lines
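One possible way to locate those cases is sketched below. It only reports positions and does not change anything, since the thread does not say what the merged token should look like. The function name and the set of quote characters are assumptions, and rows are assumed to be `(token, label)` pairs.

```python
# Hypothetical set of characters that may form a two-row quote mark.
QUOTE_CHARS = {"'", "`", "\u2018", "\u2019"}


def find_split_double_quotes(rows):
    """Return the indices i where rows[i] and rows[i + 1] are the same
    lone quote character, i.e. a quote mark split over two rows."""
    return [
        i
        for i in range(len(rows) - 1)
        if rows[i][0] in QUOTE_CHARS and rows[i + 1][0] == rows[i][0]
    ]


rows = [(",", "O"), ("'", "O"), ("'", "O"), ("Bouteflika", "B-Person")]
print(find_split_double_quotes(rows))
```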