stanfordnlp / en-worldwide-newswire

An English NER dataset built from foreign newswire

AK-47, F-35, how tokenized? #13

Open AngledLuffa opened 3 months ago

AngledLuffa commented 3 months ago

kind of weird seeing it as

F
-
35
s

on four separate lines

SecroLoL commented 3 months ago

I'm thinking either one token each (AK-47, F-35) or split at the hyphen (AK - 47).

However, when it gets to plural terms, it is very strange to do something like AK - 47 s compared to AK-47s as one token.

Therefore, I would go with the entire term being tokenized as one. Thoughts?
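To make the one-token option concrete, here is a rough sketch of how already-split pieces could be merged back. The pattern (capital letters, hyphen, digits, optional plural "s") and the name `merge_designators` are assumptions for illustration only, not anything the dataset or Stanza currently does, and the rule would not cover names like Jae-hoon.

```python
import re

# Hypothetical rule: capital letters, hyphen, digits, optional plural "s"
# (e.g. AK-47, F-35s). Not taken from any existing tokenizer.
DESIGNATOR = re.compile(r"^[A-Z]+-\d+s?$")


def merge_designators(tokens):
    """Greedily merge runs such as ["F", "-", "35", "s"] into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longer candidate first (4 pieces with plural "s", then 3).
        for width in (4, 3):
            window = tokens[i:i + width]
            candidate = "".join(window)
            if len(window) == width and window[1] == "-" and DESIGNATOR.match(candidate):
                merged.append(candidate)
                i += width
                break
        else:
            # No designator starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged
```

With this rule, `["The", "F", "-", "35", "s", "flew"]` would come out as `["The", "F-35s", "flew"]`, while an ordinary compound like `["well", "-", "known"]` is left alone.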

AngledLuffa commented 3 months ago

FWIW, EWT and LDC do the opposite and split at the hyphen. I for one am strongly against AK - 47, but unfortunately they don't seem to want a "specific name" exemption for hyphens.

https://github.com/UniversalDependencies/UD_English-EWT/issues/204

SecroLoL commented 3 months ago

Based on that conversation, didn't they end up agreeing that for compounds where a hyphen is inherent to the name, e.g. AK-47 or Jae-hoon, it should be kept as one name? That's what I took away from skimming, at least.

I'm not a fan of separating it into AK - 47, so if there's no conclusion from that thread, what do you think about us taking our own lead on this and tokenizing these cases as one token?

AngledLuffa commented 3 months ago

On the one hand, yes, but on the other hand, our tokenizer will split these differently at runtime, so its pieces won't match the annotation. Maybe that shouldn't be a consideration.
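If the mismatch does turn out to matter, one possible workaround is to project the gold tags onto whatever pieces the tokenizer produces. The sketch below assumes BIO tags and assumes the pieces concatenate back to the gold tokens exactly; `project_tags` and the PRODUCT label are hypothetical names for illustration, not part of the dataset or any library.

```python
def project_tags(gold_tokens, gold_tags, pieces):
    """Copy each gold token's BIO tag onto the runtime pieces it covers.

    Assumes the pieces, joined together, spell out the gold tokens in order.
    """
    projected = []
    piece_idx = 0
    for token, tag in zip(gold_tokens, gold_tags):
        covered = ""
        first = True
        while covered != token:
            if piece_idx >= len(pieces) or not token.startswith(covered + pieces[piece_idx]):
                raise ValueError(f"pieces do not align with gold token {token!r}")
            covered += pieces[piece_idx]
            piece_idx += 1
            # Only the first piece of a B- token keeps B-; the rest become I-.
            if tag.startswith("B-") and not first:
                projected.append("I-" + tag[2:])
            else:
                projected.append(tag)
            first = False
    if piece_idx != len(pieces):
        raise ValueError("unused pieces after the last gold token")
    return projected


# Gold annotation keeps AK-47s whole; the runtime tokenizer splits it.
tags = project_tags(
    ["an", "AK-47s"],
    ["O", "B-PRODUCT"],
    ["an", "AK", "-", "47", "s"],
)
```

Here `tags` ends up as `["O", "B-PRODUCT", "I-PRODUCT", "I-PRODUCT", "I-PRODUCT"]`, so the entity span is the same even though the token boundaries differ.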
