Open AngledLuffa opened 3 months ago
I'm thinking either AK-47, F-35 or AK - 47
However, when it gets to plural terms, it is very strange to do something like AK - 47 s compared to AK-47s as one token.
Therefore, I would go with the entire term being tokenized as one. Thoughts?
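For concreteness, here is a minimal sketch of the two options on the plural case. It assumes a hypothetical regex-based splitter and a hypothetical `NAME_WITH_HYPHEN` pattern that only covers letter-hyphen-digit designations; neither is the tokenizer or rule actually used for this dataset.

```python
import re

# Hypothetical pattern for designations with an inherent hyphen:
# letters, a hyphen, digits, and an optional plural "s" (AK-47, F-35, AK-47s).
NAME_WITH_HYPHEN = re.compile(r"[A-Za-z]+-\d+s?")

def split_pieces(word):
    """Split option: runs of letters, runs of digits, and hyphens become tokens.

    Any other characters are ignored in this sketch.
    """
    return re.findall(r"[A-Za-z]+|\d+|-", word)

def tokenize_word(word, keep_names=True):
    """Single-token option when keep_names is True and the word matches the pattern."""
    if keep_names and NAME_WITH_HYPHEN.fullmatch(word):
        return [word]
    return split_pieces(word)

print(tokenize_word("AK-47s"))                    # ['AK-47s']
print(tokenize_word("AK-47s", keep_names=False))  # ['AK', '-', '47', 's']
```

A name like Jae-hoon would need its own rule or a name list; the pattern above is only meant to show where the two conventions diverge.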
fwiw, EWT and LDC do the opposite. i for one am strongly against AK - 47 but unfortunately they don't seem to want to go with a "specific name" exemption for hyphens
https://github.com/UniversalDependencies/UD_English-EWT/issues/204
Based on that conversation, didn't they end up agreeing that for compounds where a hyphen is inherent to the name, e.g. AK-47 or Jae-hoon, it should be kept as one name? That's what I took away from skimming, at least.
I'm not a fan of separating it into AK - 47, so if there's no conclusion from that thread, what do you think about us taking our own lead on this and tokenizing these cases as one token?
on the one hand, yes, but on the other hand, our tokenizer won't agree with the pieces at runtime. maybe that shouldn't be a consideration
kind of weird seeing it as AK - 47 s on four separate lines