stanfordnlp / en-worldwide-newswire

An English NER dataset built from foreign newswire

Jakarta in Indonesia #7

Open AngledLuffa opened 1 year ago

AngledLuffa commented 1 year ago

Phrases like this: one entity or two?

CoNLL has "Old Trafford in Manchester" as two entities, but our standard would normally treat "Jakarta, Indonesia" as one.

AngledLuffa commented 1 year ago

Similar:

Querétaro       B-Location
,       I-Location
in      I-Location
Mexico  I-Location

paraguay_mercopress_7.txt.tsv

AngledLuffa commented 1 year ago

Wagga   B-Location
Wagga   I-Location
,       O
in      O
southern        O
New     B-Location
South   I-Location
Wales   I-Location

AngledLuffa commented 1 year ago

Birmingham, Alabama, in the United States - where to draw the line, or is it one entity?

AngledLuffa commented 1 year ago

Pusad   B-Location
,       O
in      O
the     O
Yavatmal        B-Location
district        O
of      O
Maharashtra     B-Location

AngledLuffa commented 1 year ago

Where "in" really becomes a problem is when including it in the span makes two tags overlap:

University      B-Organization
of      I-Organization
the     I-Organization
Witwatersrand   I-Organization  B-Location
in      O       I-Location
Johannesburg    O       I-Location
,       O       I-Location
South   O       I-Location
Africa  O       I-Location

or

Dublin  I-Organization  B-Location
in      O       I-Location
Ireland O       I-Location

University      B-Organization
of      I-Organization
Wroclaw I-Organization  B-Location
in      O
Poland  B-Location

SecroLoL commented 3 months ago

Thanks for these. I think we need to label these consistently. Maybe we can say that if only a comma separates them, it should be one entity, and if there are any other words, it shouldn't? What are your thoughts?

AngledLuffa commented 3 months ago

I think as long as we're consistent, we're fine, but the example where two labels overlap because the "in" makes for a larger LOC is rather problematic:

https://github.com/stanfordnlp/en-worldwide-newswire/issues/7#issuecomment-1352565899
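The overlap in that example can be found mechanically. A minimal sketch in Python, assuming each token row carries two tag columns as in the University of the Witwatersrand listing, with a missing second column read as O. The function name and the row layout are hypothetical and not part of this repository:

```python
from typing import List, Tuple

# (word, first-layer tag, second-layer tag); this row layout is an assumption
Row = Tuple[str, str, str]


def overlapping_tokens(rows: List[Row]) -> List[str]:
    """Return the words that are tagged as an entity in both layers."""
    return [word for word, first, second in rows if first != "O" and second != "O"]


# The Witwatersrand example from this thread, with missing columns filled as O
witwatersrand_rows: List[Row] = [
    ("University", "B-Organization", "O"),
    ("of", "I-Organization", "O"),
    ("the", "I-Organization", "O"),
    ("Witwatersrand", "I-Organization", "B-Location"),
    ("in", "O", "I-Location"),
    ("Johannesburg", "O", "I-Location"),
    (",", "O", "I-Location"),
    ("South", "O", "I-Location"),
    ("Africa", "O", "I-Location"),
]

overlap = overlapping_tokens(witwatersrand_rows)
```

Any sentence where this returns a non-empty list would need a decision about which span keeps the shared token.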

SecroLoL commented 3 months ago

Correct. I think the move here is that I'll edit all the occurrences I can find so that entities connected by commas form one entity span, while entities separated by anything else (e.g. "in") remain separate spans.
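A rough sketch of that comma rule in Python, in case it helps with finding candidates. It assumes single-column BIO tags as (word, tag) pairs and only merges spans of the same entity type. The function name is hypothetical, and this is not the script used for the actual edits:

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, BIO tag)


def merge_comma_joined(tokens: List[Token]) -> List[Token]:
    """Merge same-type entity spans that are separated by a single comma."""
    merged = list(tokens)
    for i in range(1, len(merged) - 1):
        word, tag = merged[i]
        prev_tag = merged[i - 1][1]
        next_word, next_tag = merged[i + 1]
        # Only an untagged comma sitting between two entity spans qualifies
        if word != "," or tag != "O":
            continue
        if prev_tag == "O" or not next_tag.startswith("B-"):
            continue
        prev_type = prev_tag.split("-", 1)[1]
        next_type = next_tag.split("-", 1)[1]
        if prev_type == next_type:
            # Pull the comma and the start of the next span into the previous span
            merged[i] = (word, "I-" + prev_type)
            merged[i + 1] = (next_word, "I-" + next_type)
    return merged


jakarta = [("Jakarta", "B-Location"), (",", "O"), ("Indonesia", "B-Location")]
wroclaw = [("Wroclaw", "B-Location"), ("in", "O"), ("Poland", "B-Location")]

jakarta_merged = merge_comma_joined(jakarta)
wroclaw_merged = merge_comma_joined(wroclaw)
```

Spans separated by "in" are left alone, as proposed. Note that a plain enumeration such as "France, Germany" would also be merged by this sketch, so its output would need manual review before being committed.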