Open AngledLuffa opened 1 year ago
Similar:
Querétaro B-Location
, I-Location
in I-Location
Mexico I-Location
paraguay_mercopress_7.txt.tsv
Wagga B-Location
Wagga I-Location
, O
in O
southern O
New B-Location
South I-Location
Wales I-Location
Birmingham, Alabama, in the United States - where to draw the line, or is it one entity?
Pusad B-Location
, O
in O
the O
Yavatmal B-Location
district O
of O
Maharashtra B-Location
Where "in" really becomes a problem is when it merges multiple tags, so that two labels overlap on the same token:
University B-Organization
of I-Organization
the I-Organization
Witwatersrand I-Organization B-Location
in O I-Location
Johannesburg O I-Location
, O I-Location
South O I-Location
Africa O I-Location
or
Dublin I-Organization B-Location
in O I-Location
Ireland O I-Location
University B-Organization
of I-Organization
Wroclaw I-Organization B-Location
in O
Poland B-Location
Thanks for these. I think we need to label these consistently. Maybe we can say that if it's just a comma separating them, it should be one entity, and if there are any other words between them, it shouldn't? What are your thoughts?
I think as long as we're consistent, we're fine, but the example where two labels overlap because the "in" makes for a larger LOC is rather problematic
https://github.com/stanfordnlp/en-worldwide-newswire/issues/7#issuecomment-1352565899
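One way to find the overlapping cases is to scan the two tag columns and flag any token that both mark as part of an entity. This is only a rough sketch: it assumes whitespace-separated `token tag [tag]` rows like the examples above, with a missing second column treated as `O`. The helper names `parse_rows` and `find_overlaps` are made up for illustration and are not part of any existing tooling.

```python
def parse_rows(text):
    """Parse 'token tag [tag]' lines into (token, tag_a, tag_b) tuples.

    Assumption: a missing second column means 'O' in that layer.
    """
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        token, tags = parts[0], parts[1:]
        while len(tags) < 2:
            tags.append("O")
        rows.append((token, tags[0], tags[1]))
    return rows


def find_overlaps(rows):
    """Return the tokens that are tagged as an entity in both layers."""
    return [tok for tok, a, b in rows if a != "O" and b != "O"]


example = """
University B-Organization
of I-Organization
the I-Organization
Witwatersrand I-Organization B-Location
in O I-Location
Johannesburg O I-Location
, O I-Location
South O I-Location
Africa O I-Location
"""

overlapping = find_overlaps(parse_rows(example))
print(overlapping)
```

On the Witwatersrand example this flags only the token where the Organization span ends and the larger Location span begins.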
Correct. I think the move here is that I'll edit all occurrences I can find so that entities connected only by commas form one entity span, and entities separated by anything else (e.g. "in") stay as separate spans.
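For what it's worth, the comma rule above could be sketched roughly like this. It is only an illustration of the proposed convention, not the script actually used for the edits: it assumes BIO tags, a single tag layer, and that only spans of the same type get merged. The name `merge_comma_spans` is hypothetical.

```python
def merge_comma_spans(tokens, tags):
    """Merge same-type entities separated by exactly one comma into one span.

    Assumptions: BIO tags, one tag per token, and the comma itself is
    currently tagged 'O'. Anything else between entities (e.g. "in")
    leaves the spans separate.
    """
    merged = list(tags)
    for i in range(1, len(tokens) - 1):
        if tokens[i] != "," or merged[i] != "O":
            continue
        prev_tag, next_tag = merged[i - 1], merged[i + 1]
        # Need an entity right before the comma and a new span right after it.
        if prev_tag == "O" or not next_tag.startswith("B-"):
            continue
        label = prev_tag.split("-", 1)[1]
        if next_tag.split("-", 1)[1] != label:
            continue
        # Pull the comma and the start of the next span into the previous span.
        merged[i] = "I-" + label
        merged[i + 1] = "I-" + label
    return merged


print(merge_comma_spans(
    ["Jakarta", ",", "Indonesia"],
    ["B-Location", "O", "B-Location"],
))
print(merge_comma_spans(
    ["Wroclaw", "in", "Poland"],
    ["B-Location", "O", "B-Location"],
))
```

With this sketch "Jakarta, Indonesia" becomes a single Location span, while "Wroclaw in Poland" is left as two.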
Phrases like this: one entity or two?
CoNLL has "Old Trafford in Manchester" as two, but our standard would normally have "Jakarta, Indonesia" as one