mediacloud / cliff-annotator

A lightweight server to allow HTTP requests to the Stanford Named Entity Recognizer and a heavily modified CLAVIN geoparser.
https://cliff.mediacloud.org
Apache License 2.0
119 stars 35 forks

Manually adjust bad places? #12

Closed kanarinka closed 9 years ago

kanarinka commented 10 years ago

"Reddit" for example is matching to Reddit Creek, CA Not sure if this is still the case but all mentions of "Washington" were being located to WA state [EDIT -- I checked and Washington is still being located to WA state]

Should we keep a running list of these for manually extracting later?

kanarinka commented 10 years ago

"Amazon" is getting matched to Benin

kanarinka commented 10 years ago

"Latin America" is getting located to Illinois:

id: 4899401, lon: -87.65339, source: { charIndex: 178, string: "Latin America" }, name: "Latin America Lutheran Church", countryCode: "US", state: "IL", featureCode: "CH", confidence: 1, lat: 41.94142

kanarinka commented 10 years ago

"Africa" matching to "Mahdia, Tunisia" because of data error from geonames.org. I updated the alternative names on geonames.org but we will have to adjust for this manually or download new data.

rahulbot commented 10 years ago

I fixed "Reddit" and "Amazon". What should "Latin America" and "Africa" resolve to? You can blacklist them for now if that is easiest.
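For reference, a blacklist like the one suggested above could be sketched as a post-processing filter over the geoparser results. This is only an illustration in Python, not CLIFF's actual code: the `BLACKLISTED_MENTIONS` set and the `drop_blacklisted` function are hypothetical names, and the record shape simply mirrors the "Latin America" result quoted earlier in this thread.

```python
from typing import Iterable, List

# Hypothetical list of mention strings to suppress until they resolve correctly.
BLACKLISTED_MENTIONS = {"latin america", "africa"}


def drop_blacklisted(places: Iterable[dict]) -> List[dict]:
    """Return only the places whose source mention is not blacklisted."""
    kept = []
    for place in places:
        # Compare case-insensitively on the text that triggered the match.
        mention = place["source"]["string"].strip().lower()
        if mention not in BLACKLISTED_MENTIONS:
            kept.append(place)
    return kept


results = [
    # Record shape copied from the result quoted in this thread.
    {"id": 4899401, "name": "Latin America Lutheran Church",
     "source": {"charIndex": 178, "string": "Latin America"}},
    # Placeholder id and charIndex, for illustration only.
    {"id": 0, "name": "Ghana",
     "source": {"charIndex": 10, "string": "Ghana"}},
]

filtered = drop_blacklisted(results)
```

Filtering on the mention string rather than on the resolved place id means a blacklisted phrase is dropped no matter which gazetteer entry it happened to match.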

kanarinka commented 10 years ago

Latin America is tough, but Africa should resolve to the continent. It's fixed in geonames.org now. What's the process for updating the Lucene index with new data? Maybe we should do that together, and then I'm happy to document it in the README, since we'll have to do that periodically.

kanarinka commented 10 years ago

Regarding Latin America --

One thing we could consider, which is more of a feature, is building in custom support for regions, areas, and continents. Geonames' gazetteer has the idea of "areas", but these don't correspond exactly to regions and don't seem to resolve properly. For example, in geonames.org "Latin America" gets located to an "area" that is in Brazil, and "Eastern Africa" doesn't have a good match.

In Terra Incognita I've been using regions as defined by the UN: http://unstats.un.org/unsd/methods/m49/m49regin.htm

So if CLIFF reports that an article is about Ghana, Terra Incognita associates the region of West Africa and the continent of Africa with that mention. We could consider doing this at the CLIFF level, for example by returning aboutness for regions and continents as well.

This could go in both directions: a mention of "Latin America" could be counted (for aboutness purposes) as a mention of all of the countries in Latin America, and a mention of Brazil would also count towards the region of "Latin America and the Caribbean".
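A rough sketch of the two directions, in Python. Everything here is an assumption for illustration: the lookup tables are a tiny hand-picked subset rather than the full UN list, the region labels follow the wording used in this thread, and `roll_up` and `expand_region` are hypothetical names, not part of CLIFF or Terra Incognita.

```python
from collections import Counter

# Tiny illustrative subset; a real table would be built from the UN region list.
COUNTRY_TO_REGION = {
    "GH": "West Africa",
    "NG": "West Africa",
    "BR": "Latin America and the Caribbean",
    "MX": "Latin America and the Caribbean",
}
REGION_TO_CONTINENT = {
    "West Africa": "Africa",
    "Latin America and the Caribbean": "Americas",
}


def roll_up(country_counts):
    """Turn country mention counts into region and continent counts."""
    regions, continents = Counter(), Counter()
    for code, n in country_counts.items():
        region = COUNTRY_TO_REGION.get(code)
        if region is None:
            continue  # country not covered by the lookup table
        regions[region] += n
        continents[REGION_TO_CONTINENT[region]] += n
    return regions, continents


def expand_region(region):
    """Turn a region mention into the sorted list of its member country codes."""
    return sorted(c for c, r in COUNTRY_TO_REGION.items() if r == region)


# Upward: two mentions of Ghana and one of Brazil.
regions, continents = roll_up({"GH": 2, "BR": 1})
# Downward: a mention of the region counts for each member country.
members = expand_region("Latin America and the Caribbean")
```

Keeping the upward and downward mappings as separate functions leaves it open whether a region mention should really count for every member country, which is one of the things worth discussing.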

Anyways, something to discuss...

rahulbot commented 9 years ago

I'm going to close this bug, since too many issues are combined here. I fixed many of them already. Please split off individual errors into separate bugs (like #25 and #12 do).