Closed by kanarinka 9 years ago
"Amazon" is getting matched to Benin
"Latin America" is getting located to Illinois:
id: 4899401, lon: -87.65339, source: { charIndex: 178, string: "Latin America" }, name: "Latin America Lutheran Church", countryCode: "US", state: "IL", featureCode: "CH", confidence: 1, lat: 41.94142
"Africa" is matching to "Mahdia, Tunisia" because of a data error in geonames.org. I updated the alternative names on geonames.org, but we will have to adjust for this manually or download new data.
I fixed "Reddit" and "Amazon". What should "Latin America" and "Africa" resolve to? You can blacklist them for now if that is easiest.
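For reference, the blacklist option could look roughly like the sketch below. It is a minimal illustration, not CLIFF's actual mechanism: `BLACKLIST` and `drop_blacklisted` are hypothetical names, and the candidate shape is only assumed from the record pasted above (a `source` object with a `string` field).

```python
# Hypothetical blacklist; the real entries and where they are stored are assumptions.
BLACKLIST = {"latin america", "africa"}


def drop_blacklisted(candidates, blacklist=BLACKLIST):
    """Return only the candidates whose source string is not blacklisted.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    return [
        c for c in candidates
        if c["source"]["string"].strip().lower() not in blacklist
    ]


# Example candidates, shaped like the record quoted earlier in this thread.
candidates = [
    {"name": "Latin America Lutheran Church", "source": {"string": "Latin America"}},
    {"name": "Ghana", "source": {"string": "Ghana"}},
]
kept = drop_blacklisted(candidates)
```

Filtering on the source string rather than the resolved place keeps the bad match out without hiding legitimate matches for other strings.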
Latin America is tough, but Africa should resolve to the continent. It's fixed in geonames.org now. What's the process for updating the Lucene index with new data? Maybe we should do that together, and then I'm happy to document it in the README, since we'll have to do that periodically.
Regarding Latin America --
One thing we could consider, which is more of a feature, is building in some custom support for regions, areas and continents. The Geonames gazetteer has the idea of "areas", but these don't correspond exactly to regions and don't seem to resolve properly. For example, in geonames.org "Latin America" gets located to an "area" in Brazil, and "Eastern Africa" doesn't have a good match.
In Terra Incognita I've been using regions as defined by the UN - http://unstats.un.org/unsd/methods/m49/m49regin.htm
So if an article is returned by CLIFF that is about Ghana, Terra Incognita associates the region of West Africa and the continent of Africa to that mention. We could consider doing this at the CLIFF level - like returning aboutness for regions and continents as well.
This could go in both directions - so a mention of "Latin America" could be counted (for aboutness purposes) as a mention of all of the countries in Latin America. And also a mention of Brazil would count towards the region of "Latin America and the Caribbean".
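To make the two directions concrete, here is a rough sketch. It assumes a small hand-written lookup table in place of the full UN M49 data, and `REGIONS`, `roll_up` and `expand_region` are hypothetical names, not anything that exists in CLIFF or Terra Incognita.

```python
from collections import Counter

# Illustrative subset only; a real table would be built from the UN M49 list.
# Maps ISO country code -> (region, continent).
REGIONS = {
    "GH": ("Western Africa", "Africa"),
    "KE": ("Eastern Africa", "Africa"),
    "BR": ("Latin America and the Caribbean", "Americas"),
    "MX": ("Latin America and the Caribbean", "Americas"),
}


def roll_up(country_mentions):
    """Country -> region/continent: add each country's count to its parents."""
    regions, continents = Counter(), Counter()
    for code, count in country_mentions.items():
        if code not in REGIONS:
            continue  # unknown country: nothing to roll up
        region, continent = REGIONS[code]
        regions[region] += count
        continents[continent] += count
    return regions, continents


def expand_region(region_name):
    """Region -> countries: the codes a region mention would count towards."""
    return sorted(c for c, (r, _) in REGIONS.items() if r == region_name)


regions, continents = roll_up({"GH": 2, "BR": 1})
latin_america = expand_region("Latin America and the Caribbean")
```

With this shape, an article mentioning Ghana twice contributes 2 to both "Western Africa" and "Africa", while a mention of the region itself can be expanded to its member countries for aboutness purposes.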
Anyways, something to discuss...
I'm going to close this bug - too many issues combined here. I fixed many of them already. Please split off individual errors to bugs (like #25 and #12 do).
"Reddit", for example, is matching to Reddit Creek, CA. Not sure if this is still the case, but all mentions of "Washington" were being located to WA state [EDIT -- I checked and Washington is still being located to WA state]
Should we keep a running list of these for manually extracting later?