semio / ddf--gapminder--gapminder_world


tls double datapoints in http://www.emdat.be indicators #13

Open jheeffer opened 8 years ago

jheeffer commented 8 years ago

It seems that East Timor was renamed to Timor-Leste at some point in the http://www.emdat.be dataset.

That's why, in the Gapminder World Google spreadsheets with emdat.be data, it appears under both names on separate rows: https://docs.google.com/spreadsheets/u/1/d/1EMSP8rthB6yAxj3GtPAcssfP0HHPfujRS0YDPmD1NRY/pub https://docs.google.com/spreadsheets/u/1/d/1_UEhuCQeH5MySwuOKmawjRNeQkwP2vJx0rZb7Wgq2wE/pub#

More should be added as we find them.

If you look closely, you can see that the two rows never both have a number > 0 for the same year. Somewhere between 2003 and 2007, emdat.be must have changed the name. Fill in Timor-Leste here and you can see that the combined numbers are correct: http://www.emdat.be/country_profile/index.html

Both names are translated to tls when converting Gapminder World to DDFcsv. Therefore, there are tls datapoints for both East Timor and Timor-Leste, and thus duplicate keys.
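To make the duplicate-key problem concrete, here is a minimal sketch in Python. It is not the actual conversion script: the `NAME_TO_GEO` mapping, the row layout, and the values are made up for illustration, and only the two country names and the `tls` id come from this issue.

```python
from collections import Counter

# Hypothetical name-to-id mapping; only these two names and "tls" come from the issue.
NAME_TO_GEO = {"East Timor": "tls", "Timor-Leste": "tls"}

# Made-up rows in (country name, year, value) form, for illustration only.
rows = [
    ("East Timor", 2002, 1.0),
    ("East Timor", 2008, 0.0),
    ("Timor-Leste", 2002, 0.0),
    ("Timor-Leste", 2008, 3.0),
]


def find_duplicate_keys(rows, name_to_geo):
    """Return the (geo, year) keys that occur more than once after mapping names to ids."""
    counts = Counter((name_to_geo[name], year) for name, year, _ in rows)
    return sorted(key for key, n in counts.items() if n > 1)


duplicates = find_duplicate_keys(rows, NAME_TO_GEO)
print(duplicates)
```

Every year present in both source rows ends up as a duplicate (geo, year) key, even when one of the two values is 0.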

How to solve:

1) Update the source, combining East Timor and Timor-Leste data into one row.
2) Make the script smart so it merges the two.
3) Keep the script dumb and make an exact copy of the data as it is: make sure there are two separate entity ids for East Timor and Timor-Leste. Though this keeps the error in the dataset (not sound).
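Option 2 could look roughly like the following Python sketch. It assumes that a value of 0 means "nothing recorded under this name", which matches the observation that the two rows never both have a number > 0 for the same year. The function name, mapping, row layout, and values are hypothetical and not taken from the real script or data.

```python
# Hypothetical name-to-id mapping; only these two names and "tls" come from the issue.
NAME_TO_GEO = {"East Timor": "tls", "Timor-Leste": "tls"}

# Made-up rows in (country name, year, value) form, for illustration only.
rows = [
    ("East Timor", 2002, 1.0),
    ("East Timor", 2008, 0.0),
    ("Timor-Leste", 2002, 0.0),
    ("Timor-Leste", 2008, 3.0),
]


def merge_rows(rows, name_to_geo):
    """Merge rows that map to the same (geo, year) key.

    Assumption: 0 means "no data under this name", so a non-zero value wins.
    Two different non-zero values for the same key are treated as a real conflict.
    """
    merged = {}
    for name, year, value in rows:
        key = (name_to_geo[name], year)
        previous = merged.get(key, 0.0)
        if previous and value and previous != value:
            raise ValueError(f"conflicting values for {key}: {previous} vs {value}")
        merged[key] = value or previous
    return merged


merged = merge_rows(rows, NAME_TO_GEO)
print(merged)
```

Raising on conflicting non-zero values, rather than silently summing or overwriting, means the script would flag any other renamed country whose rows really do overlap.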

This is not a priority, as ddf--cred--em_dat should overwrite this data correctly in SG. Fixing it would be purely to make this historic dataset valid and sound.