sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License

clean_country() does not recognize countries belonging to the UK as countries #748

Open FabianPalmaPando opened 2 years ago

FabianPalmaPando commented 2 years ago

clean_country() applied to England and Scotland returns NaN. I believe this would happen for all countries belonging to the UK. It would be nice if the function recognized both cases, United Kingdom and England (for example), as different countries, depending on the input.

Thanks for creating such an amazing library! :)

qidanrui commented 2 years ago

Hi! Thank you for your brilliant advice. You're right that we need to consider the details of different countries! Also, if you are interested, you are welcome to add what you like to country_data.tsv and open a PR!

moreaupascal56 commented 2 years ago

Hi, I just started looking at this project! It looks amazing, @qidanrui! :)

Btw, this issue arises because the countries inside the UK are not ISO countries (list here (Wikipedia); you can see that Ireland is there, but neither Northern Ireland nor England is). I saw that a similar issue exists on the PHP repo umpirsky/country-list.

Maybe an option in clean_country() would be nice? Something like clean_country(include_non_iso=True), with a default of False, in order to include the data from country_data.tsv plus a new file, country_non_iso_data.tsv (with the list of UK countries, and maybe more if there are any 🤔), since apparently ISO is the norm in all country lists and packages.
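
A rough sketch of how such a flag could behave. This is not dataprep code: lookup_country, ISO_COUNTRIES, and NON_ISO_COUNTRIES are hypothetical names, and the two dicts are toy stand-ins for country_data.tsv and the proposed country_non_iso_data.tsv.

```python
import math
from typing import Dict, Union

# Toy stand-in for a few rows of country_data.tsv (ISO 3166-1 countries).
ISO_COUNTRIES: Dict[str, str] = {
    "united kingdom": "United Kingdom",
    "ireland": "Ireland",
}

# Toy stand-in for the proposed country_non_iso_data.tsv (hypothetical file).
NON_ISO_COUNTRIES: Dict[str, str] = {
    "england": "England",
    "scotland": "Scotland",
    "wales": "Wales",
    "northern ireland": "Northern Ireland",
}


def lookup_country(value: str, include_non_iso: bool = False) -> Union[str, float]:
    """Return the canonical name, or NaN when the value is not recognized."""
    key = value.strip().lower()
    if key in ISO_COUNTRIES:
        return ISO_COUNTRIES[key]
    # Only consult the non-ISO table when the caller opts in.
    if include_non_iso and key in NON_ISO_COUNTRIES:
        return NON_ISO_COUNTRIES[key]
    return math.nan
```

With the default, lookup_country("England") stays NaN as today, so existing behavior would not change; only callers passing include_non_iso=True would get "England" back.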

moreaupascal56 commented 2 years ago

Btw, another issue is that these are not ISO countries but ISO "principal subdivisions of a country". Their ISO codes are built on the UK one, like GB-ENG for England (https://en.wikipedia.org/wiki/ISO_3166-2:GB), so we don't have proper values for the alpha-2, alpha-3, and numeric columns (nor for regex, but there we can put the country name).

I saw that NaN values are no problem in country_data.tsv, but I guess the codes are expected to be strings with a maximum length of 2 or 3.
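
To make the problem concrete, here is a sketch of what rows in such a file might look like. The column names (including the extra subdivision column) are my assumption, not the actual layout of country_data.tsv; the GB-* values are the ISO 3166-2:GB codes for the four UK countries.

```python
import csv
import io

# Hypothetical contents for country_non_iso_data.tsv; column names are assumed.
# alpha-2, alpha-3 and numeric are left empty because no ISO 3166-1 code exists.
NON_ISO_TSV = (
    "name\talpha-2\talpha-3\tnumeric\tsubdivision\n"
    "England\t\t\t\tGB-ENG\n"
    "Scotland\t\t\t\tGB-SCT\n"
    "Wales\t\t\t\tGB-WLS\n"
    "Northern Ireland\t\t\t\tGB-NIR\n"
)


def load_rows(text):
    """Parse the TSV and turn empty cells into None so missing codes are explicit."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [{key: (cell or None) for key, cell in row.items()} for row in reader]


rows = load_rows(NON_ISO_TSV)
```

Each subdivision code is 6 characters long, so it would not fit a column that assumes 2 or 3 characters, which is why a separate column (or a separate file) seems safer than reusing alpha-2 or alpha-3.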