joaoftrodrigues / atpdb-cleanup-and-conversion-to-relational

0 stars 0 forks source link

Validate Countries #12

Open joaoftrodrigues opened 1 year ago

joaoftrodrigues commented 1 year ago

Description

Check if countries are well written, and correct wrong ones.

Process

  1. Get countries' dataset
  2. Compare original values with database values (needs split of locations)
  3. Assign correct values to a new field 'country' | Check wrong values to correct

Countries from countries' dataset might have to be edited.

Dataset of countries

Dataset provided by teacher: https://datahub.io/core/country-list#data

Useful commands

Imports

Import countries to a collection (option file customizable) mongoimport --db atp --collection countries --drop --file D:\MCD\1\BDDA\TP\data\countries_code.csv --type=csv --headerline

mongoimport --db atp --collection countryCodes3L --drop --file D:\MCD\1\BDDA\TP\data\wikipedia-iso-country-codes.csv --type=csv --headerline

Others for data checking

Merge atp players' info with countries that match *can be explained better: db.atpplayers.aggregate([{ "$lookup": { from: "countries", localField: "Country", foreignField: "Name", as: "CountryMatch" }}, {$out: {db: "atp", coll:"matchesVerification"}}])

Separate unique values: db.matchesVerification.aggregate([ {$match: {CountryMatch: {$size: 0}}}, {$group: {"_id": null, "CountryEmpty": {$addToSet: "$Country"}}}])

joaoftrodrigues commented 1 year ago

Special Case

From #9 , appeared a player with Born field only having city.

flavio_cobollli_no_country

The city is located in Italy, so the country was changed from null to "Italy":

db.atpplayers.updateMany({PlayerName: 'Flavio Cobolli'}, {$set:{BornCountry: "Italy"}})

joaoftrodrigues commented 1 year ago

Changes to countries dataset

Changed:

Added:

joaoftrodrigues commented 1 year ago

Values' change (Multi)

Format db.atpplayers.updateMany({Country: {$regex: "REGEX"}},{$set: {Country: "CORRECT COUNTRY"}})

Template db.atpplayers.updateMany({Country: {$regex: " "}},{$set: {Country: " "}})

Argentina

db.atpplayers.updateMany({Country: {$regex: "Arg"}},{$set: {Country: "Argentina"}})

Values affected

* keeps its value

Arg. and Argent have Argentina has value more close to. Nonetheless, checking "Arg." 's location, it was spotted on Buenos Aires, confirming Argentina.

db.atpplayers.updateMany({Country: "Buenos Aires"}, {$set: {Country: "Argentina"}})

Australia

db.atpplayers.updateMany({Country: {$in: ["Tasmania", "Victoria", "Devonport", "West Perth"]}}, {$set: {Country: "Australia"}})

Values affected

Austria

db.atpplayers.updateMany({Country: {$in: ["Aut.", "Portschach"]}}, {$set: {Country: "Austria"}})

Values affected

Bosnia and Herzegovina

db.atpplayers.updateMany({Country: {$regex: "Bosnia"}},{$set: {Country: "Bosnia and Herzegovina"}})

Values affected

Brazil

db.atpplayers.updateMany({Country: {$in: ["Braz", "Bahia","Brasilia","Brazi","Florianapolis","Guarulhos","Rio de Janeiro"]}}, {$set: {Country: "Brazil"}})

Values affected

Canada

db.atpplayers.updateMany({Country: {$in: ["Alberta", "Canda", "Ontaria", "Calgary"]}}, {$set: {Country: "Canada"}})

Values affected

China

db.atpplayers.updateMany({Country: {$in: ["Nanjing", "Wuhan"]}}, {$set: {Country: "China"}})

Values affected

Colombia

db.atpplayers.updateMany({Country: {$regex: "(Columbia)|(Col$)", $options:"m"}},{$set: {Country: "Colombia"}})

Values affected

db.atpplayers.updateMany({Country: "Villavicencio"}, {$set: {Country: "Colombia"}})

Côte d'Ivoire

db.atpplayers.updateMany({Country: {$in: ["Ivory Coast", "Abidjan"]}}, {$set: {Country: "Côte d'Ivoire"}})

Values affected

Cuba

db.atpplayers.updateMany({Country: {$in:["Ciudad de Habana", "Havana"]}}, {$set: {Country: "Cuba"}})

Values affected

Czech Republic

db.atpplayers.updateMany({Country: {$regex: "Czech Rep"}},{$set: {Country: "Czech Republic"}})

Values affected

* keeps its value

Dominican Republic

db.atpplayers.updateMany({Country:{$in:["D.R.", "Domincan Republic", "Santo Domingo"]}}, {$set: {Country: "Dominican Republic"}})

Values affected

Ecuador

db.atpplayers.updateMany({Country: {$in: ["Guayaquil", "Salinas"]}}, {$set: {Country: "Ecuador"}})

Values affected

Egypt

db.atpplayers.updateMany({Country: {$regex: "(Egpyt)|(Eygpt)"}},{$set: {Country: "Egypt"}})

Values affected

db.atpplayers.updateMany({Country: "Sharm El Sheikh"}, {$set: {Country: "Egypt"}})

El Salvador

db.atpplayers.updateMany({Country: {$in: ["Salvador", "San Salvador"]}}, {$set: {Country: "El Salvador"}})

Values affected

France

db.atpplayers.updateMany({Country: {$in: ["Bourg", "Forbach", "Lesser Antilles", "New Caledoni", "Reunion Island", "Uriage", "Angleur - Liege", "Pau","Toulouse"]}}, {$set: {Country: "France"}})

Values affected

Germany

db.atpplayers.updateMany({Country: {$in:["Wetzlar", "Braunschweig"]}},{$set: {Country: "Germany"}})

Values affected

Guatemala

db.atpplayers.updateMany({Country: {$in:["GUA","Guatemala City"]}}, {$set: {Country: "Guatemala"}})

Values affected

Hungary

db.atpplayers.updateMany({Country: {$in: ["Hu", "Hunary"]}}, {$set: {Country: "Hungary"}})

Values affected

Indonesia

db.atpplayers.updateMany({Country: {$in:["INA","Jakarta", "Surabaya"]}}, {$set: {Country: "Indonesia"}})

Values affected

Israel

db.atpplayers.updateMany({Country: {$in:["Hasharon", "Isra", "Ramat Hasharon"]}}, {$set: {Country: "Israel"}})

Values affected

Italy

db.atpplayers.updateMany({Country: {$in:["Ita", "Sardinia", "Reggio Calabria"]}}, {$set: {Country: "Italy"}})

Values affected

Luxembourg

db.atpplayers.updateMany({Country:{$in: ["Esch", "Mondorf-Les-Bains", "Esch/Alzette"]}},{$set: {Country: "Luxembourg"}})

Values affected

Malaysia

db.atpplayers.updateMany({Country:{$in:["MAS","Kuala Lampur"]}}, {$set: {Country: "Malaysia"}})

Values affected

Mexico

db.atpplayers.updateMany({Country: {$in: ["Meixco","Mexica", "Mexico City", "Toluca"]}}, {$set: {Country: "Mexico"}})

Values affected

Netherlands

Regular expression (close to Netherlands)

db.atpplayers.updateMany({Country: {$regex: "Nethe", $options: "m"}},{$set: {Country: "Netherlands"}})

Values affected

Variances

db.atpplayers.updateMany({Country: {$in: ["Curacao", "Dutch Anti", "Dutch Antil","NED","The Hague", "'s-Hertogenbosch","Elndhoven", "Willemstad"]}},{$set: {Country: "Netherlands"}})

Values affected

New Zealand

db.atpplayers.updateMany({Country: {$in: ["New Zealan", "Wellington"]}}, {$set: {Country: "New Zealand"}})

Values affected

North Macedonia

Although 'Macedonia' is a shorter version, it's used to name a region which is not entirely from North Macedonia country.

db.atpplayers.updateMany({Country: {$regex:"Mace"}}, {$set: {Country: "North Macedonia"}})

Values affected

Philippines

db.atpplayers.updateMany({Country: {$in: ["PHI", "Phillipines", "Manila", "Manilla"]}}, {$set: {Country: "Philippines"}})

Values affected

Portugal

db.atpplayers.updateMany({Country: {$in: ["Estoril", "Faro"]}}, {$set: {Country: "Portugal"}})

Values affected

Romania

db.atpplayers.updateMany({Country: {$in:["Bucharest","Constanta"]}}, {$set: {Country: "Romania"}})

Values affected

Russia

db.atpplayers.updateMany({Country: {$in: ["Korolev", "Moscow"]}}, {$set: {Country: "Russia"}})

Values affected

Serbia

db.atpplayers.updateMany({Country: {$in: ["Belgrade", "SCG", "SErgia "]}}, {$set: {Country: "Serbia"}})

Values affected

Slovakia

db.atpplayers.updateMany({Country: {$regex: "^Slov[akia]+", $options: "m"}},{$set: {Country: "Slovakia"}})

Values affected

* keeps its value

db.atpplayers.updateMany({Country: {$in: ["Bratislava", "Presov"]}}, {$set: {Country: "Slovakia"}})

Slovenia

db.atpplayers.updateMany({Country: {$in: ["Solvenia", "Portoroz"]}}, {$set: {Country: "Slovenia"}})

Values affected

South Africa

db.atpplayers.updateMany({Country: {$regex: "(South$)|South Africa", $options: "m"}},{$set: {Country: "South Africa"}})

Values affected

* This value is justified to be substituted to "South Africa", some comments below.

db.atpplayers.updateMany({Country: {$in: ["Johannesburg", "Nelspruit", "Polokwane", "Pretoria"]}}, {$set: {Country: "South Africa"}})

Values affected

Spain

db.atpplayers.updateMany({Country: {$in:["Bakio","Melilla", "Valldoreix","Barcelona", "Palma de Mallorca"]}}, {$set: {Country: "Spain"}})

Values affected

Switzerland

db.atpplayers.updateMany({Country: {$in:["Switz.", "Neuchatel"]}}, {$set: {Country: "Switzerland"}})

Values affected

Taiwan

Taipei variances

db.atpplayers.updateMany({Country: {$regex: "(Chinese Ta[ipei]*)|(Taipei)"}},{$set: {Country: "Taiwan"}})

Values affected

Others

db.atpplayers.updateMany({Country: {$in:["TPE", "Kaohsiung"]}}, {$set: {Country: "Taiwan"}})

Tunisia

db.atpplayers.updateMany({Country: {$in: ["Monastir", "Tunis"]}}, {$set: {Country: "Tunisia"}})

Values affected

United Kingdom

db.atpplayers.updateMany({Country: {$in: ["Britain","Great Britain", "UK", "Devon", "London"]}}, {$set: {Country: "United Kingdom"}})

Values affected

Uruguay

db.atpplayers.updateMany({Country: {$in: ["Uraguay", "URU", "Urugay", "Montevideo"]}},{$set: {Country: "Uruguay"}})

Values affected

USA (United States of America)

db.atpplayers.updateMany({Country: {$regex: "^U[SA.]+$", $options: "m"}},{$set: {Country: "United States"}})

Values affected

db.atpplayers.updateMany({Country: {$in:["OK","Texas", "United S", "United States of America", "Delray Beach"]}}, {$set: {Country: "United States"}})

Values affected

Uzbekistan

db.atpplayers.updateMany({Country: {$in: ["Uzb.", "Tashkent"]}}, {$set: {Country: "Uzbekistan"}})

Values affected

Venezuela

db.atpplayers.updateMany({Country: {$in:["Venezeuela", "Lara"]}}, {$set: {Country: "Venezuela"}})

Values affected

joaoftrodrigues commented 1 year ago

South Case

It was spotted countries having the value "South" only. After a match of distinct locations, containing it, it only shown locations on Doornfontein, a city from South Africa.

image

This way, the only locations having "South" as country, are "South Africa".

joaoftrodrigues commented 1 year ago

The Nethe Case

Spotted a Country written only "The Nethe", as it could be "The Netherlands". Nonetheless, it was verified, and based on the city, its is, indeed.

image

joaoftrodrigues commented 1 year ago

Values' change of individual cases

This is separated from regular expressions, for size purposes.

db.atpplayers.updateMany({Country: " "}, {$set: {Country: " "}})

Countries

Austria

db.atpplayers.updateMany({Country: "Aut."}, {$set: {Country: "Austria"}})

Belarus

db.atpplayers.updateMany({Country: "Minsk"}, {$set: {Country: "Belarus"}})

Belgium

db.atpplayers.updateMany({Country: "Liege"}, {$set: {Country: "Belgium"}})

Bolivia

db.atpplayers.updateMany({Country: "Santa Cruz de la Sie"}, {$set: {Country: "Bolivia"}})

Botswana

db.atpplayers.updateMany({Country: "Botwana"}, {$set: {Country: "Botswana"}})

Chile

db.atpplayers.updateMany({Country: "Santiago"}, {$set: {Country: "Chile"}})

China

db.atpplayers.updateMany({Country: "Nanjing"}, {$set: {Country: "China"}})

Croatia

db.atpplayers.updateMany({Country: "Crotia"}, {$set: {Country: "Croatia"}})

Denmark

db.atpplayers.updateMany({Country: "DEN."}, {$set: {Country: "Denmark"}})

India

db.atpplayers.updateMany({Country: "Kolkata"}, {$set: {Country: "India"}})

Iran

db.atpplayers.updateMany({Country: "Tehran"}, {$set: {Country: "Iran"}})

Ireland

db.atpplayers.updateMany({Country: "Dublin"}, {$set: {Country: "Ireland"}})

Fiji

db.atpplayers.updateMany({Country: "Fiji Island"}, {$set: {Country: "Fiji"}})

Moldova

db.atpplayers.updateMany({Country: "Moldovia"}, {$set: {Country: "Moldova"}})

Morocco

db.atpplayers.updateMany({Country: "Casablanca"}, {$set: {Country: "Morocco"}})

Pakistan

db.atpplayers.updateMany({Country: "Lahore"}, {$set: {Country: "Pakistan"}})

Poland

db.atpplayers.updateMany({Country: "Wrocklaw"},{$set: {Country: "Poland"}})

Qatar

db.atpplayers.updateMany({Country: "Doha"}, {$set: {Country: "Qatar"}})

Serbia and Montenegro

db.atpplayers.updateMany({Country: "Serbia & Montenegro"}, {$set: {Country: "Serbia and Montenegro"}})

South Korea

db.atpplayers.updateMany({Country: "Korea"}, {$set: {Country: "South Korea"}})

Switzerland

db.atpplayers.updateMany({Country: "Switz."}, {$set: {Country: "Switzerland"}})

Thailand

db.atpplayers.updateMany({Country: "Bangkok"}, {$set: {Country: "Thailand"}})

Trinidad and Tobago

db.atpplayers.updateMany({Country: "TRI"}, {$set: {Country: "Trinidad and Tobago"}})

Turkey

db.atpplayers.updateMany({Country: "Antalya"}, {$set: {Country: "Turkey"}})

United Arab Emirates

db.atpplayers.updateMany({Country: "U.A.E."}, {$set: {Country: "United Arab Emirates"}})

Ukraine

db.atpplayers.updateMany({Country: "Novaya Kakhovka"}, {$set: {Country: "Ukraine"}})

Yugoslavia

db.atpplayers.updateMany({Country: "Yug."},{$set: {Country: "Yugoslavia"}})

Zimbabwe

db.atpplayers.updateMany({Country: "Harare"}, {$set: {Country: "Zimbabwe"}})

joaoftrodrigues commented 1 year ago

Values as 2 letter codes

Some countries are inputted as its 2 letter nomenclature, so it was matched those values with the countries database, to substitute.

Queries

Match code values with countries' database entries db.atpplayers.aggregate([{ "$lookup": { from: "countries", localField: "Country", foreignField: "Code", as: "CountryMatch" }}, {$out: {db: "atp", coll:"atpplayers"}}])

Extract first element of array db.atpplayers.updateMany({CountryMatch: {$size: 1}}, [{$set: {Country: {$first: "$CountryMatch"}}}])

Associate respective country db.atpplayers.updateMany({CountryMatch: {$size: 1}}, [{$set: {Country: "$Country.Name"}}])

Delete variable CountryMatch db.atpplayers.updateMany({},{$unset: {CountryMatch: ""}})

Countries affected

Running a distinct of countries on the match, these were the ones found:

2d_code_matches_results

Values changed

Based on previous image, the follow names were changed:

joaoftrodrigues commented 1 year ago

Values as 3 letter code

Same case as previous, but for 3 letters code.

Queries

Match code values with countries' database entries db.atpplayers.aggregate([{ "$lookup": { from: "countryCodes3L", localField: "Country", foreignField: "Alpha-3", as: "CountryMatch" }}, {$out: {db: "atp", coll:"atpplayers"}}])

Extract first element of array db.atpplayers.updateMany({CountryMatch: {$size: 1}}, [{$set: {Country: {$first: "$CountryMatch"}}}])

Associate respective country db.atpplayers.updateMany({CountryMatch: {$size: 1}}, [{$set: {Country: "$Country.CountryName"}}])

Countries affected

Running a distinct of countries on the match, these were the ones found:

image

Values changed

joaoftrodrigues commented 1 year ago

Laos is not on countries csv

joaoftrodrigues commented 1 year ago

TBA Treatment

Locations with TBA as location, has follow tournaments:

Imagem WhatsApp 2022-12-13 às 16 20 20

So the countries will be extracted from the tournament name.

Queries

db.atpplayers.updateMany({Country:"TBA", Tournament: {$regex: " "}}, {$set: {Country: " "}})

Botswana db.atpplayers.updateMany({Country:"TBA", Tournament: {$regex: "Botswana"}}, {$set: {Country: "Botswana"}})

Egypt db.atpplayers.updateMany({Country:"TBA", Tournament: {$regex: "Egypt"}}, {$set: {Country: "Egypt"}})

France db.atpplayers.updateMany({Country:"TBA", Tournament: {$regex: "France"}}, {$set: {Country: "France"}})

India db.atpplayers.updateMany({Country:"TBA", Tournament: {$regex: "India"}}, {$set: {Country: "India"}})

Italy db.atpplayers.updateMany({Country:"TBA", Tournament: {$regex: "Italy"}}, {$set: {Country: "Italy"}})

United Kingdom db.atpplayers.updateMany({Country:"TBA", Tournament: {$regex: "Great Britain"}}, {$set: {Country: "United Kingdom"}})

United States db.atpplayers.updateMany({Country:"TBA", Tournament: {$regex: "U.S.A."}}, {$set: {Country: "United States"}})

Zimbabwe db.atpplayers.updateMany({Country:"TBA", Tournament: {$regex: "Zimbabwe"}}, {$set: {Country: "Zimbabwe"}})

joaoftrodrigues commented 1 year ago

TBC

With location as "TBC" were found follow tournaments:

image

Countries will be extracted as previous case, in this particular, is just one, so will be a direct attribution.

db.atpplayers.updateMany({Country: "TBC"}, {$set: {Country: "United Kingdom"}})

TBD

Tournaments with TBD location:

image

Using previous methodology: db.atpplayers.updateMany({Country: "TBD"}, {$set: {Country: "United States"}})