joaoftrodrigues / atpdb-cleanup-and-conversion-to-relational

0 stars 0 forks source link

Validate Born Countries #18

Open joaoftrodrigues opened 1 year ago

joaoftrodrigues commented 1 year ago

Description

Countries are treated separately, from Location and Born.

First Cleaning Step

After treating countries on Location, the same queries were used on Countries of Born, hoping they would have similar values. This effort wasn't enough as it was left 563 unique values that were not valid countries.

Imagem WhatsApp 2022-12-13 às 19 52 10

Second Cleaning Step

As some values are city names, it's going to be used the city dataset to match those, substitute the value to respective country and associate old value as the city.

joaoftrodrigues commented 1 year ago

Second Cleaning Step -> City Match

Process

Collection matchesVerification is the one that contains information about the Country field being well spelled or actually being a country. This is perceived by the field CountryMatch which, when there is a correspondence, it contains a value on an array, otherwise is empty.

Match uncorrect values

By taking the distinct values of country, where there's no match, it's obtained the wrong ones. By running follow query, we get those mentioned values, and store it on a new collection, using redundancy on "_id", because, to do the correspodence with cities, it's needed a collection which can't be reached with only 1 field.

db.matchesVerification.aggregate([{$match: {CountryMatch: {$size: 0}}}, {$group: {"_id": "$BornCountry", "UniqueCountry": {$addToSet: "$BornCountry"}}}, {$unwind: "$UniqueCountry"}, {$out: {db: "atp", coll: "countriesMissing"}}])

Now, having the values gathered on a collection, it's possible to do a lookup to match cities.

db.countriesMissing.aggregate([{$lookup: { from: "cities", localField: "UniqueCountry", foreignField: "name", as: "CityMatch" }}, {$out: {db: "atp", coll:"cityMatches"}}])

After match cities with country values, it was found 230 occurrences where the amount of correspondences were one. The ones that got more than one, means they have more correspondences, so it's not certain, the real country value.

image

Isolate values Now that correspondences were made, it's time to isolate them to match in the original database.

db.cityMatches.aggregate([{$match: {CityMatch: {$size:1}}}, {$group:{"_id": "$UniqueCountry", "Country": {$addToSet:{$first:"$CityMatch.country_name"}}}}, {$out: {db: "atp", coll: "cityToCountry"}}])

With that operation, the new collection holds Countries in an array of single value. To simpler use of it, it was associated directly the value , using $first. It wasn't possible to use it on aggregate, otherwise would be simpler.

db.cityToCountry.updateMany({}, [{$set: {Country: {$first:"$Country"}}}])

Check correspondences on all players' born location db.atpplayers.aggregate([{$lookup: {from: "cityToCountry", localField: "BornCountry", foreignField: "_id", as: "CountryMatch"}}, {$out: {db: "atp", coll: "atpplayers"}}])

Associate respectibe country db.atpplayers.updateMany({CountryMatch: {$size: 1}}, [{$set: {BornCountry: {$first: "$CountryMatch.Country"}}}])

Remove auxiliar field db.atpplayers.updateMany({}, {$unset: {CountryMatch: ""}})

joaoftrodrigues commented 1 year ago

Third Step -> State Match

db.countriesMissing.aggregate([{$lookup: { from: "states", localField: "UniqueCountry", foreignField: "name", as: "StateMatch" }}, {$out: {db: "atp", coll:"stateMatches"}}])