joaoftrodrigues / atpdb-cleanup-and-conversion-to-relational

0 stars 0 forks source link

Extract Country #6

Closed joaoftrodrigues closed 1 year ago

joaoftrodrigues commented 1 year ago

Description

Must extract country from Location. After that, the extractions made on location, will be adapted to use on Born.

Regular Expressions

So far, the best expression found, was:

db.atpplayers.aggregate([{ $addFields: {Country: { $regexFind: { input: "$Location", regex: "([a-zA-Z ]*,[a-zA-Z ]*,([ a-zA-Z.]*))|([\w ]*,([ a-zA-Z.]*))" }}}}])

Results

City + Country

image

City + State + Country

image

Was thought to have a problem, but it's possible to reach 3rd position on captures array.

Some considerations

Follow rules would be preferable, instead the ones being used:

But the results didn't reach what was intended, even though, by logic it should. That way, we have more matches than the ones needed.

joaoftrodrigues commented 1 year ago

Queries

Match regular expression in location and store it on db

db.atpplayers.aggregate([{ $addFields: {Country: { $regexFind: { input: "$Location", regex: "([ ]*[a-zA-Z ]*,[ ]?[a-zA-Z ]*,[ ]*([ a-zA-Z.]*))|([\w ]*,[ ]*([ a-zA-Z.]*))" }}}}])

Change countries' values to matched words

Slice makes possible to access an element based on its index, but it returns an array, so $first must be used to get only the value. db.atpplayers.updateMany({}, [{$set: {Country: {$first: {$slice: ["$Country.captures",-1,1]}}}}])

joaoftrodrigues commented 1 year ago

Correct Queries

Previous queries were having country fields with null, this is a fix to the problem.

Match regular expression in location and store it on db

db.atpplayers.aggregate([{ $addFields: {Country: { $regexFind: { input: "$Location", regex: "([ ]*[a-zA-Z ]*,[ ]?[a-zA-Z ]*,[ ]*([ a-zA-Z.]*))|([\w ]*,[ ]*([ a-zA-Z.]*))" }}}}, {$out: {db: "atp", coll: "atpplayers"}}])

Remove nulls from array

This must be done, as fields with state will have 2 nulls on different indexes, from fields without state. This way, the index will be the same for both cases.

db.atpplayers.updateMany({}, [{$set: {Country: "$Country.captures"}}])

db.atpplayers.updateMany({}, {$pull: {Country: null}})

Change countries' values to matched words

Slice makes possible to access an element based on its index, but it returns an array, so $first must be used to get only the value. db.atpplayers.updateMany({}, [{$set: {Country: {$first: {$slice: ["$Country",-1,1]}}}}])