safe-refuge / safeway-data

Data mining tools for the Safeway app

Enhance Mapahelp spider #47

Open littlepea opened 2 years ago

littlepea commented 2 years ago

We need to map the new categories:

And some other feedback about Mapahelp issues:

Duplicated rows: [454, 1406, 1420, 2059]

Rows with errors: [46, 76, 161, 247, 251, 464, 756, 891, 913, 921, 951, 1020, 1022, 1024, 1026, 1027, 1411, 1422, 1547, 1599, 1643, 1825, 2034, 2051, 2239, 2318]

moorchegue commented 2 years ago

Could you share the file where these rows are referenced? I think the listings have changed, and I'm probably looking at different records… I can see that there are a lot of non-unique names (e.g. Укриття, Помощь, Housing, and perhaps some organization locations without unique names, like Evex), but when I compare coordinates, all of them seem to be unique.

It also looks like the new categories don't have any matches here yet.
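
Roughly how I did the comparison, for reference (just a sketch; the file name is a placeholder and the name/lat/lng column names are guesses about the layout):

```python
# Sketch: compare duplication by name vs. by coordinates.
# Assumptions: "listings.csv" stands in for whatever export I'm looking at,
# and the columns are named "name", "lat", "lng" (actual headers may differ).
import pandas as pd

df = pd.read_csv("listings.csv")

# Many records share a name (Укриття, Помощь, Housing, ...)
name_dupes = df[df.duplicated(subset=["name"], keep=False)]

# ...but the coordinates appear to be unique per record.
coord_dupes = df[df.duplicated(subset=["lat", "lng"], keep=False)]

print(f"rows sharing a name: {len(name_dupes)}")
print(f"rows sharing coordinates: {len(coord_dupes)}")
```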

littlepea commented 2 years ago

You can see the CSV attached here: https://trello.com/c/s3kYq3Ni/47-web-scraping-for-mapahelpme
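
In case it helps, here's a rough sketch for pulling the flagged rows out of that CSV (the file name mapahelp.csv and the 1-based row numbering are assumptions, adjust to the actual export):

```python
# Sketch: inspect the rows flagged above.
# Assumptions: the export is saved as mapahelp.csv and the row numbers
# are 1-based data rows (header excluded), so row N maps to index N - 1.
import pandas as pd

DUPLICATED_ROWS = [454, 1406, 1420, 2059]
ERROR_ROWS = [46, 76, 161, 247, 251, 464, 756, 891, 913, 921, 951,
              1020, 1022, 1024, 1026, 1027, 1411, 1422, 1547, 1599,
              1643, 1825, 2034, 2051, 2239, 2318]

df = pd.read_csv("mapahelp.csv")

print(df.iloc[[n - 1 for n in DUPLICATED_ROWS]])
print(df.iloc[[n - 1 for n in ERROR_ROWS]])
```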

moorchegue commented 2 years ago

So I went through all of the lines above. The "duplicates" tend to be one of two things: genuinely different records that happen to share a name (sometimes an organization's name, sometimes a person's) but have a different address, location, etc.; or actual duplicates, likely created by the same person either by mistake, to provide the info in multiple languages, or in an attempt to correct an earlier mistake, since the system doesn't allow editing. For the latter it would be hard to decide in general which record should take precedence. The best way to deal with them would be manual curation, with decisions made case by case.

The errors mostly fall into the category of the address containing only the city name. That's how users entered it, perhaps deliberately to protect their privacy, perhaps for other reasons, perhaps by mistake. We could clear the address out when it's just the city name (since we already store the city separately), or keep it as it is. Also, after a certain point (roughly row 1000) I think the line numbers are no longer referenced correctly, because I haven't noticed anything wrong with those records.
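
If we go with clearing those addresses, something along these lines could do it (again just a sketch; the "address" and "city" column names are assumed, not confirmed against the actual CSV):

```python
# Sketch: blank out addresses that contain nothing but the city name.
# Assumed columns: "address" and "city" (actual headers may differ).
import pandas as pd

df = pd.read_csv("mapahelp.csv")

address = df["address"].fillna("").str.strip().str.casefold()
city = df["city"].fillna("").str.strip().str.casefold()

# An address is "city-only" when it is non-empty and equals the city field.
city_only = (address != "") & (address == city)
df.loc[city_only, "address"] = ""

print(f"cleared {city_only.sum()} city-only addresses")
```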

Anyway, let me know what you think should be done here.

littlepea commented 2 years ago

I think it's OK; we can handle the duplicates in the next step: #45