ipeaGIT / geobr

Easy access to official spatial data sets of Brazil in R and Python
https://ipeagit.github.io/geobr/

pull request: polling stations #184

Open theiostream opened 4 years ago

theiostream commented 4 years ago

tl;dr: I want to contribute a dataset that uses official datasets (and Google Maps) to geocode ~95% of all polling stations (seções eleitorais) in Brazil. A CSV of the data can be found here: https://drive.google.com/file/d/1Z4HXG3fF-uNJQxCa_jZt4uroAv9uEL0r/view. [Note: as of now this data is CC-BY 4.0 licensed.]

Here's a demo of it in action in Rio (ignore the bolsonaro_p label):

[Image: screenshot, 2020-06-14 05:07:55]

I organized this dataset as part of Pindograma, a soon-to-be-launched data journalism website focused on Brazilian politics. The dataset has been ready in its current state since mid-June (I mentioned it to @JoaoCarabetta at the time in a brief WhatsApp exchange), but I kept it private since I planned to release it along with the website. But it's August already; elections are coming soon; and I figured it would be helpful to the world at large to try to contribute this to geobr at this point.

Does this fit into geobr?

To start with, geobr might simply not want this. Even though it is an "official" dataset, it is far from something that can be added "directly" to the library, and it might create complications the project is not willing to face.

That said, I would like to make this data as accessible as possible to people, so in case you don't want it, I'd appreciate suggestions on how to distribute it (aside from writing some news stories, which we're obviously working on).

Here's how it works:

Methodology

As of now, the code that generated the dataset is not in a very reproducible state – file paths all point to my computer; dependencies need to be installed manually; and the R scripts need to be run on a machine with 32 GB of RAM in order to work. But the code can be found here: https://github.com/pindograma/mapa. (If everything works out, this will all be in a nice Docker container a month from now.)

Anyway, if you run create_datasets.R followed by match_geocoding.R, it'll basically do the following tasks:

Good Stuff

None of these data sources are perfect, but what's nice about this dataset is that you get all of them! So you can pick from what works best in a given region, and you can use these multiple sources to verify the dataset's integrity.

This is also nice because geobr can drop some data sources in case it doesn't want to incorporate them into the project because they're not "official" (say, Google Maps).
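To make the "pick what works best" and "drop a source" ideas concrete, here is a minimal Python sketch. It is not the code from the pindograma/mapa repository (which is written in R). The source names in `PRIORITY`, the function names, and the 0.5 km agreement threshold are all assumptions made for illustration.

```python
import math

# Hypothetical source names and ordering; the real dataset's columns may differ.
PRIORITY = ["tse", "other_official", "google"]


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pick_coordinate(candidates, priority=PRIORITY):
    """Return (source, (lat, lon)) from the first source in `priority`
    that has a coordinate, or None if no listed source has one."""
    for source in priority:
        coord = candidates.get(source)
        if coord is not None:
            return source, coord
    return None


def sources_agree(candidates, max_km=0.5):
    """True if every pair of available coordinates lies within `max_km`.
    With zero or one available coordinate there is nothing to compare,
    so the result is trivially True."""
    coords = [c for c in candidates.values() if c is not None]
    return all(haversine_km(a, b) <= max_km
               for i, a in enumerate(coords) for b in coords[i + 1:])
```

Dropping a source then amounts to leaving it out of the priority list, e.g. `pick_coordinate(row, priority=["tse", "other_official"])`, and `sources_agree` gives a rough integrity flag for rows where sources disagree.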

Limitations

Merging the datasets as mentioned above necessarily involves some fuzzy matching. So a small number of rows will be wrongly geocoded. The super red spot at Jardim Oceânico in the picture above is likely one of them. (Hopefully, this problem should be minimized by the fact that there are multiple data sources.)
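For readers unfamiliar with why fuzzy matching produces some wrong rows, here is a small Python sketch of the general technique. It is only an illustration using the standard library's `difflib`; the matching method actually used in the R scripts is not described in this thread, and the function names and the 0.85 threshold are assumptions.

```python
import difflib
import unicodedata


def normalize(text):
    """Strip accents, uppercase, and collapse whitespace before comparing."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.upper().split())


def best_match(name, candidates, threshold=0.85):
    """Return (candidate, score) for the most similar candidate, or None
    if no candidate reaches `threshold` (an assumed cutoff)."""
    target = normalize(name)
    best, best_score = None, 0.0
    for cand in candidates:
        score = difflib.SequenceMatcher(None, target, normalize(cand)).ratio()
        if score > best_score:
            best, best_score = cand, score
    return (best, best_score) if best_score >= threshold else None
```

Any threshold below 1.0 accepts some near-misses, which is where wrongly geocoded rows can come from: two different places with similar names or addresses may score above the cutoff.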

Further, the data still needs to go through some validity checks I haven't had the time to run thoroughly. For example:

There's probably more stuff that I haven't even thought about, and I'd appreciate any ideas you might have. Some direct help actually validating the dataset would also be appreciated.

rafapereirabr commented 4 years ago

Hi Daniel. Thank you for this issue and all the detailed information. There is a lot of food for thought here, and I will need some time to digest it given all the urgent deadlines I'm currently facing in other projects. Having said that, I believe one of the best ways to make all this data publicly available in the short term might be via the basesdedados.org website.

rafapereirabr commented 4 years ago

I just came across this today. https://twitter.com/cepesp/status/1298361122102349825?s=20

theiostream commented 4 years ago

I "skimmed" their video talk and I'm still not exactly sure how they get the data for the locais de votação. However, it seems that they only use TSE data, instead of trying to cross it with other databases like I do. The TSE database, however, only has c. 20% of seções eleitorais geocoded.

I might be wrong about this, though.

theiostream commented 4 years ago

An update on this – I had a chat with the Cepesp people involved in this project, and they indeed appear to be using the TSE data as a basis for their map. They "complete" the missing data in state capitals using the Google Maps API. In the state capitals, they also appear to have done a lot of manual validation to ensure that the data is accurate.

That said, we'll try to work together to improve this dataset – especially on the validation side.
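The completion approach described above (TSE coordinates as the basis, a geocoder for missing rows in state capitals, then manual validation) could be sketched roughly as follows. This is not Cepesp's code. The field names (`lat`, `lon`, `address`, `municipality`) are hypothetical, and `geocode` and `is_capital` are placeholder callables supplied by the caller rather than real API calls.

```python
def complete_missing(stations, geocode, is_capital):
    """Fill in coordinates for rows that lack them, recording provenance.

    stations:   list of dicts with hypothetical keys 'lat', 'lon',
                'address', and 'municipality'.
    geocode:    caller-supplied function, address -> (lat, lon) or None.
    is_capital: caller-supplied function, municipality -> bool.
    """
    completed = []
    for station in stations:
        row = dict(station)  # copy so the input list is left untouched
        row["source"] = None
        row["needs_review"] = False
        if row.get("lat") is not None and row.get("lon") is not None:
            row["source"] = "tse"
        elif is_capital(row["municipality"]):
            result = geocode(row["address"])
            if result is not None:
                row["lat"], row["lon"] = result
                row["source"] = "geocoder"
                row["needs_review"] = True  # queue for manual validation
        completed.append(row)
    return completed
```

Keeping a `source` field per row makes it possible to validate or discard geocoder-derived coordinates later without touching the rows that came directly from the base data.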

rafapereirabr commented 4 years ago

This is great news! Please keep me posted!

theiostream commented 4 years ago

Hey there! I'd just like to share our website since we're up – https://pindograma.com.br/ :).

hsxavier commented 2 years ago

Hey @theiostream, excellent work! I was wondering about two things:

Thanks!