Open theiostream opened 4 years ago
Hi Daniel. Thank you for this issue and all the detailed information. There is a lot of food for thought here and I will need some time to digest it, given all the other urgent deadlines I'm currently facing in other projects. Having said that, I believe one of the best ways to make all these data publicly available in the short term might be via the basesdedados.org website.
I just came across this today. https://twitter.com/cepesp/status/1298361122102349825?s=20
I "skimmed" their video talk and I'm still not exactly sure how they get the data for the locais de votação. However, it seems that they only use TSE data, instead of trying to cross it with other databases like I do. The TSE database, however, only has c. 20% of seções eleitorais geocoded.
I might be wrong about this, though.
An update on this – I had a chat with the Cepesp people involved in this project, and they indeed appear to be using the TSE data as a basis for their map. They "complete" the missing data in state capitals using the Google Maps API. In the state capitals, they also appear to have done a lot of manual validation to ensure that the data is accurate.
That said, we'll try to work together to improve this dataset – especially on the validation side.
This is great news! Please keep me posted!
Hey there! I'd just like to share our website since we're up – https://pindograma.com.br/ :).
Hey @theiostream, excellent work! I was wondering about two things:
Thanks!
tl;dr: I want to contribute a dataset that uses official datasets (and Google Maps) to geocode ~95% of all polling stations (seções eleitorais) in Brazil. A CSV of the data can be found here: https://drive.google.com/file/d/1Z4HXG3fF-uNJQxCa_jZt4uroAv9uEL0r/view. [Note: as of now this data is CC-BY 4.0 licensed.]
Here's a demo of it in action in Rio (ignore the `bolsonaro_p` label):

I organized this dataset as part of Pindograma, a soon-to-be-launched data journalism website focused on Brazilian politics. The dataset has been ready in its current state since mid-June (I mentioned it to @JoaoCarabetta at the time in a brief WhatsApp exchange), but I kept it private since I planned to release it along with the website. But it's August already; elections are coming soon; and I figured it would be helpful to the world at large to try to contribute this to `geobr` at this point.

Does this fit into `geobr`?

To start with, it's possible that `geobr` might simply not want this. Even though it is an "official" dataset, this is far from being something that can be "directly" added into the library; and it might generate complications the project might not be willing to face.

That said, I would like to make this data as accessible as possible to people, so in case you don't want it, I'd appreciate suggestions on how to distribute it (aside from writing some news stories, which we're obviously working on).
Here's how it works:
Methodology
As of now, the code that generated the dataset is not in a very reproducible state – file paths all point to my computer; dependencies need to be installed manually; and the R scripts need to be run on a machine with 32 GB of RAM in order to work. But the code can be found here: https://github.com/pindograma/mapa. (If everything works out, this will all be in a nice Docker container a month from now.)
Anyway, if you run `create_datasets.R` followed by `match_geocoding.R`, it'll basically do the following tasks:

- Get locations that were already geocoded by the TSE for 2018 and 2020 (`tse_lat`/`tse_lon` in the CSV);
- Get locations geocoded by the TSE in 2020 and "redistribute" them to seções with the same address and name in previous years (`comp_tse_lat` and `comp_tse_lon` in the CSV);
- Normalize school names and merge with the Education Ministry's schools dataset, which contains lat/lon information for ~70% of schools in the country (`inep_lat`/`inep_lon` in the CSV; see the name-matching sketch after this list);
- Also merge with datasets from some state and municipal Education Departments, which contain similar information (`local_lat`/`local_lon` in the CSV);
- Merge, both by placename and by address, with the IBGE's CNEFE – both in its 2010 version and its 2017 version from the Censo Agropecuário. Its lat/lon info is not very good, so I'm limiting myself to geocoding at the Census tract level (`pl_*`, `ad_*`, `rural_*` columns in the CSV);
- For the seções where either zero or only one of the above methods succeeded, it'll create a series of files called `EXPORT_GOOGLE_ADDR_*.csv`. These files, if passed to the `geocode.py` script, will geocode these addresses with Google Maps. If the Maps API returns a result with either `ROOFTOP` or `RANGE_INTERPOLATED` precision, it gets added to `google_lat`/`google_lon` in the CSV;
- If nothing else worked, we do a similar procedure and send the missing addresses to the Google Places API. This API, however, has horrible accuracy, so we had to manually go through every one of its results and discard the obviously bad matches. The lat/lon pairs that survived this process are placed in `places_lat`/`places_lon` in the CSV;
- If even that didn't work, we try some approximations:
  - results with `GEOMETRIC_CENTER` precision on the Google Maps API, placed in `google_approx_lon`/`google_approx_lat`;
  - `approx_ad_*`;
  - `ibge_approx_lat`/`ibge_approx_lon`.
.Good Stuff
None of these data sources are perfect, but what's nice about this dataset is that you get all of them! So you can pick from what works best in a given region, and you can use these multiple sources to verify the dataset's integrity.
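As a rough illustration of how a user could pick from the multiple sources, here is a sketch that builds a single "best available" coordinate per seção by preferring the official sources and falling back to the others. The file name is hypothetical, the priority order is just an example (not a recommendation baked into the dataset), and the census-tract columns (`pl_*`, `ad_*`, `rural_*`) are left out because they are tract-level rather than point-level.

```r
library(dplyr)

secoes <- read.csv("geocoded_secoes.csv")  # hypothetical name for the CSV linked above

# Example priority: TSE first, then INEP / local education departments,
# then Google Maps, then the approximate fallbacks.
secoes_best <- secoes %>%
  mutate(
    best_lat = coalesce(tse_lat, comp_tse_lat, inep_lat, local_lat,
                        google_lat, places_lat, google_approx_lat, ibge_approx_lat),
    best_lon = coalesce(tse_lon, comp_tse_lon, inep_lon, local_lon,
                        google_lon, places_lon, google_approx_lon, ibge_approx_lon)
  )
```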
This is also nice because `geobr` can drop some data sources in case it doesn't want to incorporate them into the project because they're not "official" (say, Google Maps).

Limitations
Merging the datasets as mentioned above necessarily involves some fuzzy matching. So a small number of rows will be wrongly geocoded. The super red spot at Jardim Oceânico in the picture above is likely one of them. (Hopefully, this problem should be minimized by the fact that there are multiple data sources.)
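One cheap way to exploit that redundancy for validation (and to catch wrong matches like the Jardim Oceânico point) is to flag rows where two independent sources disagree by more than some threshold. A sketch, assuming the `geosphere` package, a hypothetical file name, and an arbitrary 2 km cutoff:

```r
library(dplyr)
library(geosphere)

secoes <- read.csv("geocoded_secoes.csv")  # hypothetical name for the CSV linked above

# Distance in meters between the TSE and INEP coordinates, where both exist.
flagged <- secoes %>%
  filter(!is.na(tse_lat), !is.na(inep_lat)) %>%
  mutate(
    dist_m = distHaversine(cbind(tse_lon, tse_lat), cbind(inep_lon, inep_lat))
  ) %>%
  filter(dist_m > 2000)  # arbitrary threshold: sources more than 2 km apart

nrow(flagged)  # candidates for manual review
```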
Further, the data still needs to go through some validity checks I haven't had the time to run thoroughly. For example:
There's probably more stuff that I haven't even thought about, and I'd appreciate any ideas you might have. Some direct help actually validating the dataset would also be appreciated.