Open theiostream opened 4 years ago
Hi Daniel. Thank you for this issue and all the detailed information. There is a lot of food for thought here and I will need some time to digest it, given all the other urgent deadlines I'm currently facing in other projects. Having said that, I believe one of the best ways to make all these data publicly available in the short term might be via the basesdedados.org website.
I just came across this today. https://twitter.com/cepesp/status/1298361122102349825?s=20
I "skimmed" their video talk and I'm still not exactly sure how they get the data for the locais de votação. However, it seems that they only use TSE data, instead of trying to cross it with other databases like I do. The TSE database, however, only has c. 20% of seções eleitorais geocoded.
I might be wrong about this, though.
An update on this – I had a chat with the Cepesp people involved in this project, and they indeed appear to be using the TSE data as a basis for their map. They "complete" the missing data in state capitals using the Google Maps API. In the state capitals, they also appear to have done a lot of manual validation to ensure that the data is accurate.
That said, we'll try to work together to improve this dataset – especially on the validation side.
This is great news! Please keep me posted!
Hey there! I'd just like to share our website since we're up – https://pindograma.com.br/ :).
Hey @theiostream, excellent work! I was wondering about two things:
Thanks!
tl;dr: I want to contribute a dataset that uses official datasets (and Google Maps) to geocode ~95% of all polling stations (seções eleitorais) in Brazil. A CSV of the data can be found here: https://drive.google.com/file/d/1Z4HXG3fF-uNJQxCa_jZt4uroAv9uEL0r/view. [Note: as of now this data is CC-BY 4.0 licensed.]
Here's a demo of it in action in Rio (ignore the `bolsonaro_p` label):

I organized this dataset as part of Pindograma, a soon-to-be-launched data journalism website focused on Brazilian politics. The dataset has been ready in its current state since mid-June (I mentioned it to @JoaoCarabetta at the time in a brief WhatsApp exchange), but I kept it private since I planned to release it along with the website. But it's August already; elections are coming soon; and I figured it would be helpful to the world at large to try to contribute this to `geobr` at this point.

Does this fit into `geobr`?

To start with, it's possible that `geobr` might simply not want this. Even though it is an "official" dataset, this is far from being something that can be "directly" added into the library; and it might generate complications the project might not be willing to face.

That said, I would like to make this data as accessible as possible to people, so in case you don't want it, I'd appreciate suggestions on how to distribute it (aside from writing some news stories, which we're obviously working on).
Here's how it works:
Methodology
As of now, the code that generated the dataset is not in a very reproducible state – file paths all point to my computer; dependencies need to be installed manually; and the R scripts need to be run on a machine with 32 GB of RAM in order to work. But the code can be found here: https://github.com/pindograma/mapa. (If everything works out, this will all be in a nice Docker container a month from now.)
Anyway, if you run `create_datasets.R` followed by `match_geocoding.R`, it'll basically do the following tasks:

- Get locations that were already geocoded by the TSE for 2018 and 2020 (`tse_lat`/`tse_lon` in the CSV);
- Get locations geocoded by the TSE in 2020 and "redistribute" them to seções with the same address and name in previous years (`comp_tse_lat` and `comp_tse_lon` in the CSV);
- Normalize school names and merge with the Education Ministry's schools dataset, which contains lat/lon information for ~70% of schools in the country (`inep_lat`/`inep_lon` in the CSV; see the name-matching sketch after this list);
- Also merge with datasets from some state and municipal Education Departments, which contain similar information (`local_lat`/`local_lon` in the CSV);
- Merge, both by placename and by address, with the IBGE's CNEFE – both in its 2010 version and its 2017 version from the Censo Agropecuário. Its lat/lon info is not very good, so I'm limiting myself to geocoding at the Census tract level (`pl_*`, `ad_*`, `rural_*` columns in the CSV);
- For the seções where either zero or only one of the above methods succeeded, it'll create a series of files called `EXPORT_GOOGLE_ADDR_*.csv`. These files, if passed to the `geocode.py` script, will geocode these addresses with Google Maps. If the Maps API returns a result with either `ROOFTOP` or `RANGE_INTERPOLATED` precision, it gets added to `google_lat`/`google_lon` in the CSV;
- If nothing else worked, we do a similar procedure and send the missing addresses to the Google Places API. This API, however, has horrible accuracy, so we had to manually go through every one of its results and discard the obviously bad matches. The lat/lon pairs that survived this process are placed in `places_lat`/`places_lon` in the CSV;
- If even that didn't work, we try some approximations:
  - results with `GEOMETRIC_CENTER` precision on the Google Maps API, placed in `google_approx_lon`/`google_approx_lat`;
  - `approx_ad_*`;
  - `ibge_approx_lat`/`ibge_approx_lon`.
.Good Stuff
None of these data sources are perfect, but what's nice about this dataset is that you get all of them! So you can pick from what works best in a given region, and you can use these multiple sources to verify the dataset's integrity.
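As a rough illustration of how a user could pick from the multiple sources, here is a sketch that builds a single "best available" coordinate per seção by preferring the official sources and falling back to the others. The file name is hypothetical, the priority order is just an example (not a recommendation baked into the dataset), and the census-tract columns (`pl_*`, `ad_*`, `rural_*`) are left out because they are tract-level rather than point-level.

```r
library(dplyr)

secoes <- read.csv("geocoded_secoes.csv")  # hypothetical name for the CSV linked above

# Example priority: TSE first, then INEP / local education departments,
# then Google Maps, then the approximate fallbacks.
secoes_best <- secoes %>%
  mutate(
    best_lat = coalesce(tse_lat, comp_tse_lat, inep_lat, local_lat,
                        google_lat, places_lat, google_approx_lat, ibge_approx_lat),
    best_lon = coalesce(tse_lon, comp_tse_lon, inep_lon, local_lon,
                        google_lon, places_lon, google_approx_lon, ibge_approx_lon)
  )
```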
This is also nice because `geobr` can drop some data sources in case it doesn't want to incorporate them into the project because they're not "official" (say, Google Maps).

Limitations
Merging the datasets as mentioned above necessarily involves some fuzzy matching. So a small number of rows will be wrongly geocoded. The super red spot at Jardim Oceânico in the picture above is likely one of them. (Hopefully, this problem should be minimized by the fact that there are multiple data sources.)
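One cheap way to exploit that redundancy for validation (and to catch wrong matches like the Jardim Oceânico point) is to flag rows where two independent sources disagree by more than some threshold. A sketch, assuming the `geosphere` package, a hypothetical file name, and an arbitrary 2 km cutoff:

```r
library(dplyr)
library(geosphere)

secoes <- read.csv("geocoded_secoes.csv")  # hypothetical name for the CSV linked above

# Distance in meters between the TSE and INEP coordinates, where both exist.
flagged <- secoes %>%
  filter(!is.na(tse_lat), !is.na(inep_lat)) %>%
  mutate(
    dist_m = distHaversine(cbind(tse_lon, tse_lat), cbind(inep_lon, inep_lat))
  ) %>%
  filter(dist_m > 2000)  # arbitrary threshold: sources more than 2 km apart

nrow(flagged)  # candidates for manual review
```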
Further, the data still needs to go through some validity checks I haven't had the time to run thoroughly. For example:
There's probably more stuff that I haven't even thought about, and I'd appreciate any ideas you might have. Some direct help actually validating the dataset would also be appreciated.