ipeaGIT / geobr

Easy access to official spatial data sets of Brazil in R and Python
https://ipeagit.github.io/geobr/

Some Schools have wrong muni_name #280

Closed GoulartNogueira closed 2 years ago

GoulartNogueira commented 2 years ago

Trying to get the code_muni for each school, I've found some problems:

# Load
import geobr

cidades = geobr.read_municipality(year=2020)
escolas = geobr.read_schools(year=2020)

# Merge
escolas['code_muni'] = escolas.merge(cidades, how='left', left_on=['abbrev_state', 'name_muni'], right_on=['abbrev_state', 'name_muni'])['code_muni']

# Check
print(f'{escolas.code_muni.isna().sum()/len(escolas)*100:.2f}% of schools have no municipality')
#>> 19.87% of schools have no municipality
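
A minimal sketch of a more defensive version of this merge, using toy tables in place of the real geobr downloads (the school names and municipality codes below are made up for illustration). Keeping the merged frame, rather than assigning one of its columns back, avoids relying on index alignment; `validate` raises an error if a state/name pair is duplicated on the municipality side, and `indicator` marks the rows that found no match.

```python
import pandas as pd

# Toy stand-ins for the two geobr tables; codes and school names are made up.
cidades = pd.DataFrame({
    "abbrev_state": ["BA", "SC"],
    "name_muni": ["Santa Terezinha", "Grão-Pará"],
    "code_muni": [1001, 1002],
})
escolas = pd.DataFrame({
    "name_school": ["School 1", "School 2", "School 3"],
    "abbrev_state": ["BA", "SC", "SC"],
    "name_muni": ["Santa Teresinha", "Grão-Pará", "Grão-Pará"],
})

keys = ["abbrev_state", "name_muni"]

# Left merge keeps every school. validate="many_to_one" raises if a
# (state, name) pair appears more than once in cidades, which would
# otherwise silently duplicate schools.
merged = escolas.merge(
    cidades[keys + ["code_muni"]],
    how="left",
    on=keys,
    validate="many_to_one",
    indicator=True,
)

# Rows flagged "left_only" are schools with no matching municipality.
unmatched = merged[merged["_merge"] == "left_only"]
share = len(unmatched) / len(merged) * 100
print(f"{share:.2f}% of schools have no municipality")
```

Inspecting `unmatched[keys].value_counts()` then lists the names that failed to match, grouped by state.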

Finally, I suggest adding code_muni to each school, as it's a more trustworthy key than name_muni plus state.

GoulartNogueira commented 2 years ago

Going further, many of the non-matches are due to inconsistent letter case. If I lowercase both columns:

cidades['name_muni'] = cidades['name_muni'].str.lower().str.strip()
escolas['name_muni'] = escolas['name_muni'].str.lower().str.strip()

Then the share of unmatched schools drops to 0.06%.

Here is the full list of name_muni values in the schools dataset with no counterpart in the municipality dataset:

escolas[escolas.code_muni.isna()][['abbrev_state','name_muni']].value_counts()
abbrev_state  name_muni                count
BA            santa teresinha          39
ES            atílio vivacqua          16
RN            augusto severo           16
CE            ererê                    15
PB            quixabá                  15
MG            são thomé das letras     9
SC            grão pará                9
TO            fortaleza do tabocão     9
SE            amparo de são francisco  7
MG            dona eusébia             5
MG            pingo d'água             4

GoulartNogueira commented 2 years ago

Here are some matches I found manually:

abbrev_state  muni (school dataset)    muni (municipality dataset)
BA            Santa Teresinha          Santa Terezinha
ES            Atílio Vivacqua          Atílio Vivácqua
RN            Augusto Severo           -
CE            Ererê                    Ereré
PB            Quixabá                  Quixaba
MG            São Thomé Das Letras     São Tomé Das Letras
SC            Grão Pará                Grão-Pará
TO            Fortaleza Do Tabocão     -
SE            Amparo De São Francisco  Amparo Do São Francisco
MG            Dona Eusébia             Dona Euzébia
MG            Pingo D'Água             Pingo-D'Água

Most names differ by a single character, such as an accent, a hyphen, or an S swapped for a Z.
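
Since the differences are this small, one possible approach is to normalize the names (case, accents, hyphens) and then fall back to approximate string matching. The sketch below uses only the Python standard library. The helper names `normalize` and `closest_muni` and the 0.85 cutoff are illustrative assumptions, not part of geobr, and fuzzy matches should be reviewed by hand since they can produce false positives.

```python
import difflib
import unicodedata


def normalize(name: str) -> str:
    """Casefold, strip accents, and treat hyphens as spaces."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.replace("-", " ").split())


def closest_muni(school_muni: str, ibge_names: list, cutoff: float = 0.85):
    """Return the IBGE name closest to school_muni, or None if nothing is close.

    ibge_names should hold only the municipalities of the school's own
    state, so that similar names in other states cannot match.
    """
    lookup = {normalize(n): n for n in ibge_names}
    hits = difflib.get_close_matches(
        normalize(school_muni), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


# Hypothetical per-state candidate lists, using names from the table above.
mg_names = ["São Tomé Das Letras", "Dona Euzébia", "Pingo-D'Água"]
print(closest_muni("São Thomé Das Letras", mg_names))
print(closest_muni("Pingo D'Água", mg_names))
print(closest_muni("Augusto Severo", mg_names))
```

Names with no close counterpart, such as Augusto Severo in the table above, come back as `None` and still need a manual lookup.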

rafapereirabr commented 2 years ago

Hi @GoulartNogueira. Thank you for bringing this issue to our attention. The municipality data come from IBGE, while the schools data come from INEP. This means two things:

  1. The municipality names in IBGE are probably the correct ones.
  2. Given the incompatibility between municipality names in both data sets, doing a conventional merge is probably not a good idea. I would suggest using a spatial intersection in this case.
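
A minimal sketch of such a spatial intersection, assuming geopandas is installed and using toy geometries in place of the real geobr layers (the codes and coordinates are made up). With real data, both layers would need to share the same coordinate reference system before joining.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy municipalities: two adjacent squares with made-up codes.
munis = gpd.GeoDataFrame(
    {"code_muni": [1001, 1002], "name_muni": ["Muni A", "Muni B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

# Toy schools: two inside a municipality, one far outside both.
schools = gpd.GeoDataFrame(
    {"name_school": ["School 1", "School 2", "School 3"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
)

# Point-in-polygon join: each school receives the code of the
# municipality that contains it. Only code_muni is taken from the
# right side so that name columns do not clash.
joined = gpd.sjoin(
    schools,
    munis[["code_muni", "geometry"]],
    how="left",
    predicate="within",
)
print(joined[["name_school", "code_muni"]])
```

Schools that fall outside every polygon keep a missing `code_muni`, which makes them easy to list afterwards.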

GoulartNogueira commented 2 years ago

Thank you for the quick answer! I tried brute force checking all schools versus all cities, but:

It takes more than one hour (using the simplified geometry).

Also, some schools fall into 2 cities and others fall into none.

So I guess I could match on state + city name first and then double-check the result using the spatial data.
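
One way to separate the three cases (exactly one municipality, two, or none) is sketched below with geopandas and toy geometries; the codes and coordinates are made up. `sjoin` uses a spatial index rather than testing every school against every polygon, so it is usually much faster than a brute-force loop. The nearest-polygon fallback for unmatched schools is an assumption about what is acceptable here, and with real data it should be run in a projected CRS so that distances are meaningful.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy municipalities sharing the border x = 1 (codes are made up).
munis = gpd.GeoDataFrame(
    {"code_muni": [1001, 1002]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

# One school inside, one exactly on the shared border, one just outside.
schools = gpd.GeoDataFrame(
    {"name_school": ["Inside", "On border", "Outside"]},
    geometry=[Point(0.5, 0.5), Point(1.0, 0.5), Point(2.2, 0.5)],
)

# "intersects" yields one row per (school, municipality) hit, so a
# border point shows up twice and an outside point has no code at all.
hits = gpd.sjoin(schools, munis, how="left", predicate="intersects")
n_hits = hits.groupby(level=0)["code_muni"].count()

ambiguous = n_hits[n_hits > 1].index  # schools hitting 2+ municipalities
missing = n_hits[n_hits == 0].index   # schools hitting none

# Fallback for unmatched schools: attach the nearest municipality and
# keep the distance so that suspicious cases can be reviewed by hand.
nearest = gpd.sjoin_nearest(
    schools.loc[missing], munis, how="left", distance_col="dist"
)
print(list(ambiguous), list(missing))
print(nearest[["name_school", "code_muni", "dist"]])
```

The ambiguous schools are the natural candidates for the state + name check: where the name-based match agrees with one of the two spatial hits, that municipality can be kept.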

Anyway, once I have a reliable merge, can I somehow upload the new code_muni to the schools table, to help other users? I want to contribute to the community.

rafapereirabr commented 2 years ago

The fact that "some schools fall into 2 cities and others fall into none" is a geocoding problem in the original data set at INEP. I know they are continuously improving this geocoding, so hopefully this won't be an issue in the next update of the data. There are also a few complicated cases: the name_muni in the data set refers to the municipal government officially responsible for the school, but the school itself can sit very close to the border, where its recorded location is imprecise.