ipeaGIT / geobr

Easy access to official spatial data sets of Brazil in R and Python
https://ipeagit.github.io/geobr/
790 stars 118 forks source link

Problem with read_state when year = 2000 #205

Closed lcgodoy closed 3 years ago

lcgodoy commented 3 years ago

I tried to use this function to load the map of the Brazilian states for 2000 using the chunk of code below

my_dt <- geobr::read_state(code_state = "all", year = 2000)

and I ended up getting the following error message

#> Using year 2000
#> Loading data for the whole country
#>
#> |======================================================================| 100%
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#>  - 'VirtualXPath'    [XML Path Language - XPath]
#> Error in data.table::rbindlist(files, fill = TRUE) : 
#>  Class attribute on column 6 of item 23 does not match with column 6 of item 1.

After some debugging, I figured that the problem is in the function geobr:::load_gpkg. More specifically, the problem is when you use data.table::rbindlist to (row) bind the sf objects associated with each of the states. The problem is that we do not have a nice integration between the data.table and sf packages, resulting in a problem when (row) binding geometries of type MULTIPOLYGON and GEOMETRYCOLLECTION.

If you don't mind sacrificing the speed of the data.table::rbindlist function, this can be easily circumvented by using do.call along with rbind (since there exists a rbind.sf method that works nicely under these situations).

I can create Pull Request fixing it if that's ok.

rafapereirabr commented 3 years ago

Hi Lucas. Thanks for the heads up. I've found the problem. It's the Rio Grande do Sul state. For some reason, the original shape file of Rio Grande do Sul in the year 2000 comes as a GEOMETRYCOLLECTION

library(geobr)
library(sf)
library(magrittr)

rs <- geobr::read_state(code_state = 'RS', year = 2000, simplified = F)
head(rs)

Simple feature collection with 1 feature and 5 fields
geometry type:  GEOMETRYCOLLECTION
dimension:      XY
bbox:           xmin: -57.64332 ymin: -33.75158 xmax: -49.69114 ymax: -27.08011
geographic CRS: SIRGAS 2000
  code_state abbrev_state        name_state code_region name_region
1         43           RS Rio Grande Do Sul           4         Sul
                            geom
1 GEOMETRYCOLLECTION (POLYGON...

For some reason, I cannot convert it to a MULTIPOLYGON. This is the issue that needs to be solved. I can have a look into this next week. But please feel free to chip in if you have any ideas. Perhaps @pedro-andrade-inpe might also know how to address this.

rs2 <- sf::st_cast(rs, "MULTIPOLYGON")
> Error in m[1, ] : incorrect number of dimensions
lcgodoy commented 3 years ago

Yes, that's weird. I checked the geometries within this GEOMETRYCOLLECTION, one of them is a POLYGON (the map of RS itself) and the other is a LINESTRING. Bu running this

rs <- geobr::read_state("RS", year = 2000)
geom_col <- sf::st_geometry(rs)[1][[1]]
geom_col <- lapply(geom_col, sf::st_sfc)

plot(geom_col[[2]])
plot(geom_col[[1]], add = TRUE, border = 2)

plot(geom_col[[1]])
plot(geom_col[[2]], add = TRUE, col = 2)

it's possible to investigate these geometries.

A (messy) workaround is as follows

## the `st_cast` to `MULTIPOLYGON` is not necessary, I'm just keeping the same pattern as the other states)
rs <-
  sf::st_set_geometry(rs,
                      sf::st_cast(geom_col[[1]],
                                  "MULTIPOLYGON"))

However, an additional "if else" statement would be needed for the read_state function.

A better solution probably is to use to the do.call("rbind", files) instead of data.table::rbindlist in the load_dpkg function. (P.S.: I did not make a PR because it is just a matter of comment/uncomment a line of code)

rafapereirabr commented 3 years ago

Hi @lcgodoy . Thanks for the tip. The preferable solution here is really to clean the original data from IBGE. Not many people realize this, but the original files from IBGE can be quite messy and inconsistent over the years. One the contributions of geobr is precisely to harmonize and clean attributes, projetctions and geometries. I will fix this in the next couple of days.

On a side note, using data.table::rbindlist improves code performance by a lot. This makes a big difference whe the user downloads large data sets, such as all the municipalities or all the census tracts of Brazil.

rafapereirabr commented 3 years ago

Hi @lcgodoy . This issues has been fixed now. Thanks again for spotting this error and do let us know if you encounter any other issues.