sfirke / janitor

simple tools for data cleaning in R
http://sfirke.github.io/janitor/

`clean_names.sf()` does not recognize `SHAPE` column as a `sfc` column when it contains multiple Geometry Types #578

Closed ar-puuk closed 4 days ago

ar-puuk commented 1 week ago

Feature Requests/Bug Report

While geometry is the usual name of the sfc_GEOMETRY column in an sf object, sf data loaded from an ESRI geodatabase has its sfc_GEOMETRY in a column called SHAPE (or sometimes shape). When clean_names() is applied to such an object, it renames SHAPE to shape, which breaks the sf object and throws the following error when I try to view() it after cleaning the names. I assume that because the sfc column has been renamed, sf no longer recognizes the geometry column.

> object <- read_sf(
+    file.path("some/file/path.gdb"),
+    layer = "layer"
+    ) %>%
+    clean_names()

> view(object)
Error in st_geometry.sf(x) : 
  attr(obj, "sf_column") does not point to a geometry column.
Did you rename it, without setting st_geometry(obj) <- "newname"?

One workaround I have been using is to rename the sfc column to geometry from SHAPE before using clean_names().

> object <- read_sf(
+    file.path("some/file/path.gdb"),
+    layer = "layer"
+    ) %>%
+    rename(`geometry` = `SHAPE`) %>%
+    clean_names()

> view(object)

Would it be possible to internally identify SHAPE as the sfc_GEOMETRY column when geometry doesn't exist in the sf object, so that clean_names() doesn't rename it and break the object?

billdenney commented 1 week ago

I think that is readily possible. Can you please provide a small file as a reproducible example?

ar-puuk commented 1 week ago

I was apparently not entirely correct. The problem (as far as I understand it so far) is not about the name of the sfc_GEOMETRY column (SHAPE or geometry) or the source (SHP or GDB). I tried other GDB layers that have a SHAPE column and didn't get any issues.

But from what I just found out, it might have been because the sf object I am working with has multiple geometry types in a single sf object. However, it is still strange that the same error doesn't show up when the sfc column is named geometry rather than SHAPE.

This is the code:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(readr)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test

projects <- readr::read_rds("D:/Projects.rds")

st_geometry_type(projects)
#>  [1] MULTILINESTRING MULTILINESTRING MULTIPOINT      MULTIPOINT     
#>  [5] MULTILINESTRING MULTILINESTRING MULTIPOINT      MULTIPOINT     
#>  [9] MULTILINESTRING MULTILINESTRING
#> 18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE

projects_shape <- projects %>% 
  janitor::clean_names()

View(projects_shape) # Does not work

project_shape_transformed <- projects_shape %>% 
  sf::st_transform("EPSG:3857") # Does not work
#> Error in st_geometry.sf(x): attr(obj, "sf_column") does not point to a geometry column.
#> Did you rename it, without setting st_geometry(obj) <- "newname"?

projects_geometry <- projects %>% 
  dplyr::rename(`geometry` = `SHAPE`) %>% 
  janitor::clean_names()

View(projects_geometry) # works

project_geometry_transformed <- projects_geometry %>% 
  sf::st_transform("EPSG:3857") # Works

Created on 2024-09-07 with reprex v2.1.1

Since the problem seems to originate from having multiple geometry types rather than from the GIS file format, I am attaching the sf object as a zipped RDS file (cannot upload RDS directly). Projects.zip

billdenney commented 4 days ago

The underlying issue was that we renamed the column that attr(obj, "sf_column") points to, because it wasn't the last column. So I generalized the code to look at that attribute instead. Please let me know if that fixes the issue for you, and if so, we can merge it (assuming that the currently-running tests complete without issue).
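
For readers following along, here is a minimal sketch of the idea described above: look up the geometry column through the sf_column attribute and clean every other name, wherever the geometry column sits. The helper name clean_names_keep_geometry() and the small projects object are made up for illustration. This is an assumption about the general approach, not the code that was merged into janitor.

```r
library(sf)
library(janitor)

# Hypothetical helper for illustration; not janitor's actual implementation.
clean_names_keep_geometry <- function(dat, ...) {
  stopifnot(inherits(dat, "sf"))
  # Name of the active geometry column, as recorded by sf.
  geom_col <- attr(dat, "sf_column")
  # Clean every column name except the geometry column, regardless of position.
  to_clean <- which(names(dat) != geom_col)
  names(dat)[to_clean] <- janitor::make_clean_names(names(dat)[to_clean], ...)
  dat
}

# Made-up data: mixed geometry types, with the geometry column named SHAPE
# and deliberately not moved to the last position (sfc_last = FALSE).
shape <- st_sfc(
  st_multipoint(rbind(c(0, 0), c(1, 1))),
  st_multilinestring(list(rbind(c(0, 0), c(1, 1)))),
  crs = 4326
)
projects <- st_sf(SHAPE = shape, ProjectID = 1:2, sfc_last = FALSE)

cleaned <- clean_names_keep_geometry(projects)
names(cleaned)

# The sf_column attribute still points at an existing column,
# so geometry-aware functions keep working.
transformed <- st_transform(cleaned, "EPSG:3857")
```

One caveat with this sketch: because the geometry name is left out of make_clean_names(), a cleaned attribute name could in principle collide with it (for example a separate column that cleans to the same name), which a real implementation would need to handle.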

ar-puuk commented 4 days ago

@billdenney That seems to have solved the issue. Thank you so much.