rstudio / pointblank

Data quality assessment and metadata reporting for data frames and database tables
https://rstudio.github.io/pointblank/

Checks for sf-objects #20

Open TSchiefer opened 5 years ago

TSchiefer commented 5 years ago

I wanted to check key properties of sf (and sfc) objects using rows_not_duplicated(). The check was supposed to ignore the geometry column of the object (cf. the second example in the reprex).

It seems that interrogate() runs into an error because of the way summarize() works on these objects.

Reprex example:

library(pointblank)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3

# Geometry object with 2 features
g <- rep(st_sfc(st_point(1:2)), 2)

# vector with 2 entries
v <- c("a", "b")

# object including both objects
mixed_obj <- st_sf("vector" = v, "points" = g)
mixed_obj
#> Simple feature collection with 2 features and 1 field
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 2 xmax: 1 ymax: 2
#> epsg (SRID):    NA
#> proj4string:    NA
#>   vector      points
#> 1      a POINT (1 2)
#> 2      b POINT (1 2)

agent <- create_agent()
agent %>% 
  focus_on("mixed_obj") %>% 
  rows_not_duplicated() %>% 
  interrogate()
#> Error: Can't coerce element 2 from a list to a double

# It already happens when I only check whether column "vector" is duplicated
# (likely because `sf`-objects have "sticky geometries")
agent <- create_agent()
agent %>% 
  focus_on("mixed_obj") %>% 
  rows_not_duplicated(cols = vector) %>% 
  interrogate()
#> Error: Can't coerce element 2 from a list to a double

Created on 2019-02-12 by the reprex package (v0.2.1)
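
For readers unfamiliar with the term, the "sticky geometry" behaviour mentioned in the reprex comment can be illustrated with a minimal sketch. It rebuilds the same mixed_obj and uses only column subsetting on the sf object plus sf::st_drop_geometry(); the names subset_cols and plain_cols are illustrative and not part of pointblank.

```r
library(sf)

# Same object as in the reprex
g <- rep(st_sfc(st_point(1:2)), 2)
mixed_obj <- st_sf("vector" = c("a", "b"), "points" = g)

# Selecting a single attribute column of an sf object keeps the geometry column
subset_cols <- names(mixed_obj["vector"])

# Dropping the geometry explicitly returns a plain data frame with attributes only
plain_cols <- names(sf::st_drop_geometry(mixed_obj))

print(subset_cols)
print(plain_cols)
```

This is presumably why restricting the check to `cols = vector` still ends up handling the geometry column.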

I think it happens in the following chunk of interrogate(), in the section "# Judge tables on expectation of non-duplicated rows":

      # Get total count of rows
      row_count <-
        table %>%
        dplyr::group_by() %>%
        dplyr::summarize(row_count = n()) %>%
        dplyr::as_tibble() %>%
        purrr::flatten_dbl()

My expectation would be that

  1. in the first case of the reprex (rows_not_duplicated(), without specifying columns) each whole row, including the geometry column, would be compared with the others.
  2. in the second case (rows_not_duplicated(cols = vector)) the check would be done only for the column "vector".

Perhaps a solution might be to call as_tibble() before group_by() and summarize()?
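
A rough sketch of that idea, outside of pointblank, might look as follows. It is not the package's implementation: the row count follows the as_tibble() suggestion, while the two duplicate checks are my own assumption about how the expectations could be met. In particular, comparing geometries through their WKT text (sf::st_as_text()) is a substitute technique chosen for illustration, and the names attrs_only, whole_rows, n_dup_vector and n_dup_rows are hypothetical.

```r
library(sf)
library(dplyr)

g <- rep(st_sfc(st_point(1:2)), 2)
mixed_obj <- st_sf("vector" = c("a", "b"), "points" = g)

# Convert to a plain tibble first so summarize() no longer uses the sf method
row_count <-
  mixed_obj %>%
  dplyr::as_tibble() %>%
  dplyr::summarize(row_count = dplyr::n()) %>%
  dplyr::pull(row_count)

# Case 2: uniqueness of the "vector" column only, ignoring the geometry
attrs_only <- sf::st_drop_geometry(mixed_obj)
n_dup_vector <- sum(duplicated(attrs_only$vector))

# Case 1: whole rows, with the geometry compared through its WKT text
# (assumption: textual equality is an acceptable notion of "same geometry")
whole_rows <- attrs_only
whole_rows$points_wkt <- sf::st_as_text(sf::st_geometry(mixed_obj))
n_dup_rows <- sum(duplicated(whole_rows))
```

With the reprex object, both rows share the same point but differ in "vector", so neither check should flag a duplicate.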

CC: @krlmlr

higgi13425 commented 2 years ago

Just following up on this with what seems to be a related issue: scan_data() does not appear to work on sf data. The bus_stops data was downloaded from https://data.a2gov.org/feeds/GIS/AATA%20BusStops/AATA_Bus_Stops.shp.xml

It just appears to stop with Error in sum(as.vector(t(collected))) : invalid 'type' (list) of argument

Example code (and error) below:

R> library(sf)
Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
R> bus_data <- sf::st_read('~/Downloads/AATABusStops/AATABusStops.shp')
Reading layer `AATABusStops' from data source `/Users/peterhiggins/Downloads/AATABusStops/AATABusStops.shp' using driver `ESRI Shapefile'
Simple feature collection with 1616 features and 12 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -84.02867 ymin: 42.21356 xmax: -83.48754 ymax: 42.32714
Geodetic CRS: NAD83
R> pointblank::scan_data(bus_data)

── Data Scan started. Processing 6 sections. ───
ℹ Starting assembly of 'Overview' section...
Error in sum(as.vector(t(collected))) : invalid 'type' (list) of argument
R> class(bus_data)
[1] "sf" "data.frame"