r-spatial / sf

Simple Features for R
https://r-spatial.github.io/sf/
Other
1.34k stars 299 forks source link

sf::st_layers hanging R / bombing rstudio #1995

Closed dblodgett-usgs closed 1 year ago

dblodgett-usgs commented 2 years ago

When I try to sf::st_layers() on the attached gpkg, rstudio bombs. When I list from a terminal, I see:

> library(sf)
Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1; sf_use_s2() is TRUE
> sf::st_layers("nav_06.gpkg")
        - 'VirtualXPath'        [XML Path Language - XPath]
Driver: GPKG
Available layers:
        layer_name geometry_type features fields
1 unassigned_gages         Point        6     16
2     split_events         Point       91      6
3      POIs_tmp_06         Point     2201     12

Reproduce with:

tmpf <- tempfile()
download.file("https://github.com/r-spatial/sf/files/9537145/nav_06.zip", tmpf)
utils::unzip(tmpf, "nav_06.gpkg")
sf::st_layers("nav_06.gpkg")
``` GEOS GDAL proj.4 "3.9.1" "3.4.3" "7.2.1" GDAL_with_GEOS USE_PROJ_H PROJ "true" "true" "7.2.1" ``` ``` R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044) Matrix products: default locale: [1] LC_COLLATE=English_United States.utf8 [2] LC_CTYPE=English_United States.utf8 [3] LC_MONETARY=English_United States.utf8 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.utf8 attached base packages: [1] stats graphics grDevices utils datasets [6] methods base loaded via a namespace (and not attached): [1] compiler_4.2.1 tools_4.2.1 ``` Paste the output of your `sessionInfo()` and `sf::sf_extSoftVersion()`

nav_06.zip

kalibera commented 2 years ago

For build 5286-5107 of Rtools42 (and subsequent), could those testing please note that sf src/Makefile.ucrt needs updating to add in -lkea -lhdf5_hl -hdf5_cpp where -lhdf5_hl is now. In response to a user request, the KEA driver was added in this build of Rtools42. The same applies to terra, etc. rtools42-5253-5107-signed.exe is released Rtools42 matching the src/Makefile.ucrt files currently present in the source packages. @kalibera is there a way of conditioning on the Rtools build number in src/Makefile.ucrt - if so, do you have an example of a package using version conditioning?

Yes, the UCRT3 builds (using R-devel and unreleased version of Rtools42) automatically apply patches, which also adjust the linking, but if one tests using R 4.2. and unreleased Rtools42, one has to apply the patch manually. The patches used in the last build are available here: https://www.r-project.org/nosvn/winutf8/ucrt3/patches/CRAN/

The patches change the linking unconditionally. There is no prescribed way of conditioning as far as I know and it is a rare problem (4 packages currently have patches changing linking). Perhaps you could check for presence of files (headers, libraries), a bit like configure does, so e.g. check for kea. A new major release of R might require bigger changes, but then one could in principle conditionalize on R version. An installation of Rtools42 (from installer and toolchain bundle) has meta-data in files (.version, also list of MXE packages), but I would myself choose rather checks for presence of files as they are more general and portable. If anyone still had preference for the meta-data files, I think the right way would be to first discuss with the CRAN repository maintainers.

rsbivand commented 2 years ago

Ok, @kalibera , hot patching makes good sense, thanks.

rsbivand commented 2 years ago

Here are binary builds of terra and sf with the 5336 libgdal - one 7z archive for each of R-devel and R-4.2.1: https://drive.google.com/drive/folders/1aG5XsjkPAO35aUWLSImlB8_eWg47Gxno?usp=sharing. Please do not trust these too much, they pass R CMD check, but will not be identical with those from the actual build system. The package source code was manually patched for the reinstated KEA driver.

Gdrive folder updated with copies of the source tarballs and fresh builds of the binaries from these tarballs, about Sept. 15, 10:30 CEST.

kalibera commented 2 years ago

Thanks, I am now working on a bigger update of Rtools42, which will include more than this work-around, so the packages built on my system will get overwritten with even newer ones. So your builds may be good for reference if anyone needed to test only this change and nothing more.

rouault commented 2 years ago

https://github.com/OSGeo/gdal/pull/6359 removes the use of std::regex in the OGR SQLite&GPKG drivers

rsbivand commented 2 years ago

@rouault: thanks! As I read the commit, was the previous code looking in the field string rather than the table definition for the unique constraint declaration? A pity we didn't hit this issue before 3.5.2 was released.

dblodgett-usgs commented 2 years ago

@rsbivand -- sorry to be dense, how should I be installing these binary builds? I'm getting errors with devtools::install(... quick=TRUE)

rouault commented 2 years ago

in the field string rather than the table definition for the unique constraint declaration?

The previous version looked at fieldStr, which contained each column definition, extracted from tableDefinition, split on comma. Anyway the new version should hopefully be at least as good, and probably better, than the previous one, as I assume the previous one might have had issues if comma were found in column name or DEFAULT 'some default value, with comma' clauses.

rsbivand commented 2 years ago

@dblodgett-usgs install.packages("sf_1.0-9.zip")

dblodgett-usgs commented 2 years ago

🎉 You guys are 🧙 level.

 library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.5.0, PROJ 7.2.1; sf_use_s2() is TRUE
 nc <- read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
 names(nc)[4] <- "cnty_unique_id"
 write_sf(nc, "nc_test.gpkg")
 read_sf("nc_test.gpkg")
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> # A tibble: 100 × 15
#>     AREA PERIMETER CNTY_ cnty_u…¹ NAME  FIPS  FIPSNO CRESS…² BIR74 SID74 NWBIR74
#>    <dbl>     <dbl> <dbl>    <dbl> <chr> <chr>  <dbl>   <int> <dbl> <dbl>   <dbl>
#>  1 0.114      1.44  1825     1825 Ashe  37009  37009       5  1091     1      10
#>  2 0.061      1.23  1827     1827 Alle… 37005  37005       3   487     0      10
#>  3 0.143      1.63  1828     1828 Surry 37171  37171      86  3188     5     208
#>  4 0.07       2.97  1831     1831 Curr… 37053  37053      27   508     1     123
#>  5 0.153      2.21  1832     1832 Nort… 37131  37131      66  1421     9    1066
#>  6 0.097      1.67  1833     1833 Hert… 37091  37091      46  1452     7     954
#>  7 0.062      1.55  1834     1834 Camd… 37029  37029      15   286     0     115
#>  8 0.091      1.28  1835     1835 Gates 37073  37073      37   420     0     254
#>  9 0.118      1.42  1836     1836 Warr… 37185  37185      93   968     4     748
#> 10 0.124      1.43  1837     1837 Stok… 37169  37169      85  1612     1     160
#> # … with 90 more rows, 4 more variables: BIR79 <dbl>, SID79 <dbl>,
#> #   NWBIR79 <dbl>, geom <MULTIPOLYGON [°]>, and abbreviated variable names
#> #   ¹​cnty_unique_id, ²​CRESS_ID

 sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] sf_1.0-9
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.9.1       compiler_4.2.1     pillar_1.8.1       highr_0.9         
#>  [5] class_7.3-20       tools_4.2.1        digest_0.6.29      evaluate_0.16     
#>  [9] lifecycle_1.0.2    tibble_3.1.8       pkgconfig_2.0.3    rlang_1.0.5       
#> [13] reprex_2.0.2       DBI_1.1.3          cli_3.4.0          rstudioapi_0.14   
#> [17] yaml_2.3.5         xfun_0.32          fastmap_1.1.0      e1071_1.7-11      
#> [21] withr_2.5.0        stringr_1.4.1      dplyr_1.0.10       knitr_1.40        
#> [25] generics_0.1.3     fs_1.5.2           vctrs_0.4.1        tidyselect_1.1.2  
#> [29] classInt_0.4-7     grid_4.2.1         glue_1.6.2         R6_2.5.1          
#> [33] fansi_1.0.3        rmarkdown_2.16     purrr_0.3.4        magrittr_2.0.3    
#> [37] htmltools_0.5.3    units_0.8-0        KernSmooth_2.23-20 utf8_1.2.2        
#> [41] stringi_1.7.8      proxy_0.4-27

sf_extSoftVersion()
#>           GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H 
#>        "3.9.1"        "3.5.0"        "7.2.1"         "true"         "true" 
#>           PROJ 
#>        "7.2.1"

Created on 2022-09-14 with reprex v2.0.2

rsbivand commented 2 years ago

Thanks! I'll update the 7z's tomorrow with sources (done 15/9 10:30 CEST). Actually the issue has helped improve GDAL too, so a "unique" contribution.

kalibera commented 2 years ago

To catch things like this, sooner, on Windows, one would have to run tests in UTF-8, for which one would have to use UCRT and a UCRT toolchain, and one would have to set UTF-8 to be the native encoding via the fusion manifest (at build time via the manifest file, as R does it). It could be done using R/R packages and Rtools42 (MXE with some updates). There are number of patches that are now being done downstream to make gdal work (https://svn.r-project.org/R-dev-web/trunk/WindowsBuilds/winutf8/ucrt3/toolchain_libs/mxe/src/gdal-1-fixes.patch) and as always, it would be nice if the upstream code already could be built and used with UCRT without such downstream patching.

So, it would be nice if the upstream MXE could have up to date versions of gdal etc (not having to patch downstream in Rtools42) and if upstream gdal etc could work with UCRT/UTF-8 (not having to patch downstream in Rtools42). That would increase the chances problems are found in time and solved properly.

rsbivand commented 2 years ago

@kalibera This recalls https://blog.r-project.org/2019/03/28/use-of-c-in-packages/index.html, but here the problem has been rushing forward without awareness that saying std:: doesn't mean problem solved for ever across changing platforms. Now we probably need to screen both packages and external software for use of std::regex not only for UCRT but because it may be withdrawn in the future for all platforms; @eddelbuettel, might Rcpp screen relevant packages? For external software, we have Windows and macOS binary builds with our own builds of libraries, but otherwise we'll have to rely on the upstream packagers or library maintainers. Here the rapid response by @rouault is a great example.

kalibera commented 2 years ago

My blog was primarily against using C++ to interface with R, not against C++ in principle. If a library independent on R internally uses C++, without ever calling back to R (or allocating from R heap, throwing exceptions/jumping around R stack frames), that's ok. Yes, checking C++ in R packages for potential use of std::regex would be useful.

eddelbuettel commented 2 years ago

Yes, checking C++ in R packages for potential use of std::regex would be useful.

So let R CMD check do it. Rcpp has nothing to do with any of this, neither mandates nor promotes any C++ components. And of course only uses .Call(), a C interface, to communicate with R.

rsbivand commented 2 years ago

So CRAN and Bioc teams would be the ones to check.

rhijmans commented 2 years ago

Here is another GDAL issue that appears to be related to the R toolchain on Windows. In short, this HDF5 file can be read on linux but not windows. This problem happens with "terra" and also with "stars".

LINUX

x <- rast("hd2011010000.scu")
#Warning message:
#[rast] unknown extent
y <- x * 1

y
#class       : SpatRaster
#dimensions  : 536, 536, 1  (nrow, ncol, nlyr)
#resolution  : 1, 1  (x, y)
#extent      : 0, 536, 0, 536  (xmin, xmax, ymin, ymax)
#coord. ref. :
#source      : memory
#name        : hd2011010000
#min value   :         -999
#max value   :           15

WINDOWS

x <- rast("hd2011010000.scu")
#Warning message:
#[rast] unknown extent
y <- x * 1
#Warning messages:
#1: H5Dread() failed for block. (GDAL error 1) 
#2: HDF5:"hd2011010000.scu"://dataset_DXk/image, band 1: IReadBlock failed at X offset 0, Y offset 0: 
#                 H5Dread() failed for block. (GDAL error 1) 

y
#class       : SpatRaster 
#dimensions  : 536, 536, 1  (nrow, ncol, nlyr)
#resolution  : 1, 1  (x, y)
#extent      : 0, 536, 0, 536  (xmin, xmax, ymin, ymax)
#coord. ref. :  

The second set of warnings should really be an Error because y has no values. and that is what "stars" give you:

s <- stars::read_stars("hd2011010000.scu")
#Error in CPL_read_gdal(as.character(x), as.character(options), as.character(driver),  : 
#  read failure
#In addition: Warning messages:
#1: In CPL_read_gdal(as.character(x), as.character(options), as.character(driver),  :
#  GDAL Error 1: H5Dread() failed for block.
#2: In CPL_read_gdal(as.character(x), as.character(options), as.character(driver),  :
#  GDAL Error 1: HDF5:"hd2011010000.scu"://dataset_DXk/image, band 1: IReadBlock failed at X #offset 0, Y offset 0: H5Dread() failed for block.
rsbivand commented 2 years ago

@rhijmans which versions of the library used by the driver are being used? The windows version will be that of MXE in the ucrt3 builds, presumably released Rtools42, not my test build with the unreleased version of libgdal, which version in your linux, and how was thst gdal built and installed?

kalibera commented 2 years ago

Here is another GDAL issue that appears to be related to the R toolchain on Windows.

Please note that std::regex is unreliable also on other platforms (seen on macOS at least with R packages as I've learned now). It is not specific to Windows (nor to Rtools42).

rhijmans commented 2 years ago

@rsbivand: This fails on windows with the old (in R 3.6, GDAL 3.2.1) and new (UCRT) RTools using GDAL 3.4.3. It works for me on Ubuntu or Debian with GDAL 2.2.3, 2.4.0, and 3.4.0, and you showed that it works in Fedora with GDAL 3.5.2. (and you refer to the HDF lib versions as well, which may be helpful).

@kalibera: point well taken that the std::regex problems are not a Rtools issue. But this seems to be caused by how the HDF5 lib is built for R on windows.

@rouault solved a (perhaps) similar issue with HDF and OSGeo4W

rsbivand commented 2 years ago

@rhijmans From reading the OSGeo4W issue, it may be that there is something odd happening when HDF5 is static rather than dynamic too. It would be interesting to see whether the error seen in R on Windows binary packages can be reproduced without R and the packages using MXE-built GDAL apps. These are not distributed, so someone will have to run a custom cross-compilation. Can you establish whether there is a useful HDF5 test suite on Windows before the HDF5 libraries are used when GDAL is built? Are there issues raised about Windows builds for HDF5 itself?

kalibera commented 2 years ago

This thread may be hard to follow when this is no longer about the std::regex.

Still, I think the problem with HDF5 is that the szip is not supported (it is not part of MXE), but the file uses it. The output with the current (unreleased yet) UCRT3 Rtools42 build is:

> library(terra)
terra 1.6.17
> x <- rast("hd2011010000.scu")
Warning message:
[rast] unknown extent

> y <- x * 1
HDF5-DIAG: Error detected in HDF5 (1.12.0) thread 0:
  #000: ../../src/H5Dio.c line 192 in H5Dread(): can't read data
    major: Dataset
    minor: Read failed
  #001: ../../src/H5VLcallback.c line 2080 in H5VL_dataset_read(): dataset read failed
    major: Virtual Object Layer
    minor: Read failed
  #002: ../../src/H5VLcallback.c line 2046 in H5VL__dataset_read(): dataset read failed
    major: Virtual Object Layer
    minor: Read failed
  #003: ../../src/H5VLnative_dataset.c line 167 in H5VL__native_dataset_read(): can't read data
    major: Dataset
    minor: Read failed
  #004: ../../src/H5Dio.c line 567 in H5D__read(): can't read data
    major: Dataset
    minor: Read failed
  #005: ../../src/H5Dchunk.c line 2594 in H5D__chunk_read(): unable to read raw data chunk
    major: Low-level I/O
    minor: Read failed
  #006: ../../src/H5Dchunk.c line 3957 in H5D__chunk_lock(): data pipeline read failed
    major: Dataset
    minor: Filter operation failed
  #007: ../../src/H5Z.c line 1311 in H5Z_pipeline(): required filter 'szip' is not registered
    major: Data filters
    minor: Read failed
  #008: ../../src/H5PLint.c line 274 in H5PL_load(): search in path table failed
    major: Plugin for dynamically loaded library
    minor: Can't get value
  #009: ../../src/H5PLpath.c line 604 in H5PL__find_plugin_in_path_table(): search in path C:\Progra
mData\hdf5\lib\plugin encountered an error
    major: Plugin for dynamically loaded library
    minor: Can't get value
  #010: ../../src/H5PLpath.c line 734 in H5PL__find_plugin_in_path(): can't open directory
    major: Plugin for dynamically loaded library
    minor: Can't open directory or file
Warning messages:
1: H5Dread() failed for block. (GDAL error 1)
2: HDF5:"C:/msys64/home/tomas/hd2011010000.scu"://dataset_DXk/image, band 1: IReadBlock failed at X
offset 0, Y offset 0: H5Dread() failed for block. (GDAL error 1)

and szip is indeed disabled in the HDF5 build.

rsbivand commented 2 years ago

@kalibera Thanks! I followed up on https://github.com/rspatial/terra/issues/686 as the initial terra issue.