wcjochem / sfarrow

R package for reading/writing `sf` objects from/to parquet files with `arrow`.
https://wcjochem.github.io/sfarrow/

"snappy" support #10

Closed darribas closed 2 years ago

darribas commented 3 years ago

I'm trying the following to interface geopandas and sf:

import geopandas

db = geopandas.read_file(
    "http://d2ad6b4ur7yvpq.cloudfront.net/"
    "naturalearth-3.3.0/ne_110m_land.geojson"
)
db.to_parquet("db.pq")

Then on R:

library(sf)
library(sfarrow)

db <- st_read_parquet("db.pq")

And I get the following error:

Error: IOError: NotImplemented: Support for codec 'snappy' not built

I suspect this refers to the codec geopandas uses when writing arrow objects to a parquet file, but I'm not sure what the differences would be.

cc'ing @jorisvandenbossche and @martinfleis in case they have any clues

martinfleis commented 3 years ago

This refers to the default parquet compression used in (geo)pandas, which is 'snappy'. The accepted values are {'snappy', 'gzip', 'brotli', None}, so you can try one of the others, such as 'gzip' or None.
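To make that concrete, here is a minimal sketch of writing the file with an explicit `compression` argument. `write_parquet_for_r` and `ALLOWED_COMPRESSION` are hypothetical names used only for illustration, not part of geopandas. The only assumption is that the object passed in has a `to_parquet(path, compression=...)` method, as a GeoDataFrame does.

```python
# Compression values listed in the comment above; None disables compression.
ALLOWED_COMPRESSION = {"snappy", "gzip", "brotli", None}


def write_parquet_for_r(gdf, path, compression=None):
    """Write `gdf` to Parquet with an explicit compression setting.

    Hypothetical helper: it only forwards to `gdf.to_parquet`, defaulting to
    no compression so the file does not depend on the snappy codec.
    """
    if compression not in ALLOWED_COMPRESSION:
        raise ValueError(f"unsupported compression: {compression!r}")
    gdf.to_parquet(path, compression=compression)
    return path


# Usage with the GeoDataFrame from the original post (requires geopandas):
#   import geopandas
#   db = geopandas.read_file("...ne_110m_land.geojson")
#   write_parquet_for_r(db, "db.pq")                      # uncompressed
#   write_parquet_for_r(db, "db.pq", compression="gzip")  # gzip instead of snappy
```

Whether 'gzip' or 'brotli' can be read on the R side still depends on which codecs the local arrow build includes, so None is the safest choice for a first test.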

darribas commented 3 years ago

OK, I can confirm that writing from geopandas with compression=None works with sfarrow.

wcjochem commented 3 years ago

Thanks for raising this. Just to add that compression does work within sfarrow, but it depends on your arrow installation and on whether the codecs are properly installed. I've had some past issues with arrow on my Ubuntu system. This is also discussed here: https://arrow.apache.org/docs/r/articles/install.html

darribas commented 3 years ago

Super! I installed sfarrow on top of an already installed R stack (version 6.1 of the gds_env). For the next release of the container, though, I'd love to get this packaged and properly installed. I suspect installing it at the same time as sf, rgeos, etc. will take care of conflicts, but I might reach out if I run into issues...

darribas commented 3 years ago

Question in the meantime: do you know if the arrow binaries get installed when you install sfarrow, or is arrow a dependency you have to deal with on your own beforehand?

wcjochem commented 3 years ago

The arrow package should automatically get installed when sfarrow is installed if it's not found. The binaries for the Arrow C++ library should be installed by the arrow R package installation according to https://arrow.apache.org/docs/r/#installation. You (hopefully) shouldn't have to deal with any of it on your own before installing sfarrow.

However, in my experience, full Arrow support isn't always included with the default installation of the arrow package. In particular, support for the 'snappy' codec that you mentioned earlier (and for other compression codecs) can be missing.

On Ubuntu 20.04, I have to use the following within R to get the Arrow library and arrow package to support snappy:

Sys.setenv(ARROW_S3="ON")
Sys.setenv(NOT_CRAN="true")
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")

From: https://stackoverflow.com/questions/64937524/r-arrow-error-support-for-codec-snappy-not-built

I hope that helps.

darribas commented 2 years ago

Just a quick update: I can confirm the strategy above seems to work fine, and it will ship in gds_env:7.0.