mdsumner / azure-filesystem-abstraction

a parquet file rescued from azure

Download io-lulc-9-class.parquet in R #1

ctoney opened 5 months ago

ctoney commented 5 months ago

FWIW, we can copy it in R with:

library(gdalraster)

set_config_option("AZURE_STORAGE_ACCOUNT", "pcstacitems")
f <- "/vsiaz/items/io-lulc-9-class.parquet"

# get token with
# https://planetarycomputer.microsoft.com/api/sas/v1/sign?href=https://pcstacitems.blob.core.windows.net/items/io-lulc-9-class.parquet

set_config_option("AZURE_STORAGE_SAS_TOKEN", "st=2024...")  # your token
vsi_stat(f)
#> [1] TRUE
vsi_stat(f, "size")
#> integer64
#> [1] 16470860
vsi_copy_file(f, "data/io-lulc-9-class.parquet", show_progress = TRUE)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
#> [1] 0

Maybe we can use the URL signing mechanism with GDAL 3.9... VSICURL_PC_URL_SIGNING set to YES with gdalraster::vsi_set_path_option(). I haven't tried yet.
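For reference, a hedged sketch of what that could look like (untested, as noted, and it assumes VSICURL_PC_URL_SIGNING is honored as a path-specific option):

vsi_set_path_option(
  "/vsicurl/https://pcstacitems.blob.core.windows.net/items",
  "VSICURL_PC_URL_SIGNING", "YES")
vsi_stat("/vsicurl/https://pcstacitems.blob.core.windows.net/items/io-lulc-9-class.parquet")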

ctoney commented 5 months ago

VSI for the case of a partitioned Parquet dataset: sentinel-2-l2a.parquet is a directory:

# e.g., sentinel-2-l2a partitioned by week

# token
# https://planetarycomputer.microsoft.com/api/sas/v1/sign?href=https://pcstacitems.blob.core.windows.net/items/sentinel-2-l2a.parquet

set_config_option("AZURE_STORAGE_SAS_TOKEN", "st=...")  # the token

f <- "/vsiaz/items/sentinel-2-l2a.parquet"
vsi_stat(f)
#> [1] TRUE
vsi_stat(f, "type")
#> [1] "dir"
head(vsi_read_dir(f))
#> [1] "part-0001_2015-06-29T10:25:31+00:00_2015-07-06T10:25:31+00:00.parquet"
#> [2] "part-0002_2015-07-06T10:25:31+00:00_2015-07-13T10:25:31+00:00.parquet"
#> [3] "part-0003_2015-07-13T10:25:31+00:00_2015-07-20T10:25:31+00:00.parquet"
#> [4] "part-0004_2015-07-20T10:25:31+00:00_2015-07-27T10:25:31+00:00.parquet"
#> [5] "part-0005_2015-07-27T10:25:31+00:00_2015-08-03T10:25:31+00:00.parquet"
#> [6] "part-0006_2015-08-03T10:25:31+00:00_2015-08-10T10:25:31+00:00.parquet"

f_week1 <- file.path(f, vsi_read_dir(f)[1])
vsi_stat(f_week1)
#> [1] TRUE
vsi_stat(f_week1, "type")
#> [1] "file"
vsi_stat(f_week1, "size")
#> integer64
#> [1] 664144
# vsi_copy_file(...)
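
To mirror the whole partitioned dataset rather than a single part, something like vsi_sync() should work (a sketch, untested here; the full dataset is large):

# recursively copy the partitioned dataset to a local directory
# vsi_sync(paste0(f, "/"), "data/sentinel-2-l2a.parquet/", show_progress = TRUE)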
mdsumner commented 5 months ago

ah nice, yes I'm on a journey of discovery, finally realizing that abfs etc. are all bucket syntax ... honestly I had no idea, it all seemed a grab bag of confusing protocols!

I only needed to use CPL_DEBUG to see what GDAL does to get a real url lol ...

It's confusing how this all interacts (and when it needs to be set by a package vs when an env var is enough ...)

this works under some conditions that I'm unclear about now; I can't reproduce it, but it worked ...

terra::vect("/vsicurl/https://pcstacitems.blob.core.windows.net/items/io-lulc-9-class.parquet")

(I'll explore all these things in time; awesome to see that we can leverage GDAL abstractions to download stuff and explore virtual file systems, I hadn't gotten to that yet!)

JosiahParry commented 5 months ago

I wanted to explore this a bit because it looked fun. I don't understand Azure blob storage....

# install via CRAN 
library(rstac)

stc <- stac("https://planetarycomputer.microsoft.com/api/stac/v1/")

all_assets <- stc |> 
  collections("io-lulc-9-class") |> 
  get_request() |> 
  assets_select()

# connect to blob storage?
acc_name <- all_assets$assets$`geoparquet-items`$`table:storage_options`$account_name
container_name <- all_assets$`msft:container`
blob_path <- glue::glue("https://{acc_name}.blob.core.windows.net/{container_name}")

AzureStor::blob_container(blob_path)
ctoney commented 5 months ago

For the Parquet file:

From the original blog post (https://blog.rtwilson.com/accessing-planetary-computer-stac-files-in-duckdb/)

Unfortunately, though, the URL provided by the STAC catalog looks like this: abfs://items/io-lulc-9-class.parquet

and

then set up the Azure connection: a = AzureBlobFileSystem(account_name='pcstacitems', sas_token=asset.extra_fields["table:storage_options"]['credential'])

So we have storage account: pcstacitems, container: items

From https://gdal.org/user/virtual_file_systems.html#vsiaz-microsoft-azure-blob-files

Recognized filenames are of the form /vsiaz/container/key, where container is the name of the container and key is the object "key", i.e. a filename potentially containing subdirectories.

and

AZURE_STORAGE_ACCOUNT=value: Specifies storage account name.
AZURE_STORAGE_SAS_TOKEN=value: (GDAL >= 3.2) Shared Access Signature.

We get a SAS token from the tokens API endpoint, documented here: https://planetarycomputer.microsoft.com/docs/concepts/sas/

The endpoint is https://planetarycomputer.microsoft.com/api/sas/v1/token/{storage_account}/{container}

so we need https://planetarycomputer.microsoft.com/api/sas/v1/token/pcstacitems/items

returns, for example (we use the value of "token"): {"msft:expiry":"2024-06-19T05:10:02Z","token":"st=2024-06-18T04%3A25%3A02Z&se=2024-06-19T05%3A10%3A02Z&sp=rl&sv=2024-05-04&sr=c&skoid=9c8ff44a-6a2c-4dfb-b298-1c9212f64d9a&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2024-06-17T02%3A00%3A44Z&ske=2024-06-24T02%3A00%3A44Z&sks=b&skv=2024-05-04&sig=2qIItS6tQ%2BcnR59Khrb1jXDYeIRXBbHn49XBl/YEunU%3D"}

This all translates to:

library(gdalraster)

set_config_option("AZURE_STORAGE_ACCOUNT", "pcstacitems")
f <- "/vsiaz/items/io-lulc-9-class.parquet"  # container/key, i.e., container/filename
set_config_option("AZURE_STORAGE_SAS_TOKEN", "st=2024...")  # fresh token string obtained as above

vsi_stat(f)  # exists?
vsi_stat(f, "type")
vsi_stat(f, "size")

# vsi_get_file_metadata(), vsi_copy_file(), ...
# or with directories, vsi_read_dir(), vsi_sync(), ...

To access the raster files instead:

The code from @JosiahParry above had all_assets, whose links include the items endpoint:

all_assets$links
#> 2 [items] (https://planetarycomputer.microsoft.com/api/stac/v1/collections/io-lulc-9-class/items)

which gives

"assets":{"data":{"href":"https://ai4edataeuwest.blob.core.windows.net/io-lulc/nine-class/60W_20220101-20230101.tif","file:size":68031971,"raster:bands":[{"nodata":0,"spatial_resolution":10}],"type":"image/tiff; application=geotiff; profile=cloud-optimized","roles":["data"],"file:values":[{"values":[0],"summary":"No Data"},{"values":[1],"summary":"Water"},{"values":[2],"summary":"Trees"},{"values":[4],"summary":"Flooded vegetation"},{"values":[5],"summary":"Crops"},{"values":[7],"summary":"Built area"},{"values":[8],"summary":"Bare ground"},{"values":[9],"summary":"Snow/ice"},{"values":[10],"summary":"Clouds"},{"values":[11],"summary":"Rangeland"}]}

storage account: ai4edataeuwest, container: io-lulc

Get a SAS token with: https://planetarycomputer.microsoft.com/api/sas/v1/token/ai4edataeuwest/io-lulc

set_config_option("AZURE_STORAGE_ACCOUNT", "ai4edataeuwest")
f <- "/vsiaz/io-lulc/nine-class"  # container/key, i.e., container/directory
set_config_option("AZURE_STORAGE_SAS_TOKEN", "st=2024...")  # fresh token string obtained as above

vsi_stat(f)  # exists?
vsi_stat(f, "type")  # "dir"
dirlist <- vsi_read_dir(f)
length(dirlist)
head(dirlist)
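
As a sketch of where that can go (untested here; it assumes the SAS token set above is still valid, and the exact GDALRaster accessors may vary by gdalraster version), one of the listed COGs can be opened directly without downloading it:

tif <- file.path(f, dirlist[1])
ds <- new(GDALRaster, tif)
ds$dim()    # xsize, ysize, number of bands
ds$close()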

The point of doing this with the VSI functions is access at the file system level. That implies specific use cases, but that was the problem originally posed by @mdsumner. He already solved it, but I wanted to point out that we can also do it all in R, and more, such as traverse/list directory structures. I was getting the SAS token with a web browser, but it would be straightforward to write a helper function since it's just an API request by either STAC collection ID or the Azure storage account/container (https://planetarycomputer.microsoft.com/docs/concepts/sas/).
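
A minimal sketch of such a helper (the function name pc_sas_token is made up here; it just wraps the token endpoint documented above):

pc_sas_token <- function(storage_account, container) {
  url <- sprintf(
    "https://planetarycomputer.microsoft.com/api/sas/v1/token/%s/%s",
    storage_account, container)
  jsonlite::fromJSON(url)$token
}

# e.g.
# set_config_option("AZURE_STORAGE_SAS_TOKEN", pc_sas_token("pcstacitems", "items"))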

The benefit of using the GDAL VSI functions is to abstract away most details of the different storage systems. Setting credentials is a "format" specific detail, but otherwise the code is the same for Azure Blob, Azure Data Lake, AWS, GCS, URLs, etc.

From package AzureStor description:

On the client side, it includes an interface to blob storage, file storage, and 'Azure Data Lake Storage Gen2': upload and download files and blobs; list containers and files/blobs; create containers; and so on. Authenticated access to storage is supported, via either a shared access key or a shared access signature (SAS).

For that piece (the client-side file system operations), GDAL VSI does this without needing different packages for Azure, AWS, Google, ..., the local file system, or in-memory files (as long as we can set credentials).
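
Purely as illustration (the non-Azure paths below are placeholders, not real locations, and each backend still needs its own credentials configured first), the file system calls themselves stay the same:

vsi_stat("/vsiaz/items/io-lulc-9-class.parquet", "size")        # Azure Blob
vsi_stat("/vsis3/some-bucket/some-file.tif", "size")            # AWS S3 (placeholder)
vsi_stat("/vsigs/some-bucket/some-file.tif", "size")            # Google Cloud Storage (placeholder)
vsi_stat("/vsicurl/https://example.com/some-file.tif", "size")  # plain HTTP (placeholder)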

goergen95 commented 5 months ago

Here's yet another one! Seems to work fine with a raster, but not with a vector data set?!

library(gdalraster)

f1 <- "/vsicurl?pc_url_signing=yes&pc_collection=ai4edataeuwest&url=https://ai4edataeuwest.blob.core.windows.net/io-lulc/nine-class/60W_20220101-20230101.tif"
vsi_stat(f1)
vsi_stat(f1, "size")
terra::rast(f1)

f2 <- "/vsicurl?pc_url_signing=yes&pc_collection=pcstacitems&url=https://pcstacitems.blob.core.windows.net/items/io-lulc-9-class.parquet"
vsi_stat(f2)
vsi_stat(f2, "size")
terra::vect(f2)

I also use the fact that with GDAL >= 3.5, prefix-specific credentials can be set in the GDAL config file.

ctoney commented 5 months ago

Thanks @goergen95. I'm curious to know more.

Apparently, we do not need signing for the GeoTiffs. I can do the following without using credentials:

f <- "/vsicurl/https://ai4edataeuwest.blob.core.windows.net/io-lulc/nine-class/60W_20220101-20230101.tif"
vsi_stat(f)
#> [1] TRUE
vsi_stat(f, "size")
#> integer64
#> [1] 68031971

But accessing the parquet dataset seems to need credentials

f <- "/vsicurl/https://pcstacitems.blob.core.windows.net/items/io-lulc-9-class.parquet"
vsi_stat(f)
#> [1] FALSE

but I'm not sure how to make pc_url_signing=yes work

f2 <- "/vsicurl?pc_url_signing=yes&pc_collection=pcstacitems&url=https://pcstacitems.blob.core.windows.net/items/io-lulc-9-class.parquet"
vsi_stat(f2)
#> [1] FALSE

Are you using the GDAL Config file to set AZURE_STORAGE_SAS_TOKEN or other credential? I'm not sure how to set the credential (i.e., SAS token) when using /vsicurl/ instead of /vsiaz/.

mdsumner commented 5 months ago

this worked for me

"/vsicurl?pc_url_signing=yes&url=https://pcstacitems.blob.core.windows.net/items/io-lulc-9-class.parquet"

no need for the collection in that expanded url

and for the shorter form I needed (I'm keeping notes for self as this was confusing me again yesterday)

export AZURE_STORAGE_ACCOUNT=pcstacitems
export AZURE_STORAGE_SAS_TOKEN="st=2024-..."
ogrinfo "/vsiaz/items/io-lulc-9-class.parquet"                                                                                                                          #INFO: Open of `/vsiaz/items/io-lulc-9-class.parquet'
 #     using driver `Parquet' successful.
#1: io-lulc-9-class (Multi Polygon)
mdsumner commented 5 months ago

I don't really understand what signing is for when you can automate around it; also, yes, it's weird that the tifs don't need it but the index does ... it just seems like a weird limitation.

My ranty stuff is about there being this huge python infrastructure to hide these details, which are actually quite simple. I think the tower is a bit too involved now, and it doesn't help at all for doing this stuff in R.

Also, where are the libs? GDAL is now a complex of powerful C++ libraries of abstractions, making it a key touchstone across languages. Python is way ahead of R in terms of these wrappers, but do we really need them defined in each language ... The power of a single-file index for a world of data in streamable Parquet is immense, and we can use that with generic tools.

I think we need to agitate for resourcing key libraries, imagine an xarray or fsspec or kerchunk or zarr lib that was common across languages, that would be amazing.

goergen95 commented 5 months ago

I took the pc_url_signing=yes approach from the vsicurl docs. I also found some tests for this feature. Apparently, the syntax is right but it is still not working in this case?

I have not used the GDAL config approach in the cases above. I guess it's more of a feature for (more or less) static secrets. The neat thing about it, though, is that we can set credentials for multiple path prefixes, allowing us to seamlessly do e.g. gdal_translate /vsis3/from/here.tif /vsis3/to/here.tif with both paths requiring credentials, as long as we have:

[credentials]

[.from-s3-bucket]
path = /vsis3/from
AWS_ACCESS_KEY_ID=<key id>
AWS_SECRET_ACCESS_KEY=<secret>
...

[.to-s3-bucket]
path = /vsis3/to
AWS_ACCESS_KEY_ID=<key id>
AWS_SECRET_ACCESS_KEY=<secret>
...
ctoney commented 5 months ago

Seems to work fine with a raster, but not with a vector data set?!

If it's failing at terra::vect(f2), possibly you don't have the GDAL Parquet driver? It is not built in by default. It requires the Parquet component of the Apache Arrow C++ library, and currently the common package distributions like ubuntugis-unstable and RTools/CRAN binaries do not include that driver, AFAIK. You could check with gdalraster::gdal_formats("Parquet") or gdalinfo --formats | grep Parquet.

For path-specific options, we also have vsi_set_path_option() / vsi_clear_path_options() in case it's ever helpful to do that programmatically.
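
For example, a sketch of attaching the SAS token only to the Planetary Computer prefix (untested; assumes credentials are honored as path-specific options, which requires a recent GDAL):

vsi_set_path_option("/vsiaz/items", "AZURE_STORAGE_SAS_TOKEN", "st=2024...")  # your token
# ... work with /vsiaz/items/... as above ...
vsi_clear_path_options("/vsiaz/items")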

mdsumner commented 5 months ago

I'm also using very latest build of GDAL so maybe yours doesn't have the right support. There's been a lot of new fixes and features recently.

goergen95 commented 5 months ago

Thanks! It was the missing driver.

@mdsumner, correct me if I am wrong, but I was under the impression that most Python packages leverage fsspec to get access to different filesystems, not GDAL. The reason I am tinkering a lot with GDAL lately is that I would actually need something like fsspec in R, without requiring me to depend on a gazillion new packages and maintain code for each cloud provider. I think right now that is missing in R, but for geo-data it is already built in with GDAL's virtual filesystems - so it is just the "easy" thing to do for me right now. In that context, I am not sure what you are proposing above. Are you advocating for a cross-lang fsspec, a cross-lang use of GDAL's VSI drivers, or something else completely? :smile:

ctoney commented 5 months ago

@mdsumner, definitely agree with all that.

I also find it odd that the tiffs don't require a token but the parquet does, which doesn't really make sense given the explanations in the documentation. In the context of that weirdness, I'm still trying to interpret the doc at https://planetarycomputer.microsoft.com/docs/concepts/sas/, e.g., the section "Rate limits and access restrictions". You don't need a subscription key to use it, but it sounds like it works better if you have one.

mdsumner commented 5 months ago

@mdsumner, correct me if I am wrong, but I was under the impression that most Python packages leverage fsspec to have access to different filesystems, not GDAL.

That's right. I guess I'm saying that GDAL sits there with huge potential as a kind-of-fsspec that's already available in R (especially with the VSI support in gdalraster! and more that we could expose).

In an ideal world, fsspec would be a C++ library (or maybe better, Rust!) and we could provide identical support to leverage it in R and Python (and elsewhere). (It's a kind of uniquely-Python issue, that so many things are built there as if everyone always uses it.) The same problems exist in R, but I think we're further down the road of having cleaner exposure of underlying libs.

Are you advocating for a cross-lang fsspec, a cross-lang use of GDAL's VSI drivers, or something else completely? 😄

A bit of both. I think there's a need for more understanding of the current situation(s); you see a lot of dialogue that is clearly through a downstream lens: look at sf or terra or rasterio and many others. (I "understand" a lot, but I'm still moving very quickly to understand a lot more; I have increasing confidence that I know what I'm talking about, actually. I'm looking for allies to pull together a bit of a story and show-and-tell too.)

I've had pushback in different situations ... GDAL is best via rasterio/fiona because there's no cross-platform support for osgeo.gdal/ogr ... xarray is truly multidimensional, not just geospatial raster (ok yes I know, but so are R arrays, just R doesn't have a lazy-fied array model like xarray or like the dbplyr/DBI abstraction). We can fix these things and I think we need a better community rallying to identify them. (Python is good at its own rallying, but doesn't listen to R folks very much).

mdsumner commented 5 months ago

I also wonder about the GDAL /vsiaz/ and other protocols. Why doesn't GDAL work with abfs:// ? Should it?

mdsumner commented 1 month ago

some of this should work; I think I got rate limited (the sentinel-2-l2a parquet is 30 GB or more)

the GeoParquet STAC in Python

#abfs://items/sentinel-2-l2a.parquet
collection = "sentinel-2-l2a"
from pystac_client import Client
import planetary_computer
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

import geopandas
catalog = Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/",
    modifier=planetary_computer.sign_inplace,
)

asset = catalog.get_collection(collection).assets["geoparquet-items"]

df = geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)

df.to_parquet(f"{collection}.parquet")
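
Once the Python step above has written the local file, it can be read back in R as well (a sketch assuming the arrow package; note the file is large, and the geometry column comes back as plain binary WKB unless a geo-aware reader is used):

library(arrow)
tbl <- read_parquet("sentinel-2-l2a.parquet")
dim(tbl)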

the general PC parquet items in R

Sys.setenv("AZURE_STORAGE_ACCOUNT" = "pcstacitems")
collection <- "sentinel-2-l2a"; 
#collection <- "io-lulc-9-class"; 

token <- jsonlite::fromJSON(readLines(glue::glue("https://planetarycomputer.microsoft.com/api/sas/v1/token/{collection}")))$token
Sys.setenv(AZURE_STORAGE_SAS_TOKEN = token)

library(gdalraster)

dsn <- glue::glue("/vsicurl?pc_url_signing=yes&url=https://pcstacitems.blob.core.windows.net/items/{collection}.parquet")

new(GDALVector, dsn)
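
If the open succeeds, a possible continuation (untested here given the rate limiting above, and the GDALVector API was still new at the time):

lyr <- new(GDALVector, dsn)
lyr$getFeatureCount()  # may be slow for a remote dataset this size
lyr$close()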