grimbough / Rarr

A simple native R reader for Zarr Arrays
https://bioconductor.org/packages/Rarr/
MIT License

arrays with nonconforming metadata fields? (causing `.decompress_chunk` error) #10

Open daauerbach opened 1 day ago

daauerbach commented 1 day ago

first, thank you for this package @grimbough!

[EDIT - see below update. Some of my guessing here is off a little, but the issue is still valid]

I'm getting

```
Error in .decompress_chunk(compressed_chunk, metadata) :
  zstd decompression error - error code: -70 (Destination buffer is too small)
```

But I think this is due to `NULL` getting assigned to the datatype, which then breaks the `switch()` call inside `get_chunk_size()` when declaring `buffer_size`. That in turn breaks the decompression via `.Call("decompress_chunk_ZSTD", ...)`.
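A minimal standalone illustration (not Rarr's actual code, and the byte sizes here are my own stand-ins) of why a `NULL` datatype aborts at that point: `switch()` in R requires a length-1 `EXPR`, so the buffer-size calculation errors out before decompression ever sees a correct size.

```r
# Hypothetical sketch: dispatch on a datatype name to get bytes per element.
# switch() needs a length-1 character/integer EXPR, so NULL aborts it.
buffer_size_for <- function(base_type, n_elements) {
  bytes_per_element <- switch(base_type,
    "int32"   = 4L,
    "int64"   = 8L,
    "float64" = 8L,
    stop("unrecognised datatype")
  )
  bytes_per_element * n_elements
}

buffer_size_for("int64", 10)       # 80
try(buffer_size_for(NULL, 10))     # Error: EXPR must be a length 1 vector
```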

Assuming that's all correct, it looks like `Rarr:::read_array_metadata` and the underlying `.parse_datatype` are ultimately where things start. Hopefully this reproduces for you:

```r
# failing as-is
array_path <- "https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/CONUS/zarr/chrtout.zarr/feature_id"
metadata <- Rarr:::read_array_metadata(array_path)
#decompressor <- metadata$compressor$id
# decompressor == "zstd", previously/above 'lz4'

# fails: NULL -- the key is '$dtype' and the value is '<i8', which breaks get_chunk_size()
datatype <- metadata$datatype

# still wrong: returns the string "<i8"
datatype <- metadata$dtype

# should be a list, from:
datatype <- Rarr:::.parse_datatype(metadata$dtype)
# which allows the declaration:
buffer_size <- Rarr:::get_chunk_size(datatype, dimensions = metadata$chunks)
```
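For anyone following along, a hypothetical sketch of what parsing a NumPy-style dtype string like `"<i8"` involves -- the field names below are my own invention, not necessarily what `.parse_datatype()` actually returns:

```r
# Hypothetical parser for a NumPy-style dtype string, e.g. "<i8":
# byte order, then a base-type code, then a byte count.
parse_dtype_string <- function(dtype) {
  list(
    endian    = substr(dtype, 1, 1),                      # "<" = little-endian
    base_type = substr(dtype, 2, 2),                      # "i" = signed integer
    nbytes    = as.integer(substr(dtype, 3, nchar(dtype)))
  )
}

parse_dtype_string("<i8")
# $endian "<", $base_type "i", $nbytes 8
```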

I'm working on a hacky short-term workaround and can add anything useful to this issue. I can see a few possible package changes, but I'm not sure of your preference for how to handle this sort of thing going forward (especially as someone who doesn't use Zarr much outside of this application and has no sense of how widespread these nonconforming metadata fields are likely to be).

daauerbach commented 1 day ago

Update - after much hacking around, I noticed that my version of `read_array_metadata` (reinstalled this morning via BiocManager, but ???) was missing the `.parse_datatype` line. After a `remotes::install_github(repo = "grimbough/Rarr")` (with various dependencies built from source), I now see that line, but it didn't resolve things.

I did finally manage to get a correctly read vector of values for this example array, and (I think) for the encompassing problem of the streamflow data that I'm really trying to reach (https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/CONUS/zarr/chrtout.zarr/streamflow/), despite that having "Data Type: int32". But I'd be happy to avoid my kludgy fix.
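For what it's worth, the gist of my kludge is a normalisation step like the following -- a hedged sketch, not the actual patch. `normalise_metadata` is my own name; in practice `parse_fn` would be `Rarr:::.parse_datatype`, but a trivial stand-in keeps the example self-contained:

```r
# If the metadata only carries a raw `dtype` string, derive the parsed
# `datatype` field that the chunk-size logic expects.
normalise_metadata <- function(metadata,
                               parse_fn = function(d) list(raw = d)) {
  if (is.null(metadata$datatype) && !is.null(metadata$dtype)) {
    metadata$datatype <- parse_fn(metadata$dtype)
  }
  metadata
}

m <- normalise_metadata(list(dtype = "<i8"))
# m$datatype is now populated from m$dtype
```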

Moving quickly, and not elegantly or generically, I ended up needing to write "my_" versions of `read_data`, `.extract_elements`, `read_chunk`, and ultimately even `.format_chunk`. I'm not sure how broadly relevant my fixes are, so no PR, but hopefully this helps if you decide any of this warrants package revisions.

Stepping through `read_zarr_array` to get numbers coming back, I:

HTH?