arrays with nonconforming metadata fields? (causing `.decompress_chunk` error)

grimbough / Rarr

A simple native R reader for Zarr Arrays

MIT License

34 stars 5 forks source link

first, thank you for this package @grimbough!

[EDIT - see below update. Some of my guessing here is off a little, but the issue is still valid]

I'm getting

"Error in .decompress_chunk(compressed_chunk, metadata) : zstd decompression error - error code: -70 (Destination buffer is too small)"

But I think this is due to NULL getting assigned to the datatype which then breaks the switch call in get_chunk_size in the following buffersize declaration. That in turn breaks the decompression .Call("decompress_chunk_ZSTD")

Assuming that's all correct, it looks like Rarr:::read_array_metadata and underlying .parse_datatype are ultimately where things start. Hopefully this reproduces for you:

failing as-is array_path <- "https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/CONUS/zarr/chrtout.zarr/feature_id" metadata <- Rarr:::read_array_metadata(array_path) #decompressor <- metadata$compressor$id # decompressor == "zstd", previously/above 'lz4'

fails: NULL, key is '$dtype', value is '<i8' which breaks get_chunk_size() datatype <- metadata$datatype

still wrong, returns string "<i8" datatype <- metadata$dtype

should be list from datatype <- Rarr:::.parse_datatype(metadata$dtype) which allows declaration buffer_size <- Rarr:::get_chunk_size(datatype, dimensions = metadata$chunks)

I'm working on a hacky short term work around and can add to this issue anything useful, but I can see a few possibilities for package changes and I'm not sure of your preference(s) for how to handle this sort of thing going forward (especially as someone who doesn't use zarr much outside of this application and has no sense of how widespread this is likely to be).

Update - after much hacking around I noticed that my version of read_array_metadata (reinstalled this morning from BiocManager but ???) was missing the .parse_datatype line. With a remotes::install_github(repo = "grimbough/Rarr") and (various dependencies built from source), I now see that line, but it didn't resolve things.

I did finally manage to reach a correctly read vector of values for this example array, and (I think) for the encompassing problem of the streamflow data that I'm truly trying reach (https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/CONUS/zarr/chrtout.zarr/streamflow/) despite that having "Data Type: int32". But I'd be happy to avoid my kludgy fix.

Moving quickly and not elegantly or generically, I ended up needing to write "my_" versions of read_data, .extract_elements, 'read_chunkand ultimately even.format_chunk`. I'm not sure how broadly relevant my fixes were, so no PR, but hopefully this helps if you decide any of this warrants package revisions.

Stepping through read_zarr_array, to get to numbers coming back, I:

coerced metadata$datatype$base_type <- "float" after declaring metadata; this fixes the "buffer size too small" from "<i8"
then declared my_read_data at the res <- read_data call to overwrite the FUN = .extract_elements, in the chunk_selections <- lapply declaration with my_extract_elements which other than some namespace additions was to allow a my_read_chunk in the chunk <- Rarr:::read_chunk declaration
in my_read_chunk I needed
- chunk_file <- paste0(zarr_array_path,"/", chunk_id) to fix what was otherwise a wrongly constructed s3_path ("feature_id0") and
- a my_format_chunk at the converted_chunk declaration to deal with values like "4.990063e-322" that I assume are somehow being generated in the .Call, possibly with some sprintf thrown in? Anyway, my not at all good solution was just to overwrite the output_type <- switch declaration to make float = 1L

HTH?

grimbough / Rarr

arrays with nonconforming metadata fields? (causing `.decompress_chunk` error) #10