saalfeldlab / n5-zarr

Zarr filesystem backend for N5
BSD 2-Clause "Simplified" License

Add Zstandard #35

Closed mkitti closed 10 months ago

mkitti commented 10 months ago

Add Zstandard dependency

        <dependency>
            <groupId>org.janelia</groupId>
            <artifactId>n5-zstandard</artifactId>
            <version>1.0.1</version>
        </dependency>
bogovicj commented 10 months ago

Turns out the unit tests are not quite thorough enough :-/

@mkitti could you please merge https://github.com/saalfeldlab/n5-zarr/commits/zstandard/ into your branch?

I'm also running into an issue when using zarr-python to read data written by n5-zarr.

example

Write the data:

final String root = "...";
final N5Writer zarr = new N5Factory().openWriter( root );
final String dset = "simple-zst";
ArrayImg<UnsignedByteType, ByteArray> img = ArrayImgs.unsignedBytes(new byte[]{0,1,2,3,4,5,6,7,8,9,10,11}, 12);
N5Utils.save(img, zarr, dset, new int[]{12}, new ZstandardCompression());

Read the data:

import zarr
root = zarr.open('zstd-test.zarr')
arr = root['n5-test/simple-zst']
arr[:]
results in this error:

```
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 4
      2 root = zarr.open('zstd-test.zarr')
      3 arr = root['n5-test/simple-zst']
----> 4 arr[:]

File ~/.local/lib/python3.10/site-packages/zarr/core.py:844, in Array.__getitem__(self, selection)
    842     result = self.get_orthogonal_selection(pure_selection, fields=fields)
    843 else:
--> 844     result = self.get_basic_selection(pure_selection, fields=fields)
    845 return result

File ~/.local/lib/python3.10/site-packages/zarr/core.py:970, in Array.get_basic_selection(self, selection, out, fields)
    968     return self._get_basic_selection_zd(selection=selection, out=out, fields=fields)
    969 else:
--> 970     return self._get_basic_selection_nd(selection=selection, out=out, fields=fields)

File ~/.local/lib/python3.10/site-packages/zarr/core.py:1012, in Array._get_basic_selection_nd(self, selection, out, fields)
   1006 def _get_basic_selection_nd(self, selection, out=None, fields=None):
   1007     # implementation of basic selection for array with at least one dimension
   1008
   1009     # setup indexer
   1010     indexer = BasicIndexer(selection, self)
-> 1012     return self._get_selection(indexer=indexer, out=out, fields=fields)

File ~/.local/lib/python3.10/site-packages/zarr/core.py:1388, in Array._get_selection(self, indexer, out, fields)
   1385 if math.prod(out_shape) > 0:
   1386     # allow storage to get multiple items at once
   1387     lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)
-> 1388     self._chunk_getitems(
   1389         lchunk_coords,
   1390         lchunk_selection,
   1391         out,
   1392         lout_selection,
   1393         drop_axes=indexer.drop_axes,
   1394         fields=fields,
   1395     )
   1396 if out.shape:
   1397     return out

File ~/.local/lib/python3.10/site-packages/zarr/core.py:2228, in Array._chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection, drop_axes, fields)
   2226 for ckey, chunk_select, out_select in zip(ckeys, lchunk_selection, lout_selection):
   2227     if ckey in cdatas:
-> 2228         self._process_chunk(
   2229             out,
   2230             cdatas[ckey],
   2231             chunk_select,
   2232             drop_axes,
   2233             out_is_ndarray,
   2234             fields,
   2235             out_select,
   2236             partial_read_decode=partial_read_decode,
   2237         )
   2238     else:
   2239         # check exception type
   2240         if self._fill_value is not None:

File ~/.local/lib/python3.10/site-packages/zarr/core.py:2098, in Array._process_chunk(self, out, cdata, chunk_selection, drop_axes, out_is_ndarray, fields, out_selection, partial_read_decode)
   2096     if isinstance(cdata, PartialReadBuffer):
   2097         cdata = cdata.read_full()
-> 2098     self._compressor.decode(cdata, dest)
   2099 else:
   2100     if isinstance(cdata, UncompressedPartialReadBufferV3):

File numcodecs/zstd.pyx:219, in numcodecs.zstd.Zstd.decode()

File numcodecs/zstd.pyx:153, in numcodecs.zstd.decompress()

RuntimeError: Zstd decompression error: invalid input data
```

mkitti commented 10 months ago

Where does N5Factory().openWriter( root ) come from? I don't see that method in N5Utils.

mkitti commented 10 months ago

This seems to be a bug in zarr-developers/numcodecs. There they use the C function ZSTD_getDecompressedSize:

https://github.com/zarr-developers/numcodecs/blob/366318f3b82403fe56db5ae647f8747e7a4aaf38/numcodecs/zstd.pyx#L151C21-L153

According to the Zstandard manual, that routine is now deprecated. One issue with it is that it returns 0 whether the content is empty, its size is unknown, or an error has occurred.

https://facebook.github.io/zstd/zstd_manual.html

The numcodecs bug is that they assume a value of 0 means error. In this case, it actually means unknown. I know this because I called ZSTD_getFrameContentSize instead, and it returns 0xffffffffffffffff, which is ZSTD_CONTENTSIZE_UNKNOWN.
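
To make the ambiguity concrete, here is a small Python model of the two return conventions. It is only a sketch: the sentinel constants mirror the values defined in zstd.h, while `describe_content_size` and `legacy_decompressed_size` are hypothetical helper names used for illustration, not functions from numcodecs or libzstd.

```python
# Sentinel values that ZSTD_getFrameContentSize can return (as defined in zstd.h).
ZSTD_CONTENTSIZE_UNKNOWN = 2**64 - 1  # 0xffffffffffffffff
ZSTD_CONTENTSIZE_ERROR = 2**64 - 2    # 0xfffffffffffffffe


def describe_content_size(value: int) -> str:
    """Classify a ZSTD_getFrameContentSize result (hypothetical helper)."""
    if value == ZSTD_CONTENTSIZE_ERROR:
        return "error"
    if value == ZSTD_CONTENTSIZE_UNKNOWN:
        return "unknown"
    if value == 0:
        return "empty"
    return "known"


def legacy_decompressed_size(value: int) -> int:
    """Model of the deprecated ZSTD_getDecompressedSize convention:
    empty, unknown, and error all collapse to 0 (hypothetical helper)."""
    if value in (ZSTD_CONTENTSIZE_UNKNOWN, ZSTD_CONTENTSIZE_ERROR):
        return 0
    return value


if __name__ == "__main__":
    for v in (0, 12, ZSTD_CONTENTSIZE_UNKNOWN, ZSTD_CONTENTSIZE_ERROR):
        print(hex(v), describe_content_size(v), legacy_decompressed_size(v))
```

With the newer convention a decoder can tell "unknown" apart from "error" and fall back to streaming decompression; with the legacy convention all three cases look the same.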

mkitti commented 10 months ago

Here's what my current test class looks like:

package org.janelia.saalfeldlab.n5.zarr;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.janelia.saalfeldlab.n5.N5Writer;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;
import org.janelia.scicomp.n5.zstandard.ZstandardCompression;
import org.junit.Test;

import com.github.luben.zstd.Zstd;

import net.imglib2.img.array.ArrayImg;
import net.imglib2.img.array.ArrayImgs;
import net.imglib2.img.basictypeaccess.array.ByteArray;
import net.imglib2.type.numeric.integer.UnsignedByteType;

public class ZstandardTest {

    @Test
    public void testZstandard() throws IOException {
        final String root = "/home/mkitti/eclipse-workspace/n5-zarr/test.zarr";
        final N5Writer zarr = new N5ZarrWriter(root);
        final String dset = "simple-zst";
        final byte[] bytes = new byte[1024*1024];
        for(int i=0; i < bytes.length; ++i) {
            bytes[i] = (byte)(i*5-128);
        }
        //bytes = new byte[]{0,1,2,3,4,5,6,7,8,9,10,11};
        ArrayImg<UnsignedByteType, ByteArray> img = ArrayImgs.unsignedBytes(bytes, bytes.length);
        ZstandardCompression compressor = new ZstandardCompression();
        compressor.setSetCloseFrameOnFlush(true);
        N5Utils.save(img, zarr, dset, new int[]{1024}, compressor);

        byte[] compressedBytes = Files.readAllBytes(Paths.get(root, dset, "0"));
        System.out.println(Zstd.getFrameContentSize(compressedBytes));
    }
}
mkitti commented 10 months ago

Basically the problem is that at the time the Zstandard frame header is written, the compressor does not seem to know the size of the input. Thus it marks the content size as unknown. numcodecs does not know what to do with an unknown size.

Rather than using the stream API we may need to use a buffer API.

To address the issue more specifically, we may need to use setPledgedSrcSize: https://www.javadoc.io/doc/com.github.luben/zstd-jni/latest/com/github/luben/zstd/ZstdCompressCtx.html
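
For reference, whether a frame declares its content size can be checked by reading the frame header directly. The following is a minimal Python sketch, assuming the frame layout described in the Zstandard format specification (RFC 8878); `frame_content_size` is a hypothetical helper written for this illustration, not an API of numcodecs or zstd-jni, and it ignores skippable frames and dictionaries beyond skipping the Dictionary_ID field.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # magic number 0xFD2FB528, little-endian


def frame_content_size(frame: bytes):
    """Return the content size declared in a Zstandard frame header,
    or None if the header does not declare one (size unknown)."""
    if len(frame) < 5 or frame[:4] != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")

    descriptor = frame[4]
    fcs_flag = descriptor >> 6              # Frame_Content_Size_flag (bits 7-6)
    single_segment = (descriptor >> 5) & 1  # Single_Segment_flag (bit 5)
    dict_id_flag = descriptor & 3           # Dictionary_ID_flag (bits 1-0)

    # No size field at all: this is the "unknown" case a streaming
    # compressor produces when it has not been told the input size.
    if fcs_flag == 0 and not single_segment:
        return None

    offset = 5
    if not single_segment:
        offset += 1                         # skip Window_Descriptor
    offset += (0, 1, 2, 4)[dict_id_flag]    # skip Dictionary_ID

    fcs_bytes = (1, 2, 4, 8)[fcs_flag]
    field = frame[offset:offset + fcs_bytes]
    if len(field) < fcs_bytes:
        raise ValueError("truncated frame header")

    value = int.from_bytes(field, "little")
    if fcs_flag == 1:
        value += 256                        # 2-byte sizes are stored minus 256
    return value
```

Applied to the bytes of a chunk file, a result of `None` would correspond to the ZSTD_CONTENTSIZE_UNKNOWN case, while a declared size of 0 is a genuinely empty frame.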

bogovicj commented 10 months ago

thanks for investigating @mkitti

mkitti commented 10 months ago

This PR to n5-zstandard fixes the issue for me.

https://github.com/JaneliaSciComp/n5-zstandard/pull/3

bogovicj commented 10 months ago

N5Factory().openWriter( root )

comes from n5-universe. The tests I was running involved adding Zstandard compression to the list of options in the ImageJ export plugin in n5-ij: https://github.com/saalfeldlab/n5-ij

mkitti commented 10 months ago

@bogovicj I updated n5-zstandard to version 1.0.2