kiyo-masui / bitshuffle

Filter for improving compression of typed binary data.
Other
219 stars 76 forks source link

ZTSD support #95

Closed james-s-willis closed 2 years ago

james-s-willis commented 3 years ago

I've added support for ZSTD. I think I have included changes in the correct place. Could you take a look and let me know?

Also, for testing it I used bitshuffle/tests/test_h5filter.py and filter_opts=((0, 3),), where 3 should be the ZSTD compression. Is that the best way to test this kind of change?

kiyo-masui commented 3 years ago

This was remarkably quick.

Everything here looks correct to me.

I noticed that my continuous integration checks haven't popped up in the PR. I wonder why that is. In any case, it seemed to not build on travis (https://travis-ci.org/github/kiyo-masui/bitshuffle/jobs/769740650). I think CI was already broken, but it seems to be more broken now.

The other thing that needs to be added are tests. Should be straight forward to duplicate all the unit tests that are done for LZ4 to ZSTD. For regression tests we generate a small amount of test data that gets added directly to the repository. There is a script in the tests/ directory that should be easy to adapt.

jrs65 commented 3 years ago

Agreed. That was remarkably quick. Nice work!

One thing that seems like it is missing is the ability to tune the amount of compression zstd applies. That's a feature in zstd itself and should get exposed here.

I'll have a more thorough look through later on today.

james-s-willis commented 3 years ago

Thanks!

I have been struggling to expose the compression level for ZSTD due to the bshuf_blocked_wrap_fun function. I have had to add const int comp_lvl as an extra argument to quite a few functions, when it's not going to be used in most of them. I'm not sure if this is the best way to do this.

Also, to be able to pass it from kotekan I added it as:const int comp_lvl = cd_values[5]; to bshuf_h5_filter. Is this also right?

Regarding the unit tests how do I select ZSTD as the compression library? Can I just use: filter_opts=((0, h5.H5_COMPRESS_ZSTD),),?

kiyo-masui commented 3 years ago

Yeah, that’s a bummer about bshuf_blocked_wrap_fun... I think adding an extra argument is the cleanest way. It is okay to change the call signature since that isn’t exposed in the .h file. I would call it something more generic than comp_lvl, maybe just option in case there are other functions with integer options we want to wrap in the future.

Yes, I think the new option should go in the cd_values array.

Yes, those filter options should work.

On May 7, 2021, at 4:24 PM, james-s-willis @.***> wrote:



Thanks!

I have been struggling to expose the compression level for ZSTD due to the bshuf_blocked_wrap_fun function. I have had to add const int comp_lvl as an extra argument to quite a few functions, when it's not going to be used in most of them. I'm not sure if this is the best way to do this.

Also, to be able to pass it from kotekan I added it as:const int comp_lvl = cd_values[5]; to bshuf_h5_filter. Is this also right?

Regarding the unit tests how do I select ZSTD as the compression library? Can I just use: filter_opts=((0, h5.H5_COMPRESS_ZSTD),),?

— You are receiving this because your review was requested. Reply to this email directly, view it on GitHubhttps://github.com/kiyo-masui/bitshuffle/pull/95#issuecomment-834751173, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AADBPH2KEDU4225XVGUD4VDTMRD6ZANCNFSM44HWB2JA.

james-s-willis commented 3 years ago

@kiyo-masui @jrs65 I've been struggling to run the test_h5filter.py after I've installed bitshuffle with the new ZSTD extension:

[willis@magenta bitshuffle]$ python bitshuffle/tests/test_h5filter.py 
Traceback (most recent call last):
  File "bitshuffle/tests/test_h5filter.py", line 12, in <module>
    from bitshuffle import h5
  File "/usr/local/lib/python2.7/dist-packages/bitshuffle/__init__.py", line 22, in <module>
    from bitshuffle.ext import (__version__, bitshuffle, bitunshuffle, using_NEON, using_SSE2,
ImportError: /usr/local/lib/python2.7/dist-packages/bitshuffle/ext.so: undefined symbol: ZSTD_compressBlock_fast

from this it looks like the ZSTD_compressBlock_fast hasn't been compiled into the source. So I added the source file to setup.py and reinstalled bitshuffle:

+zstd_sources = ["zstd/compress/zstd_compress.c",
+                "zstd/compress/zstd_fast.c",
+                "zstd/compress/zstd_double_fast.c",
+                "zstd/decompress/zstd_decompress.c"
+                ] 
+
+zstd_headers = ["zstd/zstd.h",
+                "zstd/compress/zstd_fast.h"
+                ]

 ext_bshuf = Extension(
     "bitshuffle.ext",
     sources=["bitshuffle/ext.pyx", "src/bitshuffle.c",
              "src/bitshuffle_core.c", "src/iochain.c",
-             "lz4/lz4.c", "zstd/compress/zstd_compress.c",
-             "zstd/decompress/zstd_decompress.c"],
+             "lz4/lz4.c"] + zstd_sources,
     include_dirs=["src/", "lz4/", "zstd/"],
     depends=["src/bitshuffle.h", "src/bitshuffle_core.h",
-             "src/iochain.h", "lz4/lz4.h", "zstd/zstd.h"],
+             "src/iochain.h", "lz4/lz4.h"] + zstd_headers,
     libraries=[],
     define_macros=MACROS,
 )
@@ -124,12 +134,10 @@ h5filter = Extension(
     "bitshuffle.h5",
     sources=["bitshuffle/h5.pyx", "src/bshuf_h5filter.c",
              "src/bitshuffle.c", "src/bitshuffle_core.c",
-             "src/iochain.c", "lz4/lz4.c", 
-             "zstd/compress/zstd_compress.c",
-             "zstd/decompress/zstd_decompress.c"],
+             "src/iochain.c", "lz4/lz4.c"] + zstd_sources,
     depends=["src/bitshuffle.h", "src/bitshuffle_core.h",
              "src/iochain.h", "src/bshuf_h5filter.h",
-             "lz4/lz4.h", "zstd/zstd.h"],
+             "lz4/lz4.h"] + zstd_headers,
     define_macros=MACROS,
     **pkgconfig("hdf5", config=dict(
         include_dirs=["src/", "lz4/", "zstd/"]))
@@ -139,12 +147,10 @@ filter_plugin = Extension(
     "bitshuffle.plugin.libh5bshuf",
     sources=["src/bshuf_h5plugin.c", "src/bshuf_h5filter.c",
              "src/bitshuffle.c", "src/bitshuffle_core.c",
-             "src/iochain.c", "lz4/lz4.c",
-             "zstd/compress/zstd_compress.c",
-             "zstd/decompress/zstd_decompress.c"],
+             "src/iochain.c", "lz4/lz4.c"] + zstd_sources,
     depends=["src/bitshuffle.h", "src/bitshuffle_core.h",
              "src/iochain.h", 'src/bshuf_h5filter.h',
-             "lz4/lz4.h", "zstd/zstd.h"],
+             "lz4/lz4.h"] + zstd_headers,

but I run into the same problem again. Could it actually be that bitshuffle is not installed if there are no changes to source files? I use this command: sudo python setup.py --verbose install --h5plugin --h5plugin-dir=/usr/local/hdf5/lib/plugin

james-s-willis commented 3 years ago

Oh I pushed those changes to the setup.py including the extra source files, now travis gets further and fails on:

nosetests -v bitshuffle

Failure: ImportError (/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/bitshuffle/ext.cpython-36m-x86_64-linux-gnu.so: undefined symbol: ZSTD_compressBlock_greedy) ... ERROR

so it must be a problem with my local python setup.

james-s-willis commented 3 years ago

@jrs65 @kiyo-masui, I think this is ready to be looked at now. I have added ZSTD to the unit tests and fixed an issue with the __version__ string. I apologise for the 80 new files... but they are mostly from the ZSTD library. I have trimmed that down to common/, compress/ and decompress/ folders which is all we need from it.

The tests pass, but I will need to work with Arash to make sure everything is okay. I will look into adding ZSTD into kotekan also.

kiyo-masui commented 3 years ago

Please undelete the old regression test data and make sure the current tests still test against that data. The whole idea of the regression tests is to make sure newer versions of bit shuffle can still decompress data created with previous versions.

Also, instead of having separate files for regression testing lz4 and zst, you should just add another dataset to the 0.3.6 regression test data that is 1st compressed.

kiyo-masui commented 3 years ago

Could you also augment the README to include the new functionality?

jrs65 commented 3 years ago

@james-s-willis would it be worth just putting the zstd stuff into a git submodule? You could also consider making it optional if it really kills the build times.

kiyo-masui commented 3 years ago

does it kill the build times?

james-s-willis commented 3 years ago

I've updated the tests from your comments, @kiyo-masui and updated the README.

The build now takes 1m41.844s with ZSTD.

Also, how do I update the CI config? It fails with a missing python package (packaging), which I added for the regression test.

jrs65 commented 3 years ago

@james-s-willis as for the CI build I think the best thing would be to try and get #93 merged first.

james-s-willis commented 3 years ago

@jrs65 I've merged the github actions changes into this branch now and got it to pass the CI. Is this good to merge now?

jrs65 commented 3 years ago

@james-s-willis can you investigate whether using a git submodule for zstd makes sense? I think when this started I didn't appreciate that zstd was going to add 70 files. Otherwise, I'll have a look through the changes now.

james-s-willis commented 3 years ago

I've created another branch: https://github.com/kiyo-masui/bitshuffle/tree/jsw/zstd-sub that adds ZSTD as a submodule. It was simple enough the only extra step you need is to run: git submodule update --init if you're installing with ZSTD support. It reduces the total no. of extra files, but you get the whole of the ZSTD repo when you initialise the submodule. In the other branch I had only picked ZSTD files that we were using.

jrs65 commented 3 years ago

@james-s-willis which do you think is the right thing to do? Also @kiyo-masui maybe you care here.

kiyo-masui commented 3 years ago

I'll defer to the judgement of you two on this one.

I guess when uploading to Pypi, you would include the ZSTD files?

james-s-willis commented 3 years ago

@jrs65 I think we should just stick with this branch. As we've trimmed zstd to what we need and it is fixed at v1.4.9, so we know it works. Unless we want the latest version of zstd?

jrs65 commented 3 years ago

@james-s-willis what's the difference in file numbers? If it's not that many, on reflection I think the submodule is the better option. It makes it far easier to update to revised versions of zstd in the future.