nanoporetech / ont_fast5_api

Oxford Nanopore Technologies fast5 API software

Decompression #44

Closed jamez-eh closed 3 years ago

jamez-eh commented 4 years ago

Hi, I want to compress a large number of files, so I am looking to validate each compression. I would like to be able to compress and decompress each file so I can check the fidelity and catch any errors. Is this possible?

I have tried using compress_fast5 to go from raw -> VBZ -> gzip, and separately from raw -> gzip, then comparing the two gzips (this would be good enough), but they produce different-sized files.

Is there any way to reverse the VBZ compression and get the original file back?

Thanks, James

fbrennen commented 4 years ago

Hi @jamez-eh -- there isn't a direct way to de-compress files at the moment, but it's something we've put a bit of work into recently. We'll finish this off and make that available as an option.

Have you tried hdf5 tools like h5diff to compare files to each other? All of our compression algorithms are lossless so you should have literally the same data regardless of how you compress them.

jamez-eh commented 3 years ago

I am concerned about I/O errors when potentially compressing petabytes of data. Normally, I would just reverse the compression and check the hash values, but that is obviously not possible here. h5diff checks the values held within objects, and when comparing a VBZ-compressed object to a GZIP-compressed one we get differences. At first I thought I could just change it back to GZIP and run h5diff, but that has turned out to be problematic across differing levels of GZIP compression.
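
As an aside, one way to get a checksum of the logical contents rather than the container files is to dump a dataset to raw binary and hash that, so the result does not depend on the on-disk compression. A rough sketch only; the read ID and directory names are placeholders, and reading a vbz-compressed dataset still needs the vbz filter plugin discussed further down the thread:

# Sketch: dump the same Signal dataset from each copy as raw little-endian
# binary, then compare hashes. <read_id> is a placeholder for a real read ID.
h5dump -d /read_<read_id>/Raw/Signal -b LE -o signal_gzip.bin gzip_compressed/batch_0.fast5
h5dump -d /read_<read_id>/Raw/Signal -b LE -o signal_vbz.bin vbz_compressed/batch_0.fast5
md5sum signal_gzip.bin signal_vbz.bin   # matching hashes => identical signal data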

fbrennen commented 3 years ago

Hi @jamez-eh -- I've run a few tests and I'm able to use h5diff to verify that gzip-compressed and vbz-compressed files are identical. What are you doing when you see differences between them?

This is what I did:

$ compress_fast5 -i multi_read/ -s gzip_compressed/ -c gzip
$ compress_fast5 -i multi_read/ -s vbz_compressed/ -c vbz
$ h5diff multi_read/batch_0.fast5 gzip_compressed/batch_0.fast5
$ h5diff multi_read/batch_0.fast5 vbz_compressed/batch_0.fast5
$ h5diff gzip_compressed/batch_0.fast5 vbz_compressed/batch_0.fast5
$ compress_fast5 -i vbz_compressed/ -s vbz_then_gzip_compressed/ -c gzip
$ h5diff gzip_compressed/batch_0.fast5 vbz_then_gzip_compressed/batch_0.fast5
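
To check a whole batch rather than a single file, one option (not part of the commands above) is to rely on h5diff's exit code: 0 means no differences, 1 means differences were found, and 2 means an error such as a missing filter plugin. A rough sketch with placeholder directory names:

# Sketch: compare each gzip-compressed file against its vbz-compressed
# counterpart and log any file that differs or fails to compare.
for f in gzip_compressed/*.fast5; do
    h5diff -q "$f" "vbz_compressed/$(basename "$f")" \
        || echo "MISMATCH or error: $(basename "$f")" >> compare_failures.log
done
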
jamez-eh commented 3 years ago

Thank you for getting back to me.

That is actually similar to my experience. The problem, though, is that for those h5diff commands I get a non-zero exit code. If I instead run the compression twice and compare what should be two identical files, with verbosity turned on:

compress_fast5 -i multi_read/ -s vbz_compressed_1/ -c vbz
compress_fast5 -i multi_read/ -s vbz_compressed_2/ -c vbz
h5diff -c -v0 vbz_compressed_1/batch_0.fast5 vbz_compressed_2/batch_0.fast5

I get a long list of warning messages and an exit code of 2:

  Warning: dataset </read_ffc74b6c-4994-46da-a010-8cf89b9c971c/Raw/Signal> cannot be read, user defined filter is not available
  Warning: dataset </read_ffe6e742-d3f8-4f50-bc00-98a3d7185f1b/Raw/Signal> cannot be read, user defined filter is not available
  Warning: dataset </read_ffebdc67-ad36-4f3d-9b24-e084b00c2df6/Raw/Signal> cannot be read, user defined filter is not available

It is clear that the signals are not actually being compared here, because I don't have the correct user-defined filters set up. Did you set those up?

Investigating the fast5 files further with h5dump, I can see that these datasets have a vbz filter that I don't have configured:

                  FILTER_ID 32020
                  COMMENT vbz
                  PARAMS { 1 2 1 1 }
               }

Is this something that you configured or could you point me in the direction of how to do this?

fbrennen commented 3 years ago

Ah, I see! vbz is a custom compression filter that we built to encode the nanopore signal more efficiently. You can get the plugin here: https://github.com/nanoporetech/vbz_compression/
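
A rough sketch of the setup, assuming the plugin has been installed from that repository's releases (the path below is illustrative; use wherever libvbz_hdf_plugin.so actually lives on your system):

# Point the stock HDF5 command-line tools at the directory containing the
# vbz plugin, then h5diff/h5dump can read vbz-compressed Signal datasets.
export HDF5_PLUGIN_PATH=/usr/local/hdf5/lib/plugin
h5diff vbz_compressed_1/batch_0.fast5 vbz_compressed_2/batch_0.fast5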

jamez-eh commented 3 years ago

I have attempted to follow these instructions and install the compression filter, but I am having the same problems that appeared here: https://github.com/nanoporetech/vbz_compression/issues/1

I think my hdf5 tools do not identify the plugin despite this. I have set $HDF5_PLUGIN_PATH, and when I run ls $HDF5_PLUGIN_PATH I see libvbz_hdf_plugin.so. It seems there is very little documentation out there about how to configure hdf5 plugins in general, let alone this specific one. Any advice would be greatly appreciated; I am getting nowhere with this.

Do I need a specific version of the hdf5 tools? compress_fast5 works as indicated, but if I want to use h5repack I'm out of luck?

Thanks again

fbrennen commented 3 years ago

I would suggest not bothering with h5repack -- in normal reads the space savings you'll get using vbz come from compressing the Signal dataset (the raw nanopore squiggle), and compress_fast5 will handle that for you.

I completely agree with you about hdf documentation -- we had a fair bit of trouble ourselves getting everything to work properly. I do think what you have done should work, so it's possible the version of hdf5 you have is either buggy or just too old -- could you try upgrading it? For comparison, I'm using HDF5 1.8.16 (a fairly old version) on Ubuntu 16 and setting HDF5_PLUGIN_PATH allows me to view vbz-compressed datasets just fine.
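
A quick way to confirm the tools can actually see the plugin (a sketch; the read ID, path, and filename are placeholders):

# If the plugin is found, this prints signal values; if not, h5dump emits the
# same "user defined filter is not available" warning as above.
export HDF5_PLUGIN_PATH=/path/to/dir/containing/libvbz_hdf_plugin.so
h5dump -d /read_<read_id>/Raw/Signal vbz_compressed/batch_0.fast5 | head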

mbhall88 commented 2 years ago

I found that the only way I could get a decompressed fast5 file was converting the fast5 file to slow5 and then converting back to fast5.
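
That round trip looks roughly like this with slow5tools (a sketch only; directory names are placeholders and the exact flags should be checked against your slow5tools version):

# fast5 -> (b)slow5 -> fast5; the fast5s written back out no longer use the vbz filter.
slow5tools f2s vbz_fast5/ -d blow5_dir/
slow5tools s2f blow5_dir/ -d roundtrip_fast5/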

It seems odd to provide compression of fast5 files without also providing decompression...

Maybe this issue should remain open?

fbrennen commented 2 years ago

Hi @mbhall88 -- out of curiosity, why do you want to decompress your files?

mbhall88 commented 2 years ago

Hey @fbrennen. I was having some problems with a tombo command (resquiggle) not liking my vbz-compressed fast5s. In the end, I actually switched the compression to gzip and it worked. But I used a small batch of uncompressed files to test that it was indeed a compression problem.
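
For reference, switching the compression back is the same compress_fast5 invocation shown earlier in the thread, just run over the vbz-compressed files (directory names are placeholders):

# Sketch: re-compress vbz fast5s with gzip for tools that can't load the vbz plugin.
compress_fast5 -i vbz_fast5s/ -s gzip_fast5s/ -c gzip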