scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Investigate lzf compression as default. #123

Open LuckyMD opened 5 years ago

LuckyMD commented 5 years ago

Hi,

I just updated Scanpy and AnnData from versions 1.3.2 and 0.6.11 to the latest GitHub commits. When running the same notebook (specifically the command sc.read(filename, cache=True)), I got an error about not having sufficient space to store the cache files. This seems to have been a Jupyter limitation, and rerunning the notebook made it work fine. However, I noticed that the size of the cache files has increased from ~6.5-8.5 MB to 27-33 MB. Is this intentional, and is there a reason for it? These files were created by more or less exactly the same code, namely the case study tutorial here. I haven't tested yet whether I can use the smaller cache files to load the same data in the new scanpy/anndata versions, but I could try this if you're not sure why this happened.
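A minimal sketch of how such a check can be reproduced, assuming scanpy's default cache directory (sc.settings.cachedir, ./cache/ by default) and a placeholder input path from the tutorial:

from pathlib import Path

import scanpy as sc

# Read with caching enabled; scanpy writes an .h5ad cache file on first read.
adata = sc.read("data/filtered_gene_bc_matrices/hg19/matrix.mtx", cache=True)

# Inspect the size of the generated cache files.
for cache_file in Path(sc.settings.cachedir).glob("*.h5ad"):
    print(cache_file, f"{cache_file.stat().st_size / 1e6:.1f} MB")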

flying-sheep commented 5 years ago

Hi! Can you give us an overview of what's in the old object vs the new one?

For each of X, layers, obs[*], obsm[*], var[*], varm[*], uns[*][*]..., please note the following wherever there are differences:

That'll help us narrow it down. A notebook is not a good minimal reproducible example.

falexwolf commented 5 years ago

It's in the release notes of Scanpy and AnnData: https://scanpy.readthedocs.io/en/latest/#version-1-4-february-5-2019 and https://anndata.readthedocs.io/en/latest/

By default, we don't compress anymore to increase read and write speed.
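For completeness, compression can still be requested per call when writing; a minimal sketch, with a placeholder file name:

import scanpy as sc

adata = sc.read("mydata.h5ad")  # placeholder file name

# Opt back into compression explicitly; this trades write/read speed for size.
adata.write("mydata_compressed.h5ad", compression="gzip")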

LuckyMD commented 5 years ago

I wasn't trying to give a minimal reproducible example, as that would require different scanpy and anndata versions... I am just talking about the cache files generated automatically when using sc.read(filename, cache=True). There is no data stored in there other than what is loaded in the first instance, without any processing. The dataset loaded is exactly the same one, with no .raw attribute and no .obs or .var data other than numerical indices. I just looked into the files again and they essentially contain the same information. For example:

data.X
<27998x2348 sparse matrix of type '<class 'numpy.float32'>'
    with 4109917 stored elements in Compressed Sparse Row format>

is exactly the same for both files. I do find a difference in the output of h5dump -H, namely in how .obs and .var are stored. Here are the relevant parts for the smaller file:

   DATASET "obs" {
      DATATYPE  H5T_COMPOUND {
         H5T_STRING {
            STRSIZE 5;
            STRPAD H5T_STR_NULLPAD;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         } "index";
      }
      DATASPACE  SIMPLE { ( 27998 ) / ( 27998 ) }
   }
   DATASET "var" {
      DATATYPE  H5T_COMPOUND {
         H5T_STRING {
            STRSIZE 4;
            STRPAD H5T_STR_NULLPAD;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         } "index";
      }
      DATASPACE  SIMPLE { ( 2348 ) / ( 2348 ) }
   }

And the bigger file:

   DATASET "obs" {
      DATATYPE  H5T_COMPOUND {
         H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         } "index";
      }
      DATASPACE  SIMPLE { ( 27998 ) / ( 27998 ) }
   }
   DATASET "var" {
      DATATYPE  H5T_COMPOUND {
         H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         } "index";
      }
      DATASPACE  SIMPLE { ( 2348 ) / ( 2348 ) }
   }

Is this something you changed? I just thought I would bring up something that looks like an unintended consequence of another change... or maybe it has to do with HDF5.
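The two h5dump outputs correspond to fixed-length ASCII strings versus variable-length UTF-8 strings. A small h5py sketch that reproduces both layouts (file name and index values are made up):

import h5py
import numpy as np

# A toy string index, standing in for the .obs/.var index above.
index = np.array(["AAACATACAA", "AAACATTGAG", "AAACCGTGCT"])

with h5py.File("strings_demo.h5", "w") as f:
    # Fixed-length bytes: STRSIZE is the fixed byte width, NUL-padded (smaller file).
    f.create_dataset("index_fixed", data=index.astype("S10"))
    # Variable-length UTF-8: STRSIZE H5T_VARIABLE (bigger file).
    f.create_dataset(
        "index_vlen",
        data=index.astype(object),
        dtype=h5py.string_dtype(encoding="utf-8"),
    )

with h5py.File("strings_demo.h5", "r") as f:
    for name in ("index_fixed", "index_vlen"):
        print(name, f[name].dtype)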

falexwolf commented 5 years ago

It's also in the docs of .write: https://anndata.readthedocs.io/en/latest/anndata.AnnData.write.html

If you want a default switch to change the behavior, that's easy to set up. :)
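A hypothetical sketch of what such a default switch could look like (none of these names exist in anndata; it is just the shape of the idea):

# Hypothetical module-level default; not an existing anndata API.
_default_compression = None  # e.g. "lzf" or "gzip"

def set_default_compression(method):
    """Set the compression used when write() is called without an explicit one."""
    global _default_compression
    _default_compression = method

def resolve_compression(explicit=None):
    """Per-call value wins; otherwise fall back to the module default."""
    return explicit if explicit is not None else _default_compression

# Usage: set_default_compression("lzf"); write() would then pass
# resolve_compression(user_value) through to h5py.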

LuckyMD commented 5 years ago

Ah, okay... just wanted to make sure this was intended! Thanks @falexwolf

falexwolf commented 5 years ago

Yes, it was intended. :) I made the change after starting to work with much bigger datasets, which took ages to load when compressed...

flying-sheep commented 5 years ago

Yes, the compression… gzip should usually be very fast, but the docs say it's of “moderate speed”. Maybe their implementation isn't good. However, they also say:

LZF filter ("lzf")

Available with every installation of h5py (C source code also available). Low to moderate compression, very fast. No options.

I think we should switch to that one!

The STRSIZE change should actually reduce file size a little bit.
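For reference, using the lzf filter with h5py needs no extra options or dependencies; a minimal sketch with made-up data:

import h5py
import numpy as np

X = np.random.random((10_000, 1_000)).astype(np.float32)

with h5py.File("lzf_demo.h5", "w") as f:
    # lzf ships with every h5py installation and takes no options.
    f.create_dataset("X", data=X, compression="lzf")

with h5py.File("lzf_demo.h5", "r") as f:
    print(f["X"].compression, f["X"].shape)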

falexwolf commented 5 years ago

Interesting, I didn't know about lzf! Let's first make some public benchmarks for both read and write speed and file size. There are many fans of blosc compression, but that's tricky to install, hence not an option. gzip is definitely much, much slower on large data. I think I once observed 15 min versus 20 sec or so when loading; writing was even worse, with a file taking 30 min to write... It was absolutely prohibitive...
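A rough benchmark sketch along those lines (synthetic dense random data, which compresses far worse than real sparse count matrices, so treat the numbers as relative only):

import time
from pathlib import Path

import h5py
import numpy as np

X = np.random.random((20_000, 2_000)).astype(np.float32)

for compression in (None, "lzf", "gzip"):
    path = Path(f"bench_{compression}.h5")

    start = time.perf_counter()
    with h5py.File(path, "w") as f:
        f.create_dataset("X", data=X, compression=compression)
    t_write = time.perf_counter() - start

    start = time.perf_counter()
    with h5py.File(path, "r") as f:
        _ = f["X"][...]
    t_read = time.perf_counter() - start

    size_mb = path.stat().st_size / 1e6
    print(f"{compression or 'none':>4}: {size_mb:7.1f} MB, "
          f"write {t_write:5.2f} s, read {t_read:5.2f} s")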

flying-sheep commented 5 years ago

gzip is definitely much, much slower on large data

Sure, I believe you. I just said that I assume it's the implementation's fault, not the algorithm's. I think there are much faster gzip implementations out there.

But for us, lzf is probably the only realistic chance apart from uncompressed. And uncompressed is not very practical either.

flying-sheep commented 5 years ago

https://community.centminmod.com/threads/round-3-compression-comparison-benchmarks-zstd-vs-brotli-vs-pigz-vs-bzip2-vs-xz-etc.17259/

flying-sheep commented 5 years ago

Thanks, that option is very useful and works perfectly.

Mini-benchmark on my data (compared to non-compressed):
gzip: 35% memory, 3× runtime
lzf: 66% memory, 2× runtime

— @VolkerBergen, https://github.com/theislab/scanpy/pull/831#issuecomment-531750992

Great, thank you! So you're talking about decompression, right? I'm pretty sure the numbers differ based on the dataset; the HDF5 people said:

In benchmark trials with floating-point data (below), a filter pipeline with LZF typically provides 3×-5× faster compression than DEFLATE, 2× faster decompression, and retains 50%-90% of the DEFLATE compression ratio.

The argument for turning off compressed caches was that gzip compresses super slowly, which would be fixed if the 3×-5× figure holds up.
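If the option referenced above is the cache-compression setting from the linked scanpy PR, enabling lzf for cached reads would then be a one-liner; a sketch, hedged because the exact attribute name should be checked against your scanpy version:

import scanpy as sc

# Assumed setting name from the scanpy PR referenced above; verify in sc.settings.
sc.settings.cache_compression = "lzf"

adata = sc.read("data/filtered_gene_bc_matrices/hg19/matrix.mtx", cache=True)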