Open LuckyMD opened 5 years ago
Hi! Can you give us an overview of what’s in the old object vs. the new one? For each of X, layers, obs[*], obsm[*], var[*], varm[*], uns[*][*]..., the following, wherever there are differences: spmatrix.getnnz()
That’ll help us narrow it down. A notebook is not a good minimal reproducible example.
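One way to produce such an overview is sketched below. Only `getnnz()` is named in the request above; the other summary fields (`type`, `dtype`, `shape`) and the helper names `describe` and `diff_fields` are assumptions made for illustration. The helpers take plain dicts that map a field name to its value, so they do not depend on any particular anndata version.

```python
def describe(value):
    """Summarize one array-like field; attributes that are absent are skipped."""
    summary = {"type": type(value).__name__}
    for attr in ("dtype", "shape"):
        if hasattr(value, attr):
            summary[attr] = str(getattr(value, attr))
    # scipy sparse matrices expose getnnz(); dense arrays do not.
    if callable(getattr(value, "getnnz", None)):
        summary["nnz"] = value.getnnz()
    return summary


def diff_fields(old, new):
    """Return {name: (old_summary, new_summary)} for fields whose summaries differ."""
    differences = {}
    for name in sorted(set(old) | set(new)):
        a = describe(old[name]) if name in old else None
        b = describe(new[name]) if name in new else None
        if a != b:
            differences[name] = (a, b)
    return differences
```

With two loaded objects, the dicts could be built by hand, for example `{"X": old_adata.X}` and `{"X": new_adata.X}`, and extended with whichever obs, var or uns entries exist.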
It's in the release notes of Scanpy and Anndata: https://scanpy.readthedocs.io/en/latest/#version-1-4-february-5-2019 and https://anndata.readthedocs.io/en/latest/
By default, we don't compress anymore to increase read and write speed.
I wasn't trying to give a minimal reproducible example, as that requires different scanpy and anndata versions... I am just talking about the cache files generated automatically when using sc.read(filename, cache=True). There is no data stored in there other than what is loaded in the first instance without any processing. The dataset loaded is exactly the same one with no .raw attributes, no .obs or .var data other than numerical indices. I just looked into the files again and they essentially contain the same information. For example:
data.X
<27998x2348 sparse matrix of type '<class 'numpy.float32'>'
with 4109917 stored elements in Compressed Sparse Row format>
is exactly the same for both files. I find a difference in the output of h5dump -H. The differences are in how .obs and .var are stored. Here are the differing parts for the smaller file:
DATASET "obs" {
DATATYPE H5T_COMPOUND {
H5T_STRING {
STRSIZE 5;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} "index";
}
DATASPACE SIMPLE { ( 27998 ) / ( 27998 ) }
}
DATASET "var" {
DATATYPE H5T_COMPOUND {
H5T_STRING {
STRSIZE 4;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} "index";
}
DATASPACE SIMPLE { ( 2348 ) / ( 2348 ) }
}
And the bigger file:
DATASET "obs" {
DATATYPE H5T_COMPOUND {
H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
} "index";
}
DATASPACE SIMPLE { ( 27998 ) / ( 27998 ) }
}
DATASET "var" {
DATATYPE H5T_COMPOUND {
H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
} "index";
}
DATASPACE SIMPLE { ( 2348 ) / ( 2348 ) }
}
Is this something you changed? I just thought I would bring up something that looks like an unintended consequence of another change... or maybe has to do with hdf5.
It's also in the docs of .write: https://anndata.readthedocs.io/en/latest/anndata.AnnData.write.html
If you want a default switch to change the behavior, that's easy to set up. :)
Ah, okay... just wanted to make sure this was intended! Thanks @falexwolf
Yes, it was intended. :) Made the change after starting to work with much bigger datasets which took ages to load when compressed...
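As a rough sanity check on the uncompressed size, the arithmetic below estimates the raw CSR payload for the matrix printed earlier in this thread. The shape, dtype and number of stored elements come from that output; the 4-byte width for column indices and row pointers is an assumption, and HDF5 metadata and the obs/var indices are ignored.

```python
# Figures reported in this thread for data.X (27998 x 2348, float32, CSR).
n_rows = 27998
nnz = 4109917

value_bytes = 4  # float32, as shown in the matrix repr
index_bytes = 4  # assumption: int32 column indices and row pointers

data_size = nnz * value_bytes                 # stored values
indices_size = nnz * index_bytes              # one column index per stored value
indptr_size = (n_rows + 1) * index_bytes      # one row pointer per row, plus one

total_bytes = data_size + indices_size + indptr_size
print(f"uncompressed CSR payload: {total_bytes} bytes = {total_bytes / 1e6:.1f} MB")
```

Under these assumptions the payload comes to about 33 MB, at the upper end of the 27-33MB reported for the new cache files, which is consistent with the matrix simply being written without compression.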
Yes, the compression… gzip should be very fast usually but the docs say it’s “moderate speed”. Maybe their implementation isn’t good. However they also say:
LZF filter ("lzf")
Available with every installation of h5py (C source code also available). Low to moderate compression, very fast. No options.
I think we should switch to that one!
The STRSIZE change should actually reduce file size a little bit.
Interesting, I didn't know about lzf! Let's first make some public benchmarks for read speed, write speed, and file size. There are many fans of blosc compression, but that's tricky to install, hence not an option. gzip is definitely much, much slower on large data. I think I once observed 15 min versus 20 sec or so in loading; the writing was even more horrible, where the file would take 30 min to write... It was absolutely prohibitive...
gzip is definitely much, much slower on large data
Sure, I believe you. I just said that I assume it’s the implementation’s fault, not the algorithm’s. I think there are much faster gzip implementations out there.
But for us, lzf is probably the only realistic chance apart from uncompressed. And uncompressed is not very practical either.
Thanks, that option is very useful and works perfectly.
mini-benchmark on my data (compared to non-compressed):
gzip: 35% memory, 3× runtime
lzf: 66% memory, 2× runtime — @VolkerBergen, https://github.com/theislab/scanpy/pull/831#issuecomment-531750992
Great, thank you! So you’re talking about decompression, right? I’m pretty sure the numbers differ based on dataset; the HDF5 people said:
In benchmark trials with floating-point data (below), a filter pipeline with LZF typically provides 3×-5× faster compression than DEFLATE, 2× faster decompression, and retains 50%-90% of the DEFLATE compression ratio.
The argument for turning off compressed caches was that it compresses super slow, which would be fixed if the 3×-5× figure holds up.
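A public benchmark of the kind discussed above could start from a small harness like the following sketch. It times any set of compress/decompress pairs on the same buffer and reports the size ratio together with the best write and read times. LZF is not in the Python standard library, so only DEFLATE (via `zlib`) at two levels and an uncompressed baseline are registered here; an LZF pair from a third-party package would have to be added to `codecs` by hand. The synthetic payload (mostly zeros, about 6% non-zero float32 values) is an assumption standing in for real expression data, and `benchmark` and `make_payload` are hypothetical helper names.

```python
import random
import struct
import time
import zlib


def benchmark(codecs, payload, repeats=3):
    """Time each (compress, decompress) pair; return per-codec stats."""
    results = {}
    for name, (compress, decompress) in codecs.items():
        write_times, read_times = [], []
        for _ in range(repeats):
            start = time.perf_counter()
            packed = compress(payload)
            write_times.append(time.perf_counter() - start)

            start = time.perf_counter()
            restored = decompress(packed)
            read_times.append(time.perf_counter() - start)
            assert restored == payload, f"{name} did not round-trip"
        results[name] = {
            "size_ratio": len(packed) / len(payload),  # 1.0 means no saving
            "write_s": min(write_times),
            "read_s": min(read_times),
        }
    return results


def make_payload(n_values=200_000, density=0.06, seed=0):
    """Synthetic float32 buffer, mostly zeros, standing in for expression data."""
    rng = random.Random(seed)
    values = [rng.random() if rng.random() < density else 0.0 for _ in range(n_values)]
    return struct.pack(f"{n_values}f", *values)


codecs = {
    "none": (bytes, bytes),  # uncompressed baseline
    "deflate-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
    "deflate-6": (lambda b: zlib.compress(b, 6), zlib.decompress),
}

if __name__ == "__main__":
    for name, stats in benchmark(codecs, make_payload()).items():
        print(f"{name:10s} ratio={stats['size_ratio']:.2f} "
              f"write={stats['write_s']:.4f}s read={stats['read_s']:.4f}s")
```

Timings from such a harness depend heavily on the data and the machine, so they only indicate relative behaviour; numbers measured through HDF5 filters on real cache files may differ.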
Hi,
I just updated Scanpy and AnnData from versions 1.3.2 and 0.6.11 to the latest github commits. When running the same notebook (specifically the command sc.read(filename, cache=True)), I got an error about not having sufficient space to store the cache files. This seems to have been some jupyter limitation and rerunning the notebook made it work fine. However, I noticed that the size of the cache files has increased from ~6.5-8.5MB to 27-33MB. Is this intentional and is there a reason for this? These files were created by more or less exactly the same code, which is the Case study tutorial here. I haven't tested yet whether I can use the smaller cache files to load the same data in the new scanpy/anndata versions, but I could try this if you're not sure why this happened.