Fancy Indexing is slow when used with h5py
a big part of this is that the fancy-indexing code in h5py uses a naive algorithm based on repeated hyperslab selection, which is quadratic in the number of indices. It was designed/tested for small numbers of indices.
The particular example you have here (0 to 10000 in steps of 10) can be mapped to slices (although of course this is not generally true). In this case the results are:
```
%%timeit f = h5py.File('test.h5','r')
f['/test'][0:10000:10]

100 loops, best of 3: 12 ms per loop
```
This is a great argument to improve the implementation of fancy indexing in h5py, but I would hesitate to conclude "HDF5 is slow".
https://gist.github.com/rossant/7b4704e8caeb8f173084#gistcomment-1665072
I’ve implemented a batch-wise fancy indexing in loompy.py, in the method batch_scan(). It loads rectangles one at a time and yields the relevant columns (rows) for each batch. This makes it much faster to iterate over a large file.
/Sten
--
Sten Linnarsson, PhD
Professor of Molecular Systems Biology
Karolinska Institutet
Unit of Molecular Neurobiology
Department of Medical Biochemistry and Biophysics
Scheeles väg 1, 171 77 Stockholm, Sweden
+46 8 52 48 75 77 (office)
+46 70 399 32 06 (mobile)
On 8 Feb 2017, at 12:44, Job van der Zwan notifications@github.com wrote:
> Fancy Indexing is slow when used with h5py […]
Ok, so that takes care of the fancy indexing issue then.
I'd like to try whether changing the caching settings will make it even faster. We basically have two opposing methods of access to think about: sequential reading (like `batch_scan`) and random access.
To expand on that second point:
I'm going to assume a million cells is our upper bound for now. Using float32 values that means each row is about 4MB of data. Given 25k genes, each column is roughly 100kB. Say we bump the cache size to a conservative 200MB. In that case:
Currently, our chunk size is 10x10, so for random row access that's a slowdown of 5x, and random column access a slowdown of 200x.
So as a quick, safe trade-off I'll try a 64x64 chunk size with 512MB chunk cache for now.
¹ Actually, in theory we could do "snakewise" accessing, where we access even rows/columns incrementally and odd rows/columns decrementally. Then we "only" have to re-read the bumped-out chunks at the edges every time. However, that's a ridiculous code complication for probably very little gain.
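For reference, here's a minimal sketch of how a 64x64-chunked, gzip-compressed matrix dataset would be created with plain h5py. The file name, data and gzip level are just placeholders, not what `loompy.create` actually uses:

```python
import h5py
import numpy as np

matrix = np.zeros((25000, 5000), dtype="float32")  # placeholder data

with h5py.File("example.loom", "w") as f:
    f.create_dataset(
        "matrix",
        data=matrix,
        maxshape=(matrix.shape[0], None),  # columns (cells) can be appended later
        chunks=(min(64, matrix.shape[0]), min(64, matrix.shape[1])),
        compression="gzip",
        compression_opts=2,                # placeholder gzip level
    )
```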
Aside: is the desired default behaviour not compressing the loom files? Because if not, I think this is a bug, since most of the time we do not pass a `compression_opts` argument when calling `loompy.create`:
```python
if compression_opts:
    f.create_dataset('matrix', data=matrix.astype(matrix_dtype), maxshape=(matrix.shape[0], None), chunks=(min(chunks[0], matrix.shape[0]), min(chunks[1], matrix.shape[1])))
else:
    f.create_dataset('matrix', data=matrix.astype(matrix_dtype), maxshape=(matrix.shape[0], None), chunks=(min(chunks[0], matrix.shape[0]), min(chunks[1], matrix.shape[1])), compression="gzip", compression_opts=compression_opts)
```
EDIT: Wait, that's a full-on bug either way: the branches are swapped, so passing `compression_opts` actually disables compression and throws the value away.
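For what it's worth, a minimal sketch of what the corrected branches might look like (same arguments as the code above, just swapping which branch compresses):

```python
if compression_opts:
    # compression requested: enable gzip with the requested level
    f.create_dataset('matrix', data=matrix.astype(matrix_dtype),
                     maxshape=(matrix.shape[0], None),
                     chunks=(min(chunks[0], matrix.shape[0]), min(chunks[1], matrix.shape[1])),
                     compression="gzip", compression_opts=compression_opts)
else:
    # no compression requested
    f.create_dataset('matrix', data=matrix.astype(matrix_dtype),
                     maxshape=(matrix.shape[0], None),
                     chunks=(min(chunks[0], matrix.shape[0]), min(chunks[1], matrix.shape[1])))
```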
The maintainers of `h5py` recommend "an absolute minimum of 10k", or 100x100, for chunk size. Of course that's a general rule of thumb, so we should check how it applies to our data.
Some numbers using the `h5repack` tool to test various settings, all with a "fresh" loom file off the server:
218276 10X13_2.loom
217700 10X13_2-repack-oldchunk.loom
60880 10X13_2-repack-32.loom
44564 10X13_2-repack-64.loom
40064 10X13_2-repack-128.loom
38772 10X13_2-repack-256.loom
1569980 Forebrain_E9-E18.5.loom
1174480 Forebrain_E9-E18.5-repack-oldchunk.loom
464172 Forebrain_E9-E18.5-repack-32.loom
382532 Forebrain_E9-E18.5-repack-64.loom
357568 Forebrain_E9-E18.5-repack-128.loom
348700 Forebrain_E9-E18.5-repack-256.loom
In terms of compression I think 64x64 is a nice sweet spot for chunk size, before the diminishing returns kick in. The question is whether it is actually the fastest read option; I'll write some random-access and sequential-access scripts to find out.
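(For the record, the repacked variants above were produced with h5repack invocations roughly along these lines; the dataset path `matrix` and the gzip level here are assumptions, so adjust as needed:)

```
h5repack -l matrix:CHUNK=64x64 -f matrix:GZIP=4 10X13_2.loom 10X13_2-repack-64.loom
```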
One of the purposes of `h5repack` is to reclaim empty space after data is deleted from an HDF5 file, so `oldchunk` is just a test to see if we accidentally pad our HDF5 files in the pipeline. With that in mind I also tested the "bloated" Forebrain file:
3097976 Forebrain_E9-E18.5.loom
2608288 Forebrain_E9-E18.5-repack-oldchunk.loom
So we can see that whatever caused the bloat of the loom files during tile generation is permanent. I tried using `h5dump` to see what's going on. For the fresh Forebrain file, the output was a 27 MB text file... for the bloated one it was a 223 MB (!) text file.
The vast majority of this dump is the tiles, which are structured like:
GROUP "tiles" {
GROUP "10z" {
DATASET "0x_0y" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 256, 256 ) / ( 256, 256 ) }
STORAGE_LAYOUT {
CHUNKED ( 32, 64 )
SIZE 6004 (43.662:1 COMPRESSION)
}
FILTERS {
COMPRESSION DEFLATE { LEVEL 4 }
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_ALLOC
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_INCR
}
}
DATASET "0x_1y" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 256, 256 ) / ( 256, 256 ) }
STORAGE_LAYOUT {
CHUNKED ( 32, 64 )
SIZE 6019 (43.553:1 COMPRESSION)
}
FILTERS {
COMPRESSION DEFLATE { LEVEL 4 }
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_ALLOC
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_INCR
}
}
DATASET "0x_2y" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 256, 256 ) / ( 256, 256 ) }
STORAGE_LAYOUT {
CHUNKED ( 32, 64 )
SIZE 6598 (39.731:1 COMPRESSION)
}
FILTERS {
COMPRESSION DEFLATE { LEVEL 4 }
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_ALLOC
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_INCR
}
}
... etc
Again, the separate folder with all the PNGs is about 170MB, so this is more evidence that storing the image pyramid in the Loom is not the best approach.
Also, just as a safeguard for the future, I've added an optional `mode` parameter to `LoomConnection` that lets us open loom files in read-only mode. I recommend changing the code wherever we know there should not be any modifications to the loom file.
```python
def __init__(self, filename, mode='r+'):
    """
    Establish a connection to a .loom file.

    Args:
        filename (str):  Name of the .loom file to open
        mode (str):      read/write mode, accepts 'r+' (read/write) or
                         'r' (read-only), defaults to 'r+'

    Returns:
        Nothing.

    Row and column attributes are loaded into memory for fast access.
    """
```
(It might actually make things faster too, since it means it is safe to do multiple reads from the same file.)
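Opening read-only then looks like this (mirroring the benchmark code further down; the file name is just an example, and the write-fails comment assumes the mode is passed straight through to h5py):

```python
import loompy

ds = loompy.connect("cortex.loom", "r")   # read-only connection
row = ds[0, :]                            # reads work as before; writes should now fail
```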
Ok, this is getting positively weird: naive sequential access is slightly slower than random access, regardless of chunk size and cache settings. Not by much, mind you, but still.
Here's the fairly dumb benchmark code I used:
```python
import logging
import timeit

def benchmark_command(infile):
    # random row access: 10 timed runs of 100 random row reads each
    logging.info("Benchmarking %s random row access, 10 x 100" % infile)
    setup = 'gc.enable(); import loompy; import random; ds = loompy.connect("%s", "r"); rmax = ds.file["matrix"].shape[0]-1' % infile
    testfunc = 'for i in range(0,100): ds[random.randint(0, rmax),:]'
    t = timeit.Timer(testfunc, setup)
    logging.info(t.timeit(10))

    # bulk access: 10 timed runs reading 100 rows in one slice
    logging.info("Benchmarking %s loading 100 rows at once, 10 x" % infile)
    setup = 'gc.enable(); import loompy; import random; ds = loompy.connect("%s", "r")' % infile
    testfunc = 'ds[0:100,:]'
    t = timeit.Timer(testfunc, setup)
    logging.info(t.timeit(10))

    # sequential row access: 10 timed runs of 100 consecutive row reads each
    logging.info("Benchmarking %s sequential row access, 10 x 100" % infile)
    setup = 'gc.enable(); import loompy; import random; ds = loompy.connect("%s", "r")' % infile
    testfunc = 'for i in range(0,100): ds[i,:]'
    t = timeit.Timer(testfunc, setup)
    logging.info(t.timeit(10))
```
To make comparisons easier I've sorted and formatted the output (times are total seconds for the ten runs of each test):
random row access, 10 x 100
3.1995459190002293 cortex.loom
2.176201806003519 cortex_032.loom
2.764606158998504 cortex_064.loom
4.528644990998146 cortex_128.loom
8.311958325997693 cortex_256.loom
40.718590956996195 cortex_1024_cache.loom
loading 100 rows at once, 10 x
0.036810575002164114 cortex.loom
0.0848499380008434 cortex_032.loom
0.05705405500339111 cortex_064.loom
0.05021580500033451 cortex_128.loom
0.08792913499928545 cortex_256.loom
0.4272061660012696 cortex_1024_cache.loom
sequential row access, 10 x 100
3.419669454000541 cortex.loom
2.300491485002567 cortex_032.loom
2.924495070001285 cortex_064.loom
4.85549013200216 cortex_128.loom
8.670183384005213 cortex_256.loom
42.35230635100015 cortex_1024_cache.loom
Recall that the original cortex.loom is 10x10 chunks. I tried giving all of these royally huge cache sizes, but I'm not seeing any differences. I think somewhere along the way `h5py` overwrites these manual cache settings again, or something, so we're not seeing any effect. So I'm reverting back to regular `h5py` instead of `h5py_cache`.
Anyway, looks like 64x64 chunk size does not reduce performance significantly, so I'm going to go ahead and push this.
Some final example numbers:
218272 10X13_2.loom
217704 10X13_2-repack-oldchunk.loom
60880 10X13_2-repack-032.loom
44564 10X13_2-repack-064.loom
40064 10X13_2-repack-128.loom
38772 10X13_2-repack-256.loom
18788 10X13_2-64.loom
25036 10X13_2-64.loom.tiles/
Combined with #88, loom files are now more than ten times smaller than before (which is nice for downloading), and even with the separate tiles included we still take up about five times less space. Performance is about the same, or perhaps even slightly better. These numbers will vary based on how many zeros our datasets contain, of course.
If we ever figure out how to get that stupid chunk caching to work properly, we might even get a bigger boost, but I'm kinda done with this for now and would like to get back to the front-end.
One last thing: watch out when using `loom from-loom`, as it loads the entire data matrix into memory before copying it. That might be optimised too, but since it was only added as a testing command to compare different chunking schemes, I don't really care at the moment.
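If it ever does need optimising, here's a minimal sketch of the idea: copy the matrix in column blocks with plain h5py rather than loading it whole. The dataset name, chunking and block size are just examples, and this ignores the loom attributes entirely:

```python
import h5py

def copy_matrix_in_blocks(src_path, dst_path, block=512):
    """Copy the 'matrix' dataset block by block instead of all at once."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        matrix = src["matrix"]
        rows, cols = matrix.shape
        out = dst.create_dataset("matrix", shape=(rows, cols), dtype=matrix.dtype,
                                 maxshape=(rows, None),
                                 chunks=(min(64, rows), min(64, cols)),
                                 compression="gzip")
        for start in range(0, cols, block):
            stop = min(start + block, cols)
            # only one column block is held in memory at a time
            out[:, start:stop] = matrix[:, start:stop]
```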
So I made one last attempt at figuring out why random access performs essentially the same as (even slightly better than) sequential access, using pygraph (like I did when optimising BackSPIN). I learned... nothing.
Basically, the profiler stops once we hit the C code, so whatever is causing the slowdown, it's somewhere in the guts of HDF5 and I can't reach it. Sigh. I'll just have to accept that we don't really know why HDF5 performs the way it does.
So I'm reading up on my HDF5 and `h5py`, since I want to make sure I strip out the image pyramid for the heatmap in the proper way. I'm also investigating why the loom file keeps increasing in size when we generate heatmap tiles. Along the way I came across a few interesting things:
Chunk Caching
https://support.hdfgroup.org/HDF5/Tutor/layout.html
Ok, so let's see how this works out for us. Let's do a column fetch: that's 25000 rows (because that's the number of genes). The default chunk size is 10x10 and we use 32-bit floats, so that is 400 bytes per chunk. Each chunk spans ten rows, so a single column fetch touches 25000 * 400 bytes / 10 = 1 MB of chunk data, which is exactly the default chunk cache size.
Yeeeaahh... we might want to increase that cache value... I mean we're planning to have the datasets grow a lot bigger than 25000 cells, right? Would a few hundred megabytes or so work?
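A quick back-of-the-envelope sketch of how much chunk data a single full-column fetch touches for a few chunk sizes (pure arithmetic, assuming 25000 genes and float32 values):

```python
GENES = 25000          # rows in the matrix
BYTES_PER_VALUE = 4    # float32

for chunk in (10, 64, 128, 256):
    # a column fetch touches one column of chunks: GENES/chunk chunks,
    # each holding chunk*chunk values
    touched = (GENES / chunk) * (chunk * chunk) * BYTES_PER_VALUE
    print("chunk %3dx%-3d -> %5.1f MB touched per column fetch" % (chunk, chunk, touched / 1e6))
```

So bigger chunks make the case for a bigger chunk cache even stronger.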
Now, `h5py` does not actually let us set the chunk cache. However, there is a complementary `h5py_cache` package, which is also available on Anaconda:
https://github.com/moble/h5py_cache
https://anaconda.org/moble/h5py_cache
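If I read its README correctly, usage would be something along these lines; the `chunk_cache_mem_size` parameter name is taken from my reading of the package docs, so treat this as a sketch:

```python
import h5py_cache

# open the file with a ~200 MB raw-data chunk cache instead of the 1 MB default
with h5py_cache.File("cortex.loom", "r", chunk_cache_mem_size=200 * 1024**2) as f:
    column = f["matrix"][:, 0]
```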
SHUFFLE filter
https://www.safaribooksonline.com/library/view/python-and-hdf5/9781491944981/ch04.html
The example mentions integers, and the float32 format is a bit more complex than integers, using a sign bit, an exponent and a fraction.
I do happen to know that our values are always positive whole numbers and strongly centred around zero, at least for the default loom files that come out of the pipeline. With that in mind, I expect that this point about entropy not being distributed equally across the bytes also holds for our float32 values.
So I would expect the shuffle filter to increase compression ratios at (supposedly) negligible performance cost; it might even improve performance, since reading from disk is likely the greater bottleneck.
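Enabling it in h5py is just an extra flag on `create_dataset` (a sketch; the file name and data are placeholders):

```python
import h5py
import numpy as np

data = np.zeros((25000, 3000), dtype="float32")  # placeholder matrix

with h5py.File("example_shuffle.loom", "w") as f:
    f.create_dataset("matrix", data=data, chunks=(64, 64),
                     compression="gzip",
                     shuffle=True)   # byte-shuffle each chunk before gzip
```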
FLETCHER32 checksum
Fletcher's checksum is a very fast algorithm, to the point where I expect that fetching the bytes from disk takes longer than computing the checksum alongside it (even on SSDs), so I don't expect enabling it to slow us down significantly. We might want to try it out, just to be sure all the data is correct?
https://www.safaribooksonline.com/library/view/python-and-hdf5/9781491944981/ch04.html
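In h5py this is also a single flag on `create_dataset` (again just a sketch with placeholder names):

```python
import h5py
import numpy as np

data = np.zeros((25000, 3000), dtype="float32")  # placeholder matrix

with h5py.File("example_fletcher32.loom", "w") as f:
    f.create_dataset("matrix", data=data, chunks=(64, 64),
                     compression="gzip", shuffle=True,
                     fletcher32=True)   # store a Fletcher-32 checksum per chunk, verified on read
```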