linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

HDF5 tuning #90

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

So I'm reading up on my HDF5 and h5py, since I want to make sure I strip out the image pyramid for the heatmap in the proper way. I'm also investigating why the loom file keeps increasing in size when we generate heatmap tiles.

Along the way I came across a few interesting things:

Chunk Caching

(emphasis mine)

Every chunked dataset has a chunk cache associated with it that has a default size of 1 MB. The purpose of the chunk cache is to improve performance by keeping chunks that are accessed frequently in memory so that they do not have to be accessed from disk. If a chunk is too large to fit in the chunk cache, it can significantly degrade performance. However, the size of the chunk cache can be increased by calling H5Pset_chunk_cache.

https://support.hdfgroup.org/HDF5/Tutor/layout.html

Ok, so let's see how this works out for us. Let's do a column fetch: that's 25000 rows (the number of genes). The default chunk size is 10x10, and with 32-bit floats that is 400 bytes per chunk. Each chunk spans ten rows, so a single-column fetch touches 25000 / 10 = 2500 chunks, which comes to 2500 * 400 bytes = 1 MB.

Yeeeaahh... we might want to increase that cache value... I mean we're planning to have the datasets grow a lot bigger than 25000 cells, right? Would a few hundred megabytes or so work?

Now, h5py does not actually let us set the chunk cache. However, there is a complementary h5py_cache package, which is also available on Anaconda:

https://github.com/moble/h5py_cache

https://anaconda.org/moble/h5py_cache
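To make the arithmetic above concrete, and to show how a bigger chunk cache could be set, here is a small sketch. The filename is a placeholder, and the h5py_cache call is based on that package's README, so treat the exact signature (in particular chunk_cache_mem_size) as an assumption to verify:

# Back-of-the-envelope: a single-column fetch with 10x10 chunks already fills
# the default 1 MB chunk cache.
import h5py_cache

genes = 25000                  # rows in the matrix
chunk_rows, chunk_cols = 10, 10
bytes_per_value = 4            # float32
bytes_per_chunk = chunk_rows * chunk_cols * bytes_per_value   # 400 bytes
chunks_per_column = genes // chunk_rows                       # 2500
print(chunks_per_column * bytes_per_chunk)                    # 1000000 bytes, ~1 MB

# Open a loom file with a ~200 MB chunk cache instead of the 1 MB default
# ("some_file.loom" is a placeholder; chunk_cache_mem_size is assumed from the README):
with h5py_cache.File("some_file.loom", "r", chunk_cache_mem_size=200 * 1024 ** 2) as f:
    column = f["matrix"][:, 0]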

SHUFFLE filter

(The SHUFFLE filter) exploits the fact that, for many datasets, most of the entropy occurs in only a few bytes of the type. For example, if you have a 4-byte unsigned integer dataset but your values are generally zero to a few thousand, most of the data resides in the (third and fourth byte). The SHUFFLE filter repacks the data in the chunk so that all the first bytes of the integers are together, then all the second bytes, and so on. (...) For dictionary-based compressors like GZIP and LZF, it’s much more efficient to compress long runs of identical values, like all the zero values collected from the first two bytes of the dataset’s integers. (There would also be savings) from repeated elements at the third byte position.

The SHUFFLE filter is very, very fast (negligible compared to the compression time). Here’s how to enable it in conjunction with GZIP:

>>> dset = myfile.create_dataset("Data", (1000,), compression="gzip", shuffle=True)

https://www.safaribooksonline.com/library/view/python-and-hdf5/9781491944981/ch04.html

The example mentions integers, and the float32 format is a bit more complex than that, using a sign bit, an exponent and a fraction:

(image: IEEE 754 float32 bit layout: sign bit, exponent, fraction)

... I do happen to know a few things about our data: the values are always positive whole numbers, and strongly centred around zero, at least for the default loom files that come out of the pipeline. With that in mind, I expect that this point about entropy not being distributed equally across the bytes also holds for our float32 values.

So I would expect that the shuffle filter increases compression ratios, at (supposedly) negligible performance costs - it might even improve performance since reading from disk is likely a greater bottleneck.
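This is easy to test empirically. A small sketch of such a comparison, using synthetic Poisson counts as a stand-in for our data (filenames and sizes are arbitrary):

# Compare on-disk size with and without the SHUFFLE filter for count-like
# float32 data. The Poisson counts here are a synthetic stand-in, not real loom data.
import os
import h5py
import numpy as np

data = np.random.poisson(0.2, size=(25000, 500)).astype("float32")
for shuffle in (False, True):
    fname = "shuffle_test_%s.h5" % shuffle
    with h5py.File(fname, "w") as f:
        f.create_dataset("matrix", data=data, chunks=(64, 64),
                         compression="gzip", shuffle=shuffle)
    print(shuffle, os.path.getsize(fname))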

FLETCHER32 checksum

Accidents happen. When you’re storing or transmitting a multiterabyte dataset, you’d like to be sure that the bytes that come out of the file are the same ones you put in. HDF5 includes a checksum filter for just this purpose. It uses a 32-bit implementation of Fletcher’s checksum, hence the name FLETCHER32.

A checksum is computed when each chunk is written to the file, and recorded in the chunk’s metadata. When the chunk is read, the checksum is computed again and compared to the old one. If they don’t match, an error is raised and the read fails.

Fletcher's checksum is a very fast checksum algorithm, to the point where I expect that fetching bytes from the disk takes longer than it takes to compute this checksum alongside it (even on SSDs). I don't expect this to significantly slow down performance if we enable it. We might want to try it out just to be sure all the data is correct?

https://www.safaribooksonline.com/library/view/python-and-hdf5/9781491944981/ch04.html
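If we want to try it, h5py exposes this as a keyword on create_dataset. A minimal sketch (the file and dataset names are placeholders):

# Enable the FLETCHER32 checksum filter on a chunked, compressed dataset.
import h5py
import numpy as np

with h5py.File("checksum_test.h5", "w") as f:
    f.create_dataset("Data", data=np.zeros(1000, dtype="float32"),
                     compression="gzip", fletcher32=True)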

JobLeonard commented 7 years ago

Fancy Indexing is slow when used with h5py

a big part of this is that the fancy-indexing code in h5py uses a naive algorithm based on repeated hyperslab selection, which is quadratic in the number of indices. It was designed/tested for small numbers of indices.

The particular example you have here (0 to 10000 in steps of 10) can be mapped to slices (although of course this is not generally true). In this case the results are:

%%timeit f = h5py.File('test.h5','r')
f['/test'][0:10000:10]

100 loops, best of 3: 12 ms per loop

This is a great argument to improve the implementation of fancy indexing in h5py, but I would hesitate to conclude "HDF5 is slow".

https://gist.github.com/rossant/7b4704e8caeb8f173084#gistcomment-1665072

slinnarsson commented 7 years ago

I’ve implemented a batch-wise fancy indexing in loompy.py, in the method batch_scan(). It loads rectangles one at a time and yields the relevant columns (rows) for each batch. This makes it much faster to iterate over a large file.

/Sten


JobLeonard commented 7 years ago

Ok, so that takes care of the fancy indexing issue then.
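For context, the batch-wise approach is roughly the following: read contiguous rectangles (cheap hyperslab selections) and pick out the wanted columns from each in-memory block, instead of fancy-indexing the file directly. An illustrative sketch, not the actual loompy batch_scan() implementation:

# Illustrative sketch of batch-wise scanning over a large HDF5 matrix.
import h5py
import numpy as np

def scan_columns(filename, col_indices, batch_size=512):
    """Yield (column indices, submatrix) for one batch of columns at a time."""
    col_indices = np.sort(np.asarray(col_indices))
    with h5py.File(filename, "r") as f:
        matrix = f["matrix"]
        n_cols = matrix.shape[1]
        for start in range(0, n_cols, batch_size):
            stop = min(start + batch_size, n_cols)
            wanted = col_indices[(col_indices >= start) & (col_indices < stop)]
            if wanted.size:
                block = matrix[:, start:stop]    # one rectangular read
                yield wanted, block[:, wanted - start]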

I'd like to see whether changing the caching settings will make it even faster. We basically have two opposing access patterns to think about: sequential reading (like batch_scan) and random access.

To expand on that second point:

I'm going to assume a million cells is our upper bound for now. Using float32 values that means each row is about 4MB of data. Given 25k genes, each column is roughly 100kB. Say we bump the cache size to a conservative 200MB. In that case the cache holds roughly fifty full rows, or about two thousand full columns.

Currently, our chunk size is 10x10, so for random row access that's a slowdown of 5x, and random column access a slowdown of 200x.

So as a quick, safe trade-off I'll try a 64x64 chunk size with 512MB chunk cache for now.

¹ Actually, in theory we could do "snakewise" accessing, where we access even rows/columns incrementally, and odd rows/columns decrementally. Then we "only" have to re-read the bumped-out chunks at the edges every time. However, that's a ridiculous code complication at probably very little gain.
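For reference, the back-of-the-envelope numbers above, spelled out (all figures assume float32 and the 25k x 1M upper bound):

# Rough sizes for a 25000 genes x 1000000 cells float32 matrix.
genes, cells, bytes_per_value = 25000, 1000000, 4

row_bytes = cells * bytes_per_value     # one gene across all cells: ~4 MB
col_bytes = genes * bytes_per_value     # one cell across all genes: ~100 kB
cache_bytes = 200 * 1024 ** 2           # a 200 MB chunk cache

print(row_bytes / 1e6, "MB per row")                     # 4.0
print(col_bytes / 1e3, "kB per column")                  # 100.0
print(cache_bytes // row_bytes, "rows fit in cache")     # 52
print(cache_bytes // col_bytes, "columns fit in cache")  # 2097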

JobLeonard commented 7 years ago

Aside: is the desired default behaviour not compressing the loom files? Because if not, I think this is a bug, since most of the time we do not pass a compression_opts argument when calling loompy.create:

if compression_opts:
    f.create_dataset('matrix', data=matrix.astype(matrix_dtype), maxshape=(matrix.shape[0], None), chunks=(min(chunks[0], matrix.shape[0]), min(chunks[1], matrix.shape[1])))
else:
    f.create_dataset('matrix', data=matrix.astype(matrix_dtype), maxshape=(matrix.shape[0], None), chunks=(min(chunks[0], matrix.shape[0]), min(chunks[1], matrix.shape[1])), compression="gzip",compression_opts=compression_opts)

from lines 78-81 of loompy.py

EDIT: Wait, that's a full-on bug anyway, since passing compression_opts makes us throw it away.
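For reference, if the intent is "compress by default, and honour compression_opts when it is passed", the branches above would simply be swapped and the option forwarded. A sketch, reusing the same variables as the quoted loompy code:

# Sketch of the intended branching (same variables as the snippet above).
chunk_shape = (min(chunks[0], matrix.shape[0]), min(chunks[1], matrix.shape[1]))
if compression_opts:
    f.create_dataset('matrix', data=matrix.astype(matrix_dtype),
                     maxshape=(matrix.shape[0], None), chunks=chunk_shape,
                     compression="gzip", compression_opts=compression_opts)
else:
    f.create_dataset('matrix', data=matrix.astype(matrix_dtype),
                     maxshape=(matrix.shape[0], None), chunks=chunk_shape,
                     compression="gzip")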

JobLeonard commented 7 years ago

The maintainers of h5py recommend "an absolute minimum of 10k", or 100x100, for chunk size. Of course that's a general rule of thumb, so we should check how that applies to our data.

Some numbers using the h5repack tool to test various settings, all with a "fresh" loom file off the server:

 218276 10X13_2.loom
 217700 10X13_2-repack-oldchunk.loom
  60880 10X13_2-repack-32.loom
  44564 10X13_2-repack-64.loom
  40064 10X13_2-repack-128.loom
  38772 10X13_2-repack-256.loom

1569980 Forebrain_E9-E18.5.loom
1174480 Forebrain_E9-E18.5-repack-oldchunk.loom
 464172 Forebrain_E9-E18.5-repack-32.loom
 382532 Forebrain_E9-E18.5-repack-64.loom
 357568 Forebrain_E9-E18.5-repack-128.loom
 348700 Forebrain_E9-E18.5-repack-256.loom

In terms of compression I think 64x64 is a nice sweet spot for chunk size, before the diminishing returns kick in. The question is whether it is also the fastest read option; I will put together some random-access and sequential-access scripts to find out.

JobLeonard commented 7 years ago

One of the purposes of h5repack is to reclaim empty space when data is deleted in an HDF5 file, so oldchunk is just a test to see if we accidentally pad our HDF5 files in the pipeline. With that in mind I also tested the "bloated" Forebrain file:

3097976 Forebrain_E9-E18.5.loom
2608288 Forebrain_E9-E18.5-repack-oldchunk.loom

So we can see that whatever caused the bloat of the loom files during tile generation is permanent. I tried using h5dump to see what's going on. For the fresh Forebrain file the output was a 27 MB text file; for the bloated one it was a 223 MB (!) text file.

The vast majority of this dump is the tiles group, which is structured like this:

   GROUP "tiles" {
      GROUP "10z" {
         DATASET "0x_0y" {
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 256, 256 ) / ( 256, 256 ) }
            STORAGE_LAYOUT {
               CHUNKED ( 32, 64 )
               SIZE 6004 (43.662:1 COMPRESSION)
            }
            FILTERS {
               COMPRESSION DEFLATE { LEVEL 4 }
            }
            FILLVALUE {
               FILL_TIME H5D_FILL_TIME_ALLOC
               VALUE  H5D_FILL_VALUE_DEFAULT
            }
            ALLOCATION_TIME {
               H5D_ALLOC_TIME_INCR
            }
         }
         DATASET "0x_1y" {
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 256, 256 ) / ( 256, 256 ) }
            STORAGE_LAYOUT {
               CHUNKED ( 32, 64 )
               SIZE 6019 (43.553:1 COMPRESSION)
            }
            FILTERS {
               COMPRESSION DEFLATE { LEVEL 4 }
            }
            FILLVALUE {
               FILL_TIME H5D_FILL_TIME_ALLOC
               VALUE  H5D_FILL_VALUE_DEFAULT
            }
            ALLOCATION_TIME {
               H5D_ALLOC_TIME_INCR
            }
         }
         DATASET "0x_2y" {
            DATATYPE  H5T_IEEE_F32LE
            DATASPACE  SIMPLE { ( 256, 256 ) / ( 256, 256 ) }
            STORAGE_LAYOUT {
               CHUNKED ( 32, 64 )
               SIZE 6598 (39.731:1 COMPRESSION)
            }
            FILTERS {
               COMPRESSION DEFLATE { LEVEL 4 }
            }
            FILLVALUE {
               FILL_TIME H5D_FILL_TIME_ALLOC
               VALUE  H5D_FILL_VALUE_DEFAULT
            }
            ALLOCATION_TIME {
               H5D_ALLOC_TIME_INCR
            }
         }
... etc

Again, the separate folder with all the PNGs is about 170MB, so this is more evidence that storing the image pyramid in the Loom is not the best approach.

JobLeonard commented 7 years ago

Also, just as a safeguard in the future, I've added an optional mode parameter to LoomConnection that lets us open loom files in read-only mode. I recommend changing the code where we know there should not be any modifications to the loom file.

    def __init__(self, filename, mode='r+'):
        """
        Establish a connection to a .loom file.

        Args:
            filename (str):      Name of the .loom file to open
            mode (str):          read/write mode, accepts 'r+' (read/write) or
                                 'r' (read-only), defaults to 'r+'

        Returns:
            Nothing.

        Row and column attributes are loaded into memory for fast access.
        """

(It might actually make things faster too, since it means it is safe to do multiple reads from the same file.)
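Usage then looks like this (the filename is a placeholder; I'm assuming connect() simply passes the mode through to LoomConnection, as in the benchmark code below):

# Open an existing loom file read-only; writes will now fail instead of
# silently modifying the file.
import loompy

ds = loompy.connect("some_dataset.loom", "r")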

JobLeonard commented 7 years ago

Ok, this is getting positively weird: naive sequential access is slightly slower than random access, regardless of chunk size or cache settings.

Not by much, mind you, but still.

Here's the fairly dumb benchmark code I used:

import logging
import timeit

def benchmark_command(infile):
    logging.info("Benchmarking %s random row access, 10 x 100" % infile)
    setup = 'gc.enable(); import loompy; import random; ds = loompy.connect("%s", "r"); rmax = ds.file["matrix"].shape[0]-1' % infile
    testfunc = 'for i in range(0,100): ds[random.randint(0, rmax),:]'
    t = timeit.Timer(testfunc, setup)
    logging.info(t.timeit(10))

    logging.info("Benchmarking %s loading 100 rows at once, 10 x" % infile)
    setup = 'gc.enable(); import loompy; import random; ds = loompy.connect("%s", "r")' % infile
    testfunc = 'ds[0:100,:]'
    t = timeit.Timer(testfunc, setup)
    logging.info(t.timeit(10))

    logging.info("Benchmarking %s sequential row access, 10 x 100" % infile)
    setup = 'gc.enable(); import loompy; import random; ds = loompy.connect("%s", "r")' % infile
    testfunc = 'for i in range(0,100): ds[i,:]'
    t = timeit.Timer(testfunc, setup)
    logging.info(t.timeit(10))

To make comparisons easier I've sorted and formatted the output:

random row access, 10 x 100
 3.1995459190002293   cortex.loom
 2.176201806003519    cortex_032.loom
 2.764606158998504    cortex_064.loom
 4.528644990998146    cortex_128.loom
 8.311958325997693    cortex_256.loom
40.718590956996195    cortex_1024_cache.loom

loading 100 rows at once, 10 x
0.036810575002164114  cortex.loom
0.0848499380008434    cortex_032.loom
0.05705405500339111   cortex_064.loom
0.05021580500033451   cortex_128.loom
0.08792913499928545   cortex_256.loom
0.4272061660012696    cortex_1024_cache.loom

sequential row access, 10 x 100
 3.419669454000541    cortex.loom
 2.300491485002567    cortex_032.loom
 2.924495070001285    cortex_064.loom
 4.85549013200216     cortex_128.loom
 8.670183384005213    cortex_256.loom
42.35230635100015     cortex_1024_cache.loom

Recall that the original cortex.loom uses 10x10 chunks. I tried giving all of these royally huge cache sizes, but I'm not seeing any differences. I think somewhere along the way h5py overwrites these manual cache settings again, or something, so we're not seeing any effect. So I'm reverting back to regular h5py instead of h5py_cache.

Anyway, looks like 64x64 chunk size does not reduce performance significantly, so I'm going to go ahead and push this.

JobLeonard commented 7 years ago

Some final example numbers:


218272 10X13_2.loom
217704 10X13_2-repack-oldchunk.loom
 60880 10X13_2-repack-032.loom
 44564 10X13_2-repack-064.loom
 40064 10X13_2-repack-128.loom
 38772 10X13_2-repack-256.loom
 18788 10X13_2-64.loom
 25036 10X13_2-64.loom.tiles/

Combined with #88, loom files are now more than ten times smaller than before (which is nice for downloading), and with separate tiles included we save about five times the space. Performance is about the same, or perhaps even slightly better. These numbers may vary based on how many zeros our datasets contain, of course.

If we ever figure out how to get that stupid chunk caching to work properly, we might even get a bigger boost, but I'm kinda done with this for now and would like to get back to the front-end.

One last thing: watch out when using loom from-loom, as it loads the entire data matrix into memory before copying it. Again, this might be optimised, but since it was only added as a testing command to compare different chunking schemes I don't really care at the moment.
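If we ever do want to fix that, the copy can be done block-wise instead. A sketch (hypothetical helper, not the current loom from-loom code):

# Copy the matrix in column blocks instead of loading it all into memory.
import h5py

def copy_matrix_blockwise(src_path, dst_path, batch_size=512):
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        matrix = src["matrix"]
        out = dst.create_dataset("matrix", shape=matrix.shape, dtype=matrix.dtype,
                                 maxshape=(matrix.shape[0], None),
                                 chunks=(min(64, matrix.shape[0]), 64),
                                 compression="gzip", shuffle=True)
        for start in range(0, matrix.shape[1], batch_size):
            stop = min(start + batch_size, matrix.shape[1])
            out[:, start:stop] = matrix[:, start:stop]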

JobLeonard commented 7 years ago

So I did one last attempt at figuring out why the random access isn't significantly faster than the sequential access, using pycallgraph (like I did when optimising BackSPIN). I learned... nothing.

(call graph images: pycallgraph_fancy, pycallgraph_random, pycallgraph_seq)

Basically, the profiler stops once we hit the C code, so whatever is causing the slowdown, it's somewhere in the guts of HDF5, and I can't reach it. Sigh, I'll just have to accept we don't really know why HDF5 performs the way it does.