linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License

Indexing too slow #144

Open rvinas opened 3 years ago

rvinas commented 3 years ago

Hello,

I have a dataset ds with ~60k rows and ~3 million columns. I'd like to retrieve a small set of columns (e.g. at most 100 at a time), but the indexing is far too slow (e.g. ds[:, list_with_100_random_indices]). What is the recommended way to sample data efficiently from the dataset? If there isn't one, is there a workaround (perhaps not using loompy)? This would be really useful for training machine learning models.
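Roughly, the access pattern looks like this (a minimal sketch; "data.loom" stands in for my file):

```python
import numpy as np
import loompy

# Sketch of the slow access pattern; "data.loom" is a placeholder filename.
with loompy.connect("data.loom") as ds:
    # 100 random column indices, sorted because HDF5 fancy indexing
    # requires indices in increasing order
    cols = np.sort(np.random.choice(ds.shape[1], size=100, replace=False))
    batch = ds[:, cols]  # fancy indexing along columns: very slow
```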

Thank you, Ramon

slinnarsson commented 3 years ago

Hi - yes, that's a weakness of the HDF5 format. Under the hood, the data is chunked (I think we use 64x64), so each of your 100 random columns likely lands in a different chunk, and loading 100 columns really means reading 100 chunks, i.e. 6400 columns' worth of data.

One suggestion would be to first permute the columns (using the permute() method and a random permutation vector) and then to sample sets of adjacent columns. The permutation will take a long time, but you do it only once.
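Something along these lines (an untested sketch; "data.loom" is a placeholder, and it assumes the file is writeable since permute() rewrites the matrix on disk):

```python
import numpy as np
import loompy

# One-time step: randomly permute the columns on disk.
# This rewrites the whole matrix, so it can take a long time on ~3M columns.
with loompy.connect("data.loom") as ds:
    ordering = np.random.permutation(ds.shape[1])
    ds.permute(ordering, axis=1)  # axis=1 permutes columns

# Afterwards, adjacent columns are already a random sample of the data,
# so a contiguous slice is both random and chunk-aligned (fast to read).
with loompy.connect("data.loom") as ds:
    start = np.random.randint(0, ds.shape[1] - 100)
    batch = ds[:, start:start + 100]  # contiguous read touches few chunks
```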

For really high performance, the best option is likely to create a dense raw matrix on disk and use numpy memory-mapped arrays (https://numpy.org/doc/stable/reference/generated/numpy.memmap.html). That should let you read at near the speed of the disk.
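For example (an untested sketch; the shape, dtype, and file names are placeholders taken from the numbers in this issue):

```python
import numpy as np
import loompy

n_rows, n_cols = 60_000, 3_000_000  # placeholder shape from the issue

# One-time export (needs n_rows * n_cols * itemsize bytes of disk space).
# Stored as (n_cols, n_rows) so each original column is contiguous on disk.
out = np.memmap("data.raw", dtype="float32", mode="w+",
                shape=(n_cols, n_rows))
with loompy.connect("data.loom") as ds:
    step = 1024
    for i in range(0, n_cols, step):
        j = min(i + step, n_cols)
        out[i:j] = ds[:, i:j].T  # fill in contiguous (chunk-friendly) batches
out.flush()

# Fast random access afterwards:
mm = np.memmap("data.raw", dtype="float32", mode="r",
               shape=(n_cols, n_rows))
idx = np.random.choice(n_cols, size=100, replace=False)
batch = mm[idx]  # shape (100, n_rows); each row is one original column
```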