linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License

Indexing too slow #144

Open rvinas opened 3 years ago

rvinas commented 3 years ago

Hello,

I have a dataset ds with ~60k rows and ~3 million columns. I'd like to retrieve a small set of columns (e.g. at most 100 at a time), but the indexing is far too slow (e.g. ds[:, list_with_100_random_indices]). What is the recommended way to sample data efficiently from the dataset? If there isn't one, is there a workaround (perhaps not using loompy)? This would be really useful for training machine learning models.
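Roughly, the access pattern looks like this (a minimal sketch; "data.loom" stands in for my file):

```python
import numpy as np
import loompy

# Sketch of the slow access pattern; "data.loom" is a placeholder filename.
with loompy.connect("data.loom") as ds:
    # 100 random column indices, sorted because HDF5 fancy indexing
    # requires indices in increasing order
    cols = np.sort(np.random.choice(ds.shape[1], size=100, replace=False))
    batch = ds[:, cols]  # fancy indexing along columns: very slow
```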

Thank you, Ramon

slinnarsson commented 3 years ago

Hi - yes, that's a weakness of the HDF5 format. Under the hood, the data is chunked (I think we use 64x64), so each of your 100 random columns likely lands in a different chunk, and loading 100 columns really means reading 100 chunks, i.e. 6400 columns' worth of data.

One suggestion would be to first permute the columns (using the permute() method and a random permutation vector) and then to sample sets of adjacent columns. The permutation will take a long time, but you do it only once.
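Something along these lines (an untested sketch; "data.loom" is a placeholder, and it assumes the file is writeable since permute() rewrites the matrix on disk):

```python
import numpy as np
import loompy

# One-time step: randomly permute the columns on disk.
# This rewrites the whole matrix, so it can take a long time on ~3M columns.
with loompy.connect("data.loom") as ds:
    ordering = np.random.permutation(ds.shape[1])
    ds.permute(ordering, axis=1)  # axis=1 permutes columns

# Afterwards, adjacent columns are already a random sample of the data,
# so a contiguous slice is both random and chunk-aligned (fast to read).
with loompy.connect("data.loom") as ds:
    start = np.random.randint(0, ds.shape[1] - 100)
    batch = ds[:, start:start + 100]  # contiguous read touches few chunks
```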

For really high performance, the best option is likely to create a dense raw matrix on disk and use numpy memory-mapped arrays (https://numpy.org/doc/stable/reference/generated/numpy.memmap.html). That should let you read at near the speed of the disk.
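For example (an untested sketch; the shape, dtype, and file names are placeholders taken from the numbers in this issue):

```python
import numpy as np
import loompy

n_rows, n_cols = 60_000, 3_000_000  # placeholder shape from the issue

# One-time export (needs n_rows * n_cols * itemsize bytes of disk space).
# Stored as (n_cols, n_rows) so each original column is contiguous on disk.
out = np.memmap("data.raw", dtype="float32", mode="w+",
                shape=(n_cols, n_rows))
with loompy.connect("data.loom") as ds:
    step = 1024
    for i in range(0, n_cols, step):
        j = min(i + step, n_cols)
        out[i:j] = ds[:, i:j].T  # fill in contiguous (chunk-friendly) batches
out.flush()

# Fast random access afterwards:
mm = np.memmap("data.raw", dtype="float32", mode="r",
               shape=(n_cols, n_rows))
idx = np.random.choice(n_cols, size=100, replace=False)
batch = mm[idx]  # shape (100, n_rows); each row is one original column
```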