rvinas opened this issue 3 years ago
Hi - yes, that's a weakness of the HDF5 format. Under the hood, the data is chunked (I think we use 64x64 chunks), so each of your 100 random columns pulls in a whole 64-column-wide chunk, and loading 100 columns really means reading ~6,400 columns' worth of data.
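If you want to verify the chunk shape for your particular file, it's easy to check with h5py, since a loom file is just HDF5 with the main matrix stored at `/matrix` (quick sketch; the file name is a placeholder):

```python
# Inspect the chunk shape of the main matrix in a loom file.
# "data.loom" is a placeholder path.
import h5py

with h5py.File("data.loom", "r") as f:
    print(f["matrix"].chunks)  # e.g. (64, 64)
```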
One suggestion would be to first permute the columns (using the permute() method and a random permutation vector) and then to sample sets of adjacent columns. The permutation will take a long time, but you do it only once.
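Something along these lines should work (an untested sketch; the file name is a placeholder):

```python
# One-time shuffle of the columns, so that contiguous slices behave like random samples.
import numpy as np
import loompy

with loompy.connect("data.loom") as ds:
    ordering = np.random.permutation(ds.shape[1])
    ds.permute(ordering, axis=1)  # slow, but only needed once

# Afterwards, reading 100 adjacent columns is a single contiguous (fast) read
with loompy.connect("data.loom") as ds:
    start = np.random.randint(0, ds.shape[1] - 100)
    batch = ds[:, start:start + 100]
```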
For really high performance, the best option is likely to create a dense raw matrix on disk and use numpy memory-mapped arrays (https://numpy.org/doc/stable/reference/generated/numpy.memmap.html). That should let you read at close to the speed of the disk.
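A rough sketch of that workflow (assumptions: float32 data, placeholder file names, and Fortran order so that columns are contiguous on disk; note that a dense float32 copy of a ~60k x ~3M matrix is roughly 720 GB):

```python
# One-time export of the loom matrix to a raw binary file, then fast column
# access via a memory-mapped array. Paths, dtype, and block size are assumptions.
import numpy as np
import loompy

with loompy.connect("data.loom") as ds:
    rows, cols = ds.shape
    # order="F" stores columns contiguously, so column reads are sequential on disk
    mmap = np.memmap("matrix.dat", dtype="float32", mode="w+",
                     shape=(rows, cols), order="F")
    step = 1024  # a multiple of the 64-column chunk width
    for start in range(0, cols, step):
        stop = min(start + step, cols)
        mmap[:, start:stop] = ds[:, start:stop]
    mmap.flush()

# Later, in the training loop: read random columns without going through HDF5
mmap = np.memmap("matrix.dat", dtype="float32", mode="r",
                 shape=(rows, cols), order="F")
random_cols = np.random.choice(cols, size=100, replace=False)
batch = mmap[:, random_cols]
```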
Hello,

I have a dataset `ds` with ~60k rows and ~3 million columns. I'd like to retrieve certain columns (e.g. at most 100 at once), but the indexing is way too slow (e.g. `ds[:, list_with_100_random_indices]`). What is the recommended way to sample data efficiently from the dataset? Otherwise, is there a workaround (perhaps not using loompy)? This would be really useful for training machine learning models.

Thank you, Ramon