ivirshup opened this issue 11 months ago:
Some benchmarking results using more fixed inputs:
| fmt | compression | chunksize | index_type | time |
|---|---|---|---|---|
| h5 | None | 100000 | slices | 1.35 |
| zarr | None | 100000 | slices | 1.42 |
| h5 | None | 10000 | slices | 1.57 |
| zarr | lz4 | 100000 | slices | 1.6 |
| zarr | None | 10000 | slices | 2.53 |
| h5 | lz4 | 100000 | slices | 3.15 |
| zarr | lz4 | 10000 | slices | 3.33 |
| h5 | lz4 | 10000 | slices | 4.54 |
| h5 | None | 100000 | bool | 5.32 |
| h5 | None | 10000 | bool | 5.47 |
| h5 | lz4 | 100000 | bool | 6.93 |
| h5 | lz4 | 10000 | bool | 8.4 |
| zarr | None | 10000 | bool | 15.8 |
| zarr | None | 100000 | bool | 21.3 |
| zarr | lz4 | 10000 | bool | 21.4 |
| zarr | lz4 | 100000 | bool | 41.5 |
I just want to give a reason why this slows down in general when using a mask vs. a slice vs. an integer index (i.e., for both `zarr` and `h5`):
Internally, `scipy` intercepts our mask here and converts it to an integer index (or at least that is what seems to happen) before actually indexing. This explains why the performance of the two (boolean and integer index) is the same. Something else I noticed is that we even do our own internal conversion here, but this is not actually called in this demo because `scipy` does the same thing first, as noted.
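A quick way to see this equivalence in-memory (a toy sketch, not the backed code path; the data here is made up):

```python
import numpy as np
import scipy.sparse as ss

# Toy example: boolean-mask row indexing and the equivalent integer
# indexing return the same matrix, consistent with scipy converting
# the mask to integer indices internally.
x = ss.random(100, 10, density=0.1, format="csr", random_state=0)
mask = np.zeros(100, dtype=bool)
mask[::3] = True
assert (x[mask] != x[np.flatnonzero(mask)]).nnz == 0
```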
Now I'll try to explain why the performance of integer indices is so bad, since that is what is actually going on under the hood. The problem appears to be on our end, from calling `get_compressed_vectors`, where the following occurs, operating on the integer indices (as noted):

```python
slices = [slice(*(x.indptr[i : i + 2])) for i in row_idxs]
```

So of course this is going to be very slow: creating a tiny slice for every entry in `row_idxs` and then accessing `data`/`indices` once per slice is not efficient.
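A toy illustration of why this access pattern hurts (hypothetical file name; not the actual anndata code path): many tiny reads from a dataset pay a per-call overhead that one large read covering the same elements does not.

```python
import timeit

import h5py
import numpy as np

with h5py.File("overhead_demo.h5", "w") as f:
    dset = f.create_dataset("data", data=np.arange(1_000_000))
    # many tiny slice reads vs. one read covering the same range
    many = timeit.timeit(
        lambda: [dset[i : i + 2] for i in range(0, 10_000, 2)], number=1
    )
    one = timeit.timeit(lambda: dset[0:10_000], number=1)
    print(f"many tiny reads: {many:.4f}s, one big read: {one:.5f}s")
```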
I'm looking into a solution and a heuristic now (for the slices). I think that we should go through either `mask_to_slices` or `_get_arrayXslice`, which will work for both boolean and integer indices. I will also need to benchmark to find the heuristic so that we don't tank performance, as you mentioned.
Something I forgot to mention/do is to see how this performs without our backing class. This could reveal a potential upstream fix, but I doubt it, given that the problematic methods are ones we have overridden.
> I'm looking into a solution and a heuristic now
I did have an idea about the heuristic, but also a bigger brain moment.
Is there any difference between the proposed solution and the current behaviour in the worst case scenario? Either way a long list of slices is generated then accessed.
So unless we also develop something more optimal for our worst case scenario, maybe performance won't be tanked by using this approach everywhere?
I believe a lot of the performance is lost because we need to decompress the whole chunk every time we do any indexing. So an LRU cache for the decompressed chunks could be quite useful. That's maybe a separate PR since it's a bit orthogonal, though it would also help.
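A rough sketch of what that could look like (a hypothetical wrapper, assuming the underlying store decompresses whole chunks on each read):

```python
from functools import lru_cache

import numpy as np


class ChunkCachedArray:
    """Hypothetical sketch: cache decompressed chunks of a 1-D backed
    array so repeated indexing into the same chunk skips decompression."""

    def __init__(self, dataset, chunk_size: int, maxsize: int = 32):
        self._dataset = dataset
        self._chunk_size = chunk_size
        # LRU-cache chunk reads, keyed by chunk index
        self._get_chunk = lru_cache(maxsize=maxsize)(self._read_chunk)

    def _read_chunk(self, chunk_idx: int) -> np.ndarray:
        start = chunk_idx * self._chunk_size
        return self._dataset[start : start + self._chunk_size]

    def __getitem__(self, i: int):
        chunk = self._get_chunk(i // self._chunk_size)
        return chunk[i % self._chunk_size]
```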
I think you showed there was a difference so I believe that something is going on. I tried to find out what but only have some suspicions.
I think the difference is the number of calls to `csr_matrix` - there's a non-trivial time overhead to creating these, and the functions `get_compressed_vectors` and `get_compressed_vector` seem to have different speed profiles (i.e., `get_compressed_vectors != n * get_compressed_vector` for `n` accesses via `get_compressed_vectors`).
So when we use the proposed solution, we are making many small `csr_matrix` calls. For example, when I do one `X_h5[mask_to_slices(mask_alternating)[0]]` call, I timed the underlying `get_compressed_vector` and the subsequent `ss.csr_matrix` call in the code - the `ss.csr_matrix` takes ~1.5x as long. Whereas in the current case, there is only one `ss.csr_matrix` call (which can take longer, but not on the scale of the number of calls we make).
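To put a number on that construction overhead, a hypothetical micro-benchmark (the sizes here are made up):

```python
import timeit

import numpy as np
import scipy.sparse as ss

# Per-call cost of wrapping existing (data, indices, indptr) arrays in a
# csr_matrix, i.e. the overhead paid once per slice in the proposed path.
data = np.ones(100)
indices = np.arange(100)
indptr = np.array([0, 100])
t = timeit.timeit(
    lambda: ss.csr_matrix((data, indices, indptr), shape=(1, 1000)),
    number=10_000,
)
print(f"~{t / 10_000 * 1e6:.1f} µs per csr_matrix construction")
```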
There also seems to be some advantage to doing the list comprehension over the arrays themselves, as opposed to at the top level (i.e., the proposed solution), but I can't reproduce this. I suspect there is some sort of hot-start/cold-start thing going on, since some random accesses were much worse than others (perhaps what you were saying about caching?). What I can say for sure is that `csr_matrix` in the best case doubles/triples the needed time, and that the list comprehension within `get_compressed_vectors` is also likely giving performance speedups (but I don't know why).
TL;DR There is definitely a difference.
Just to pull back a bit (for myself):
I have a little prototype demo locally that does not completely tank performance in the worst case (but is still worse). Maybe it's a starting point.
In terms of the heuristic for when to stop using our improvement, performance stops completely tanking pretty quickly:
```python
for max_cont in range(2, 100):
    mask_alternating = np.zeros(X_zarr.shape[0], dtype=bool)
    for i in range(max_cont - 1):
        mask_alternating[i::max_cont] = True
    print('solution with slices of size ', max_cont - 1, ' contiguous chunks')
    %time l = [X_zarr[s] for s in mask_to_slices(mask_alternating)]
    print('current with slices of size ', max_cont - 1, ' contiguous chunks')
    %time l = X_zarr[mask_alternating]
```
So perhaps an average size of 5 or 10 for the contiguous chunks? Or maybe some other summary statistic? In any case this should give a decent guide. I'll push my local branch soon.
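For example, a hypothetical bail-out check along those lines (the function name and threshold are placeholders, not the actual branch):

```python
import numpy as np


def should_use_slices(mask: np.ndarray, min_mean_run: float = 10.0) -> bool:
    """Hypothetical heuristic: take the slice-based path only when the
    contiguous runs of True are long enough on average."""
    n_true = int(mask.sum())
    if n_true == 0:
        return False
    # number of contiguous True runs = number of False->True transitions
    padded = np.concatenate([[False], mask])
    n_runs = int((padded[1:] & ~padded[:-1]).sum())
    return n_true / n_runs >= min_mean_run
```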
> Is there any difference between the proposed solution and the current behaviour in the worst case scenario? Either way a long list of slices is generated then accessed.

> So when we use the proposed solution, we are making many small `csr_matrix` calls.
We could probably write the proposed solution to put together a matrix from the underlying arrays, avoiding all the intermediate construction of sparse matrices. Then these cases should be equivalent?
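Something like the following (a hypothetical sketch operating directly on the `data`/`indices`/`indptr` arrays; not the actual code):

```python
import numpy as np
import scipy.sparse as ss


def rows_from_slices(data, indices, indptr, row_slices, n_cols):
    """Hypothetical sketch: build one csr_matrix straight from the
    backing arrays for a list of row slices, instead of constructing an
    intermediate sparse matrix per slice and vstacking the results."""
    data_parts, index_parts, run_lens = [], [], []
    for s in row_slices:
        start, stop = indptr[s.start], indptr[s.stop]
        data_parts.append(data[start:stop])
        index_parts.append(indices[start:stop])
        # per-row nnz counts for the rows covered by this slice
        run_lens.append(np.diff(indptr[s.start : s.stop + 1]))
    nnz_per_row = np.concatenate(run_lens)
    new_indptr = np.concatenate([[0], np.cumsum(nnz_per_row)])
    return ss.csr_matrix(
        (np.concatenate(data_parts), np.concatenate(index_parts), new_indptr),
        shape=(len(nnz_per_row), n_cols),
    )
```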
## The problem
Indexing statements for backed anndata objects often look like:
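```python
# A representative sketch (hypothetical file and column names):
import anndata

adata = anndata.read_h5ad("data.h5ad", backed="r")
subset = adata[adata.obs["condition"] == "treated"]
```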
Notably, data is often ordered such that categories of interest are actually colocalized, e.g. data is sorted by patient or condition.
The indexing statement above generates a boolean mask or integer-array indexer, which can unfortunately be slow, especially with zarr-backed matrices.
### Zarr benchmarks
For example:
## Potential solution
However, if we simply translate the mask into contiguous slices:
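```python
import numpy as np


def mask_to_slices(mask: np.ndarray) -> list[slice]:
    """Convert a boolean mask into slices over its contiguous runs of
    True (an assumed reconstruction; the actual prototype may differ)."""
    padded = np.concatenate([[False], mask, [False]])
    # positions where the padded mask flips between False and True
    flips = np.flatnonzero(padded[1:] != padded[:-1])
    # consecutive flip pairs delimit the runs of True
    return [slice(start, stop) for start, stop in zip(flips[::2], flips[1::2])]

# e.g. reading via contiguous slices instead of a mask:
# sparse.vstack([X_zarr[s] for s in mask_to_slices(mask)])
```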
We see significant speed-ups, even accounting for the overhead (and an admittedly inefficient quick implementation; we should be able to avoid constructing so many intermediate matrices). Of course, the results are all the same:
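A sketch of the kind of check meant here (variable names assumed from the h5py demo below):

```python
# all three indexing paths should produce identical sparse matrices
assert (as_mask != as_slices).nnz == 0
assert (as_indices != as_slices).nnz == 0
```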
### HDF5 benchmarks
HDF5 has similar behaviour, but is considerably faster with no optimization:
#### Demo using h5py + same matrices
```python
import h5py
import numpy as np
from scipy import sparse

from anndata.experimental import sparse_dataset, write_elem

h5_group = h5py.File("demo.h5", mode="w")
# using compression for a fair-ish comparison
write_elem(h5_group, "X", X, dataset_kwargs={"compression": "lzf"})
X_h5 = sparse_dataset(h5_group["X"])
X_h5
```

```python
%time as_mask = X_h5[mask]
# CPU times: user 1.77 s, sys: 11.8 ms, total: 1.78 s
# Wall time: 1.79 s
```

```python
# converting to integer indices
%time as_indices = X_h5[np.where(mask)]
# CPU times: user 1.92 s, sys: 31 ms, total: 1.96 s
# Wall time: 1.96 s
```

```python
# Converting to slices
%time as_slices = sparse.vstack([X_h5[s] for s in mask_to_slices(mask)])
# CPU times: user 59 ms, sys: 15.7 ms, total: 74.7 ms
# Wall time: 74.7 ms
```

## Proposal
We should make some effort to optimize our indexing into stores. The solution above works well for boolean masks. The overhead of calling the function is also very low (~0.1% of the optimized operation).
## Complications
Heuristics are needed for choosing when to bail out of operations:
The above optimization can actually tank performance in worst case scenarios:
But the slices are very cheap to compute. So I would suggest we look at the slices generated and choose a path based on some threshold.
### Integer based indexing is more complicated
For integer based indexing we will have multiple additional sorting steps. These are significantly more expensive than finding runs of `bool`s (a sketch of the already-sorted case is below).
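For illustration, a hypothetical helper for converting a sorted integer index into slices; the argsort and reorder passes needed for unsorted input are the extra sorting cost referred to above:

```python
import numpy as np


def indices_to_slices(idx: np.ndarray) -> list[slice]:
    """Hypothetical sketch for the already-sorted case: convert a sorted
    integer index into slices over its consecutive runs. Unsorted input
    would first need an argsort, plus a pass afterwards to restore the
    requested row order."""
    idx = np.asarray(idx)
    if idx.size == 0:
        return []
    # break points where consecutive indices are not adjacent
    breaks = np.flatnonzero(np.diff(idx) != 1) + 1
    starts = np.concatenate([[0], breaks])
    stops = np.concatenate([breaks, [idx.size]])
    return [slice(int(idx[a]), int(idx[b - 1]) + 1) for a, b in zip(starts, stops)]
```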
## Alternatives?
### Upstreaming
Some performance enhancements could be upstreamed. Ideally enough to get zarr on par with hdf5. However, indexing into the sparse data structures is very much something we should be handling directly.
### Alternative optimization strategies
There are likely even more effective optimization strategies we could pursue. These are not strictly alternatives, however. These include:
### Demo using real data (TODO, make available)
Not as much of a performance boost, but it does put zarr in line with h5py. Not totally sure why.
cc: @ilan-gold