Open adamgayoso opened 4 years ago
Huh, that's interesting. Was this causing any issues for you?
I think it's okay that the layer is column major, since that's default numpy behaviour.
>>> np.ones((5, 5))[:, [1, 3]].flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
It's weird that .X is treated differently.
What's happening is that .copy is being called on a numpy array, which changes the order to row major. It would be good to avoid that extra copy, but the easy change here breaks when X is a dask array.
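For anyone following along, here is a minimal sketch of the difference described above, using plain numpy only (no anndata). The variable names are made up for illustration. It relies on the documented defaults: the ndarray.copy method uses order="C", while the np.copy function uses order="K" and keeps the existing layout.

```python
import numpy as np

a = np.ones((5, 5))      # row major (C order) by default
sub = a[:, [1, 3]]       # fancy-index two columns

# Layout of the indexed result, as reported by numpy itself.
print("indexed:  C =", sub.flags.c_contiguous, " F =", sub.flags.f_contiguous)

# The ndarray.copy method defaults to order="C", so the copy is row major.
method_copy = sub.copy()
print("sub.copy(): C =", method_copy.flags.c_contiguous)

# The np.copy function defaults to order="K", which keeps the source layout.
function_copy = np.copy(sub)
print("np.copy(sub): C =", function_copy.flags.c_contiguous,
      " F =", function_copy.flags.f_contiguous)
```

So whether the extra copy changes the layout depends on which copy call is used, not only on how the array was indexed.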
As for the ArrayView class, I didn't actually write that so I'm not sure what the intent was. @falexwolf, do you have some insight here?
Thanks for clarifying. Basically, we wrote a pytorch data loader on top of anndata for the new scvi-tools (which allows using X, layers, or raw), and "C" order is a lot faster for this purpose. We haven't quite figured out whether the speed difference arises during data loading or afterwards (it could be the matrix multiplication once in the model), but I think it's the latter.
It sounds like we should check the order and run something like adata.X = np.asarray(adata.X, order="C") (resp. adata.layers[..] = np.asarray(....)) in our package?
> It sounds like we should check the order and run something like adata.X = np.asarray(adata.X, order="C")
Short answer, yes. I'm not really happy with our/numpy's behaviour here (I'd prefer that the order was maintained), but I don't see an obvious way to do that.
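A rough sketch of what such a check could look like on the package side. The helper name ensure_c_order is hypothetical (it is not part of anndata or scvi-tools), and it deliberately leaves non-numpy inputs such as sparse matrices or dask arrays untouched.

```python
import numpy as np


def ensure_c_order(arr):
    """Return `arr` in row-major layout, copying only when needed.

    Hypothetical helper for illustration; not an anndata or scvi-tools API.
    """
    if isinstance(arr, np.ndarray) and not arr.flags.c_contiguous:
        return np.ascontiguousarray(arr)
    # Already C-contiguous, or not a numpy array (e.g. sparse, dask): leave as is.
    return arr


# Stand-in for a column-major layer.
f_arr = np.asfortranarray(np.arange(6.0).reshape(2, 3))
c_arr = ensure_c_order(f_arr)
print(f_arr.flags.c_contiguous, c_arr.flags.c_contiguous)

# With an AnnData object this might be applied as (untested here):
# adata.X = ensure_c_order(adata.X)
# for key in list(adata.layers):
#     adata.layers[key] = ensure_c_order(adata.layers[key])
```

Checking the flag first avoids an extra copy when the data is already row major.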
> and "C" order is a lot faster for this purpose
That's a good point. I think most access to data in an AnnData object is likely to be for reads, and it would make sense to keep the ordering consistent. The more I think about it, the stranger it is that numpy returns column major arrays when you index a row major array. It also looks like we could actually get better performance from maintaining row order (some examples here https://github.com/numpy/numpy/issues/9450#issuecomment-468478381).
I see three approaches we can take here:
I think 1 and 2 are both reasonable options. Option 1 lets us not make any decisions, but has the potential for poor/unintuitive performance. Option 2 puts the onus of performance on the user, but is more predictable.
Unfortunately, I'm not sure there's a good way to implement option 2 without writing a lot of handling code. I spent more time on this than I should have, and found that performance isn't really that predictable from various indexing methods. For what is supposed to be the fastest method (a[:, idx], not always fastest in practice), numpy does not promise the order of the output. For the method which has better performance in the example case above (np.take), numpy does not prioritize performance and probably won't improve it.
Here are some basic benchmark results for indexing 2000 columns from a 10,000 x 10,000 array.
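The numbers themselves are not reproduced here, but a benchmark along these lines could be sketched as follows. This is not the script used for the results above: the function name bench, the candidate set, and the default sizes (scaled down to 200 columns from a 1,000 x 1,000 array so it runs quickly) are assumptions for illustration. Timings will vary by machine and numpy version.

```python
import timeit

import numpy as np


def bench(n_rows=1000, n_cols=1000, n_idx=200, repeat=5, seed=0):
    """Time a few ways of selecting columns; returns best-of-`repeat` seconds."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_rows, n_cols))  # row-major input
    idx = np.sort(rng.choice(n_cols, size=n_idx, replace=False))

    candidates = {
        "a[:, idx]": lambda: a[:, idx],
        "np.take(a, idx, axis=1)": lambda: np.take(a, idx, axis=1),
        # Indexing followed by an explicit conversion to row major.
        "np.ascontiguousarray(a[:, idx])": lambda: np.ascontiguousarray(a[:, idx]),
    }
    return {
        name: min(timeit.repeat(fn, number=1, repeat=repeat))
        for name, fn in candidates.items()
    }


results = bench()
for name, seconds in results.items():
    print(f"{name:35s} {seconds * 1e3:8.3f} ms")
```

Passing n_rows=10_000, n_cols=10_000, n_idx=2000 matches the sizes mentioned above, at the cost of roughly 800 MB for the input array.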
Is
> The current approach, we try to defer to what the underlying array library would do.

the current approach? It seems that, since .copy is being called on X as a numpy array and the order argument is being improperly passed in the ArrayView copy method, the intention was to always produce "C" order?
Like, since the default behavior of copying a numpy array is for "C" order output, why should the ArrayView object be any different?
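To make the question concrete, here is a sketch of a copy method that forwards order and defaults to "C", matching ndarray.copy. ViewSketch is a hypothetical stand-in written for this comment; it is not anndata's ArrayView and says nothing about how that class is actually implemented.

```python
import numpy as np


class ViewSketch(np.ndarray):
    """Hypothetical ndarray subclass; NOT anndata's ArrayView."""

    def copy(self, order="C"):
        # Forward `order` to ndarray.copy, then drop the subclass so the
        # caller receives a plain ndarray.
        return super().copy(order=order).view(np.ndarray)


# Column-major data viewed through the subclass.
base = np.asfortranarray(np.arange(12.0).reshape(3, 4))
view = base.view(ViewSketch)

default_copy = view.copy()           # row major, same default as ndarray.copy
kept_copy = view.copy(order="K")     # keeps the column-major layout
print(default_copy.flags.c_contiguous, kept_copy.flags.f_contiguous)
```

With this shape, callers get row-major output unless they explicitly ask for something else.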
Is this desired behavior?
Is the layer being properly copied?
Also, should order be doing something here? https://github.com/theislab/anndata/blob/86311e36e499b615a30709b3f0f85940d4f3a629/anndata/_core/views.py#L82-L84