scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

copy of anndata view -> "F" ordered layer but "C" ordered X? #433

Open adamgayoso opened 4 years ago

adamgayoso commented 4 years ago

Is this desired behavior?

In [50]: adata = anndata.AnnData(np.arange(16).reshape(4, 4))
In [52]: adata.layers["test"] = adata.X.copy()
In [54]: bdata = adata[:, [1, 3]].copy()
In [55]: bdata.X.flags
Out[55]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
In [56]: bdata.layers["test"].flags
Out[56]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

Is the layer being properly copied?

Also, should order be doing something here? https://github.com/theislab/anndata/blob/86311e36e499b615a30709b3f0f85940d4f3a629/anndata/_core/views.py#L82-L84
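A purely hypothetical sketch of what "doing something" might mean, assuming the method at that link takes an `order` argument that currently isn't forwarded (the real class in anndata/_core/views.py carries extra view/parent bookkeeping, so this is not the actual implementation):

```python
import numpy as np

class ArrayView(np.ndarray):
    # Hypothetical sketch only -- not the actual anndata implementation.
    def copy(self, order: str = "C") -> np.ndarray:
        # Forwarding `order` would make the argument meaningful instead of
        # silently falling back to numpy's default "C" layout.
        return np.array(self, order=order, copy=True)
```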

ivirshup commented 4 years ago

Huh, that's interesting. Was this causing any issues for you?

I think it's okay that the layer is column major, since that's default numpy behaviour.

>>> np.ones((5, 5))[:, [1, 3]].flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

It's weird that .X is treated differently.

What's happening is that .copy is being called on a numpy array, which changes the order to row major. It would be good to avoid that extra copy, but the easy change here breaks when X is a dask array.

https://github.com/theislab/anndata/blob/77a633cc825b428d02dabdf75491841c3a175a8d/anndata/_core/anndata.py#L1423-L1430
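As a quick standalone illustration of that copy behaviour (plain numpy, no AnnData machinery): `ndarray.copy()` defaults to `order="C"`, while `order="K"` keeps the source layout.

```python
import numpy as np

f_arr = np.ones((4, 4), order="F")               # column-major array
print(f_arr.flags.f_contiguous)                  # True
print(f_arr.copy().flags.c_contiguous)           # True: ndarray.copy() defaults to order="C"
print(f_arr.copy(order="K").flags.f_contiguous)  # True: "K" matches the source layout
```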


As for the ArrayView class, I didn't actually write that, so I'm not sure what the intent was. @falexwolf, do you have some insight here?

adamgayoso commented 4 years ago

Thanks for clarifying. Basically, we wrote a PyTorch data loader on top of AnnData for the new scvi-tools (which allows using X, layers, or raw) and "C" order is a lot faster for this purpose. We haven't quite figured out whether the speedup happens during data loading or afterwards (it could be the matrix multiplication once inside the model), but I think it's the latter.
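For context, a rough illustration of why row-major layout helps a row-wise minibatch loader: pulling a batch of rows from a C-ordered array reads contiguous memory, while the same access on an F-ordered array is strided. Sizes and timings here are illustrative only, not taken from the scvi-tools loader.

```python
import numpy as np
from timeit import timeit

c_arr = np.random.rand(20_000, 1_000)                          # row-major ("C") cells x genes
f_arr = np.asfortranarray(c_arr)                               # same values, column-major ("F")
batch = np.random.choice(c_arr.shape[0], 128, replace=False)   # a minibatch of row indices

# Fancy-indexing rows copies contiguous blocks for C order, strided reads for F order.
print("C order:", timeit(lambda: c_arr[batch], number=100))
print("F order:", timeit(lambda: f_arr[batch], number=100))
```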

It sounds like we should check the order and run something like adata.X = np.asarray(adata.X, order="C") (resp. adata.layers[..] = np.asarray(....)) in our package?

ivirshup commented 4 years ago

It sounds like we should check the order and run something like adata.X = np.asarray(adata.X, order="C")

Short answer: yes. I'm not really happy with our/numpy's behaviour here (I'd prefer the order were maintained), but I don't see an obvious way to do that.
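A minimal sketch of that workaround for dense arrays, applied to the `bdata` from the reproduction at the top of the issue (`np.ascontiguousarray` is a no-op when the array is already C-contiguous, so the order check comes for free):

```python
import numpy as np

def to_row_major(x):
    # Returns x unchanged if it is already C-contiguous, otherwise makes a C-ordered copy.
    return np.ascontiguousarray(x)

bdata.X = to_row_major(bdata.X)
for key in list(bdata.layers):
    bdata.layers[key] = to_row_major(bdata.layers[key])
```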

and "C" order is a lot faster for this purpose

That's a good point. I think most access to data in an AnnData object is likely to be for reads, and it would make sense to keep the ordering consistent. The more I think about it, the stranger it is that numpy returns a column-major array when you fancy-index the columns of a row-major one. It also looks like we could actually get better performance by maintaining row order (some examples here: https://github.com/numpy/numpy/issues/9450#issuecomment-468478381).

I see three approaches we can take here:

  1. The current approach: we try to defer to what the underlying array library would do.
  2. We try to maintain layout. That is, if it's row(/column) major, we try to make the result of an indexing operation row(/column) major.
  3. We try to optimize. This is a bit difficult to define, since optimizing individual operations may slow downstream tasks.

I think 1 and 2 are both reasonable options. Option 1 lets us avoid making any decisions, but has the potential for poor/unintuitive performance. Option 2 puts the onus of performance on the user, but is more predictable.

Unfortunately, I'm not sure there's a good way to implement option 2 without writing a lot of handling code. I spent more time on this than I should have, and found that performance isn't really predictable across the various indexing methods. For what is supposed to be the fastest method (a[:, idx], not always the fastest in practice), numpy does not promise the order of the output. For the method which performs better in the example case above (np.take with a preallocated out array), numpy doesn't prioritize its performance and probably won't improve it.
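Concretely, the kind of handling code option 2 would need might look like this rough helper (same np.take-with-out approach as the benchmark below; a hypothetical sketch, not anndata API):

```python
import numpy as np

def take_cols_keep_layout(a, idx):
    # Index columns of a 2-D array while keeping the input's memory layout.
    order = "F" if a.flags.f_contiguous else "C"   # fall back to "C" for non-contiguous inputs
    out = np.empty((a.shape[0], len(idx)), dtype=a.dtype, order=order)
    np.take(a, idx, axis=1, out=out, mode="wrap")  # "wrap" avoids the buffered "raise" path
    return out
```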

Here are some basic benchmark results for indexing 2000 columns from a 10,000 x 10,000 array.

Benchmark results:

|    | func                   | input_order | output_order | time                                                                    |
|---:|:-----------------------|:------------|:-------------|:------------------------------------------------------------------------|
|  0 | index_notation         | C           | F            | 626 ms ± 92.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     |
|  1 | index_notation         | F           | F            | 36.2 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  |
|  2 | index_take_wrap        | C           | C            | 191 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   |
|  3 | index_take_wrap        | F           | C            | 1.54 s ± 74.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     |
|  4 | index_take_wrap_colmjr | C           | F            | 491 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     |
|  5 | index_take_wrap_colmjr | F           | F            | 1.82 s ± 79.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     |
Benchmark code:

```python
import numpy as np
import pandas as pd
from itertools import product


def index_notation(a, idx):
    return a[:, idx]


def index_take(a, idx, **kwargs):
    out = np.empty(shape=(len(a), len(idx)), dtype=a.dtype)
    np.take(a, idx, axis=1, out=out, **kwargs)
    return out


def index_take_colmjr(a, idx):
    out = np.empty(shape=(len(a), len(idx)), dtype=a.dtype, order="F")
    np.take(a, idx, axis=1, out=out)
    return out


def index_take_wrap(a, idx):
    out = np.empty(shape=(len(a), len(idx)), dtype=a.dtype)
    np.take(a, idx, axis=1, out=out, mode="wrap")
    return out


def index_take_wrap_colmjr(a, idx):
    out = np.empty(shape=(len(a), len(idx)), dtype=a.dtype, order="F")
    np.take(a, idx, axis=1, out=out, mode="wrap")
    return out


funcs = [
    index_notation,
    index_take,
    index_take_colmjr,
    index_take_wrap,
    index_take_wrap_colmjr,
]

c_array = np.random.randint(0, 100, (10000, 10000))
f_array = c_array.copy(order="F")
idx = np.random.choice(np.arange(c_array.shape[1], dtype=int), 2000, replace=False)

results = []
for func, array in product(funcs, [c_array, f_array]):
    time_rec = %timeit -o func(array, idx)  # IPython magic; run inside IPython/Jupyter
    record = {
        "func": func.__name__,
        "input_order": ["F", "C"][array.flags.c_contiguous],
        "output_order": ["F", "C"][func(array, idx).flags.c_contiguous],
        "time": time_rec,
    }
    results.append(record)

print(pd.DataFrame(results).to_markdown())
```
adamgayoso commented 4 years ago

Is

The current approach: we try to defer to what the underlying array library would do.

the current approach? It seems like, since .copy() of X as a numpy array is being called and the order argument isn't actually passed through in the ArrayView copy method, the intention was to always produce "C" order?

Like, since the default behavior of copying a numpy array is to produce "C"-ordered output, why should the ArrayView object be any different?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!