ilibarra opened this issue 2 years ago
I'm all for this, but I think it requires a new API for indexing. How do we make sure the result of `adata[[1,2,3]]` is unambiguous if the labels are integer valued? xarray and pandas support this with indexing attributes: `df.iloc[...]` vs `df.loc[...]` for pandas and `da[...]` vs `da.loc[...]` for xarray.

The hard part here is backwards compatibility.
We would also need this for single nucleotide resolution data. Tagging @mffrank here.
@ivirshup, maybe one way to do that would be to only allow a range index. This way, e.g. `adata[np.array([0, 2])]` would mean getting the first and the third obs, irrespective of whether you think those are names or indices. As range indices were not possible before, that should be ok with backwards compatibility.

The idea behind this is not that we need to use integers for indexing, but rather that we don't want to have string indices in some use cases.
Previous discussion: https://github.com/scverse/anndata/issues/311
Thanks, @ivirshup, I was just going to write that it seems like one can even use `RangeIndex` already, and we just need to figure out (a) the places where the unnecessary `RangeIndex` -> `str` conversion should be avoided and (b) (de)serialisation (which should be easier than before since #555).
```python
import numpy as np
import pandas as pd
from anndata import AnnData

df = pd.DataFrame(np.random.normal(size=(100, 10)))
adata = AnnData(df)
# => anndata/_core/anndata.py:120: ImplicitModificationWarning: Transforming to str index.
# =>   warnings.warn("Transforming to str index.", ImplicitModificationWarning)
adata.obs_names = df.index
adata.obs_names
# => RangeIndex(start=0, stop=100, step=1)
adata[[0, 9, 99]]
# => View of AnnData object with n_obs × n_vars = 3 × 10
```
@LucaMarconato, this is relevant for speeding up reading of points for FISH-like data
My current thinking here is just to copy the API of xarray as much as possible. The idea is to eventually move all label-based indexing to a `.loc` attribute, and keep `[]` for positional indexing. So, we deprecate indexing with `[]` when it's label based. However, this would not yet allow label-based indexing by integers, as any current calls to `[]` that used positional indexing but were passed an AnnData with integer indexes would have incorrect behavior.

The most important goal is to do this in a way that causes the fewest bugs possible. However, we would also like to quickly get to allowing integer-based indexes. We could have a significant period of time where using labels when indexing with `[]` throws an error. This should give downstream packages enough time to update. Then we start allowing integer indices, assuming all label-based indexing code is using `.loc`.
**`iloc`**

At the same time as `.loc` we could introduce `.iloc` for positional indexing. This would get us to integer indices much faster. For any ambiguous cases (e.g. where axes are integer valued) we could throw errors for `[]`.

We then need to decide if we want to make this permanent. In that case we have effectively made `[]` indexing something to avoid, as whether it works depends on the contents of the AnnData. After some period of time, we could then start allowing positional indexing via `[]` again, and deprecate `.iloc`. While I don't love the idea of an intentionally temporary API, removing `iloc` would be a very easy find-and-replace.
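The two-accessor design described above can be sketched with a toy class; `ToyData`, `_Loc`, and `_ILoc` are hypothetical stand-ins for illustration, not AnnData's actual classes:

```python
import numpy as np

class _Loc:
    """Label-based accessor (hypothetical, mirroring pandas/xarray .loc)."""
    def __init__(self, obj):
        self._obj = obj

    def __getitem__(self, labels):
        # Translate labels to positions via the name index.
        names = list(self._obj.obs_names)
        positions = [names.index(label) for label in np.atleast_1d(labels)]
        return self._obj._take(positions)

class _ILoc:
    """Positional accessor (hypothetical, mirroring pandas .iloc)."""
    def __init__(self, obj):
        self._obj = obj

    def __getitem__(self, positions):
        return self._obj._take(np.atleast_1d(positions))

class ToyData:
    """Stand-in for AnnData with integer obs_names, showing the dispatch."""
    def __init__(self, obs_names, X):
        self.obs_names = list(obs_names)
        self.X = np.asarray(X)

    loc = property(lambda self: _Loc(self))
    iloc = property(lambda self: _ILoc(self))

    def _take(self, positions):
        return self.X[np.asarray(positions)]

t = ToyData(obs_names=[3, 4, 5], X=[30, 40, 50])
t.iloc[[0]]  # position 0 -> array([30])
t.loc[[3]]   # label 3    -> array([30])
```

With integer-valued `obs_names` like these, `t[[0]]` would be exactly the ambiguous case; the two accessors make the intent explicit.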
@ivirshup, my current thinking is to try to stay away from `.loc` and explore other alternatives. While `.loc` could help to make it more pandas-like, it doesn't seem to me like that should be a goal here.

With the goal that you mentioned in mind, one alternative discussed recently might be to introduce an integer index and a new index class that allows using it instead of string-based indices. I.e. `adata[[0, 1, 2]]` would still be subsetting by row numbers, while `adata[Index(3, 4)]` would refer to the integer index. Something like this would also require significantly less time to roll out and gather feedback, and it would be a backwards-compatible change.
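A minimal sketch of how such a marker type could disambiguate the two meanings; `Index` and `subset` are purely illustrative names, not a proposed API:

```python
class Index:
    """Hypothetical marker type: wrapped values are labels, not positions."""
    def __init__(self, *labels):
        self.labels = list(labels)

def subset(obs_names, values, key):
    # Bare integers keep their current meaning: row numbers.
    # Wrapping them in Index(...) opts in to label-based lookup,
    # so there is no ambiguity even with integer-valued obs_names.
    if isinstance(key, Index):
        key = [obs_names.index(label) for label in key.labels]
    return [values[i] for i in key]

obs_names = [3, 4, 5]               # integer-valued labels
values = ["row0", "row1", "row2"]

subset(obs_names, values, [0, 1])       # positions 0, 1 -> ["row0", "row1"]
subset(obs_names, values, Index(3, 4))  # labels 3, 4    -> ["row0", "row1"]
subset(obs_names, values, Index(5))     # label 5        -> ["row2"]
```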
A hybrid between the suggested approaches might be the addition of a new method (`.loc`-like) for the integer/range index, while using integers in `[]` to refer to row numbers without any ambiguity. (For strings there is no ambiguity to start with.) That also makes sense in practice, as an integer index essentially means that we do not care about feature names per se (but we still need some IDs in order to keep track). This is also a backwards-compatible change.

I'd be up to discuss more and maybe prototype some things!
Thanks for the suggestions!

As a minor point:

> While `.loc` could help to make it more pandas-like, it doesn't seem to me like that should be a goal here.

I'm aiming much more at xarray than pandas here. A big upside of removing this ambiguity is that we could also start using the xarray index dtypes.
I broadly like this idea. I think DimensionalData.jl takes a similar approach, where you just use different types for different kinds of indexing. I think there are a couple of downsides though:

```python
idx = adata.obs.query("celltype == 'a'").index
adata[idx]  # or adata[idx.values]
```

If `obs_names` were strings, this would be label-based indexing, but if they were integer valued this would be positional. So any function that uses this kind of selection logic (I'm sure this exists in scanpy, for example) would have incorrect behavior if passed an AnnData with integer labels.
I would like it more for Python if there were popular libraries that behaved like this. I would also like indexing expressions such as `adata[idx]` to work on `adata.obsm["a"][idx]` where possible.
> A hybrid between the suggested approaches might be the addition of a new method (`.loc`-like) for the integer/range index, while using integers in `[]` to refer to row numbers without any ambiguity.

I'm a little confused by this proposal. How is this different from my suggested `.loc` option? And wouldn't statements whose behavior depends on the index type have the same backwards-compatibility issues here as well?
Not so much a solution as addressing the performance problem a different way: we could have a different form of cheap labels, maybe just fixed-length 64-bit values (or 128-bit if we want them to be UUIDs). While I don't believe pandas has a fixed-length dtype, this could be done with the `ArrowExtensionArray` dtype and arrays with `pyarrow.binary(nbytes)` labels.
> […] I think I'm leaning towards non-numeric, but efficient, datatypes. Like a UUID as a fixed-length byte-string. The previous big blocker to this seems to have been removed in pandas.

– @ivirshup in https://github.com/scverse/anndata/issues/199#issuecomment-1630937671
I fully agree. We already support integer slicing, with the semantic meaning of “index”. So I’d much prefer adding support for a newtype pattern using fixed size types and using that.
That way we could support genomic ranges, k-mers/cell hashes, or UUIDs.
I’m editing the title of this issue to match the original description which captured this. I’m also tentatively adding the “breaking change” label, but we might come up with an approach that isn’t breaking any assumptions.
The nice thing about allowing non-integer types for the index is that it isn't a breaking change (well, maybe if edge cases were relying on the string conversion for non-integer types). Though apparently this would cause problems for R interop, because R also only allows string rownames/column names.
Here's a proof of concept for a UUID array type:
https://gist.github.com/flying-sheep/99f2ceafdc494f97424222611b4f9474
We also have a use case in HuBMAP for storing annotated feature matrices for imaging data, with summary statistics and annotations for cells and nuclei identified in some image.
In this case, the identifiers for cells or nuclei fundamentally are integers, with object *i* composed of all pixels in the segmentation mask image that have value *i*. Additionally, the index for these not only has to start from 1 (due to the convention that pixel value 0 means "background", i.e. not part of any cell, nucleus, or other object), but also needs to be non-contiguous, so as I understand it we couldn't use a `RangeIndex`.
(The non-contiguous case occurs whenever cells/nuclei touch the border of the image; it isn't very meaningful to compute total or mean protein expression or cell shape when half of the cell might be cut off.)
For the moment we'll have to work around this by storing the object IDs as strings, but this is wrong: the type of that identifier is "integer, starting from 1, with arbitrary portions of the range missing".
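A toy example (the mask data below is made up) of why the resulting IDs start at 1, can be non-contiguous, and therefore can't be a `RangeIndex`:

```python
import numpy as np

# Toy segmentation mask: 0 = background, positive integers = object IDs
# assigned by the segmentation tool.
mask = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 5, 0],
    [2, 0, 0, 0, 0],  # object 2 touches the image border
    [0, 0, 0, 0, 0],
])

ids = np.unique(mask)
ids = ids[ids != 0]  # drop the background "object"

# Drop any object touching the border: its summary statistics would be
# biased because part of the cell/nucleus may be cut off.
border = np.unique(np.concatenate([mask[0], mask[-1], mask[:, 0], mask[:, -1]]))
keep = np.setdiff1d(ids, border)

keep  # -> array([1, 5]): starts at 1, non-contiguous, not a RangeIndex
```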
Following issue #35, using integers in `obs_names`/`var_names` is allowed, yet slicing the object is not possible after that. Could solutions for this particular exceptional case be discussed/added to the codebase?
Broadly speaking, if this is solved, it would help with integer- and bit-based representations of biological sequences as k-mers, and would play a role not only in sequence-based analyses of genomics data but also in proteomics, RNA biology, etc.
Others who are interested in this and could maybe join in the discussion about memory and implementation considerations are @gtca @olgabot. Please tag others.
Thank you.