ilibarra opened this issue 2 years ago
I'm all for this, but I think it requires a new API for indexing. How do we make sure the result of `adata[[1,2,3]]` is unambiguous if the labels are integer valued? xarray and pandas support this with indexing attributes: `df.iloc[...]` vs `df.loc[...]` for pandas and `da[...]` vs `da.loc[...]` for xarray.

The hard part here is backwards compatibility.
We would also need this for single nucleotide resolution data. Tagging @mffrank here.
@ivirshup, maybe one way to do that would be to only allow a range index. This way, e.g. `adata[np.array([0, 2])]` would mean getting the first and the third obs, irrespective of whether you think those are names or indices. As range indices were not possible before, that should be ok with backwards compatibility.

The idea behind this is not that we need to use integers for indexing, but rather that we don't want to have string indices in some use cases.
Previous discussion: https://github.com/scverse/anndata/issues/311
Thanks, @ivirshup, I was just going to write that it seems like one can even use `RangeIndex` already, and we just need to figure out (a) the places where the unnecessary `RangeIndex` -> `str` conversion should be avoided and (b) (de)serialisation (which should be easier than before since #555).
```python
import numpy as np
import pandas as pd
from anndata import AnnData

df = pd.DataFrame(np.random.normal(size=(100, 10)))
adata = AnnData(df)
# => anndata/_core/anndata.py:120: ImplicitModificationWarning: Transforming to str index.
# =>   warnings.warn("Transforming to str index.", ImplicitModificationWarning)
adata.obs_names = df.index
adata.obs_names
# => RangeIndex(start=0, stop=100, step=1)
adata[[0, 9, 99]]
# => View of AnnData object with n_obs × n_vars = 3 × 10
```
@LucaMarconato, this is relevant for speeding up reading of points for FISH-like data
My current thinking here is just to copy the API of xarray as much as possible. The idea is to eventually move all label-based indexing to a `.loc` attribute, and keep `[]` for positional indexing. So, we deprecate indexing with `[]` when it's label based. However, this would not yet allow label-based indexing by integers, as any current calls to `[]` that used positional indexing but were passed an AnnData with integer indexes would have incorrect behavior.

The most important goal is to do this in a way that causes the fewest bugs possible. However, we would also like to quickly get to allowing integer-based indexes. We could have a significant period of time where using labels when indexing with `[]` throws an error. This should give downstream packages enough time to update. Then we start allowing integer indices, assuming all label-based indexing code is using `.loc`.
**`iloc`**

At the same time as `.loc` we could introduce `.iloc` for positional indexing. This would get us to integer indices much faster. For any ambiguous cases (e.g. where axes are integer valued) we could throw errors for `[]`.

We then need to decide if we want to make this permanent. In that case we have effectively made `[]` indexing something to avoid, as whether it works depends on the contents of the AnnData. After some period of time, we could then start allowing positional indexing via `[]` again, and deprecate `.iloc`. While I don't love the idea of an intentionally temporary API, removing `iloc` would be a very easy find-and-replace.
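The two-accessor design described above can be sketched with a toy class; `ToyData`, `_Loc`, and `_ILoc` are hypothetical stand-ins for illustration, not AnnData's actual classes:

```python
import numpy as np

class _Loc:
    """Label-based accessor (hypothetical, mirroring pandas/xarray .loc)."""
    def __init__(self, obj):
        self._obj = obj

    def __getitem__(self, labels):
        # Translate labels to positions via the name index.
        names = list(self._obj.obs_names)
        positions = [names.index(label) for label in np.atleast_1d(labels)]
        return self._obj._take(positions)

class _ILoc:
    """Positional accessor (hypothetical, mirroring pandas .iloc)."""
    def __init__(self, obj):
        self._obj = obj

    def __getitem__(self, positions):
        return self._obj._take(np.atleast_1d(positions))

class ToyData:
    """Stand-in for AnnData with integer obs_names, showing the dispatch."""
    def __init__(self, obs_names, X):
        self.obs_names = list(obs_names)
        self.X = np.asarray(X)

    loc = property(lambda self: _Loc(self))
    iloc = property(lambda self: _ILoc(self))

    def _take(self, positions):
        return self.X[np.asarray(positions)]

t = ToyData(obs_names=[3, 4, 5], X=[30, 40, 50])
t.iloc[[0]]  # position 0 -> array([30])
t.loc[[3]]   # label 3    -> array([30])
```

With integer-valued `obs_names` like these, `t[[0]]` would be exactly the ambiguous case; the two accessors make the intent explicit.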
@ivirshup, my current thinking is to try to stay away from `.loc` and explore other alternatives. While `.loc` could help to make it more pandas-like, it doesn't seem to me like that should be a goal here.

With the goal that you mentioned in mind, one alternative discussed recently might be to introduce an integer index and a new index class that allows using it instead of string-based indices. I.e. `adata[[0, 1, 2]]` would still be subsetting by row numbers, while `adata[Index(3, 4)]` would refer to the integer index. Something like this would also require significantly less time to roll out and gather feedback, and it would be a backwards-compatible change.
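A minimal sketch of how such a marker type could disambiguate the two meanings; `Index` and `subset` are purely illustrative names, not a proposed API:

```python
class Index:
    """Hypothetical marker type: wrapped values are labels, not positions."""
    def __init__(self, *labels):
        self.labels = list(labels)

def subset(obs_names, values, key):
    # Bare integers keep their current meaning: row numbers.
    # Wrapping them in Index(...) opts in to label-based lookup,
    # so there is no ambiguity even with integer-valued obs_names.
    if isinstance(key, Index):
        key = [obs_names.index(label) for label in key.labels]
    return [values[i] for i in key]

obs_names = [3, 4, 5]               # integer-valued labels
values = ["row0", "row1", "row2"]

subset(obs_names, values, [0, 1])       # positions 0, 1 -> ["row0", "row1"]
subset(obs_names, values, Index(3, 4))  # labels 3, 4    -> ["row0", "row1"]
subset(obs_names, values, Index(5))     # label 5        -> ["row2"]
```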
A hybrid between the suggested approaches might be the addition of a new method (`.loc`-like) for the integer/range index, while using integers in `[]` to refer to row numbers without any ambiguity. (For strings there is no ambiguity to start with.) That also makes sense in practice, as an integer index essentially means that we do not care about feature names per se (but we still need some IDs in order to keep track). This is also a backwards-compatible change.

I'd be up to discuss more and maybe prototype some things!
Thanks for the suggestions!

As a minor point:

> While `.loc` could help to make it more pandas-like, it doesn't seem to me like that should be a goal here.

I'm aiming much more at xarray than pandas here. A big upside of removing this ambiguity is that we could also start using the xarray index dtypes.
I broadly like this idea. I think DimensionalData.jl takes a similar approach, where you just use different types for different kinds of indexing. I think there are a couple of downsides though:

```python
idx = adata.obs.query("celltype == 'a'").index
adata[idx]  # or adata[idx.values]
```

If `obs_names` were strings, this would be label-based indexing, but if they were integer valued this would be positional. So any function that uses this kind of selection logic (I'm sure this exists in scanpy, for example) would have incorrect behavior if passed an AnnData with integer labels.
I would like it more for Python if there were popular libraries that behaved like this. I would also like indexing expressions such as `adata[idx]` to work on `adata.obsm["a"][idx]` where possible.
> A hybrid between the suggested approaches might be the addition of a new method (`.loc`-like) for the integer/range index, while using integers in `[]` to refer to row numbers without any ambiguity.

I'm a little confused by this proposal. How is this different from my suggested `.loc` option? And wouldn't statements whose behavior depends on the index type have the same backwards-compatibility issues here as well?
Not so much a solution as addressing the performance problem a different way: we could have a different form of cheap labels, maybe just fixed-length 64-bit values (or 128-bit if we want them to be UUIDs). While I don't believe pandas has a fixed-length dtype, this could be done with the `ArrowExtensionArray` dtype and arrays with `pyarrow.binary(nbytes)` labels.
> […] I think I'm leaning towards non-numeric, but efficient, datatypes. Like a UUID as a fixed-length byte-string. The previous big blocker to this seems to have been removed in pandas.

– @ivirshup in https://github.com/scverse/anndata/issues/199#issuecomment-1630937671
I fully agree. We already support integer slicing, with the semantic meaning of “index”. So I’d much prefer adding support for a newtype pattern using fixed size types and using that.
That way we could support genomic ranges, k-mers/cell hashes, or UUIDs.
I’m editing the title of this issue to match the original description which captured this. I’m also tentatively adding the “breaking change” label, but we might come up with an approach that isn’t breaking any assumptions.
The nice thing about allowing non-integer types for the index is that it isn't a breaking change (well, maybe if edge cases were relying on the string conversion for non-integer types). Though apparently this would cause problems for R interop, because R also only allows string rownames/column names.
Here's a proof of concept for a UUID array type:
https://gist.github.com/flying-sheep/99f2ceafdc494f97424222611b4f9474
We also have a use case in HuBMAP for storing annotated feature matrices for imaging data, with summary statistics and annotations for cells and nuclei identified in some image.
In this case, the identifiers for cells or nuclei fundamentally are integers, with object *i* composed of all pixels in the segmentation mask image that have value *i*. Additionally, the index for these not only has to start from 1 (due to the convention that pixel value 0 means "background", i.e. not part of any cell, nucleus, or other object), but also needs to be non-contiguous, so as I understand it we couldn't use a `RangeIndex`.
(The non-contiguous case occurs whenever cells/nuclei touch the border of the image; it isn't very meaningful to compute total or mean protein expression or cell shape when half of the cell might be cut off.)
For the moment we'll have to work around this by storing the object IDs as strings, but this is wrong: the type of that identifier is "integer, starting from 1, with arbitrary portions of the range missing".
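A toy example (the mask data below is made up) of why the resulting IDs start at 1, can be non-contiguous, and therefore can't be a `RangeIndex`:

```python
import numpy as np

# Toy segmentation mask: 0 = background, positive integers = object IDs
# assigned by the segmentation tool.
mask = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 5, 0],
    [2, 0, 0, 0, 0],  # object 2 touches the image border
    [0, 0, 0, 0, 0],
])

ids = np.unique(mask)
ids = ids[ids != 0]  # drop the background "object"

# Drop any object touching the border: its summary statistics would be
# biased because part of the cell/nucleus may be cut off.
border = np.unique(np.concatenate([mask[0], mask[-1], mask[:, 0], mask[:, -1]]))
keep = np.setdiff1d(ids, border)

keep  # -> array([1, 5]): starts at 1, non-contiguous, not a RangeIndex
```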
Following issue #35, using integers in `obs_names`/`var_names` is allowed, yet slicing the object is not possible after that. Could solutions for this particular exceptional case be discussed/added to the codebase?
Broadly speaking, if this is solved, it would help with integer- and bit-based representations of biological sequences as k-mers, and would play a role not only in sequence-based analyses of genomics data but also in proteomics, RNA biology, etc.
Others who are interested in this and could maybe join in the discussion about memory and implementation considerations are @gtca @olgabot. Please tag others.
Thank you.