scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Outer concat with merge="first" filled with NaNs #614

Open wflynny opened 3 years ago

wflynny commented 3 years ago

This might be a misunderstanding on my part, but I would expect ad.concat(ads, merge="first", join="outer") to populate the alternative axis with the first non-NaN/null value it finds. E.g., combining two dataframes in this way would fully populate the combined .var. However, it looks like only the values of the first AnnData object are taken (even if they are NaN after the reindexing).

Here's a short example:

>>> import anndata as ad
>>> import scanpy as sc
>>> d = sc.datasets.pbmc3k()
>>> sc.pp.calculate_qc_metrics(d, inplace=True, percent_top=None)
>>> a = d[:, d.var_names[:100]].copy()
>>> b = d[:, d.var_names[-100:]].copy()
>>> print(a,b)
(AnnData object with n_obs × n_vars = 2700 × 100
     obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts'
     var: 'gene_ids', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts',
 AnnData object with n_obs × n_vars = 2700 × 100
     obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts'
     var: 'gene_ids', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts')

The objects have 100 genes each with no overlap. When concatenating them, only values from a.var are present in the concatenated object's .var.

>>> c = ad.concat([a, b], join="outer", merge="first")
>>> c.var_names.difference(a.var_names.union(b.var_names))  # check the var_names are correct, they are!
Index([], dtype='object', name='index')
>>> c.var.loc[a.var_names].isnull().any()  # all values for a.var are present
gene_ids                 False
n_cells_by_counts        False
mean_counts              False
log1p_mean_counts        False
pct_dropout_by_counts    False
total_counts             False
log1p_total_counts       False
dtype: bool
>>> c.var.loc[b.var_names].isnull().all()  # no values from b.var are present
gene_ids                 True
n_cells_by_counts        True
mean_counts              True
log1p_mean_counts        True
pct_dropout_by_counts    True
total_counts             True
log1p_total_counts       True
dtype: bool
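
The difference between the observed and the expected result can be pictured with plain pandas. This is only a sketch: the toy tables and the names var_a, var_b, first_as_is and first_non_null are made up for illustration, and modelling merge="first" as "reindex, then keep the first table unchanged" is an assumption based on the output above, not a statement about anndata's internals.

```python
import pandas as pd

# Toy stand-ins for a.var and b.var: disjoint gene sets, one shared column.
var_a = pd.DataFrame({"mean_counts": [0.1, 0.2]}, index=["gene1", "gene2"])
var_b = pd.DataFrame({"mean_counts": [0.3, 0.4]}, index=["gene3", "gene4"])
all_genes = var_a.index.union(var_b.index)

# Assumed model of the observed behaviour: after reindexing to the union of
# genes, the first table is kept as is, so rows that exist only in var_b are NaN.
first_as_is = var_a.reindex(all_genes)

# Expected behaviour: for each cell, take the first non-null value across tables.
first_non_null = var_a.reindex(all_genes).combine_first(var_b.reindex(all_genes))

print(first_as_is)
print(first_non_null)
```

In this toy case first_as_is has missing values for gene3 and gene4, while first_non_null is fully populated.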

Ultimately I'm looking for the best way to outer concat objects and preserve the annotations in .var. Looks like the best way to do so is to not filter out any genes prior to ad.concat, then concat with any non-None value of merge, then filter out genes.

ivirshup commented 2 years ago

> Looks like the best way to do so is to not filter out any genes prior to ad.concat, then concat with any non-None value of merge, then filter out genes.

Yes, this is probably the best way to do this at the moment.

I believe we would need to introduce a merge argument value for what you want, which I think is broadly reasonable. This operation is a bit complicated though. The behavior would be like passing compat="no_conflicts" to xarray.concat.

Other corollaries: our "first" is like their "override", "same" is like "broadcast_equals", and "unique" is like if "no_conflicts" operated on the whole array/dataframe.

The way I understand the logic here: for entries that are present in the intersection of the variables, all values must match. If they don't, we drop the column. This should keep info like Ensembl IDs and genomic ranges. It would drop things like summary statistics over different datasets.
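
The logic described above could be sketched in pandas roughly as follows. The helper merge_no_conflicts is hypothetical (it is not part of anndata or pandas), it assumes unique index labels in each table, and it only illustrates the rule "keep a column if all overlapping non-null values agree, otherwise drop it".

```python
import pandas as pd


def merge_no_conflicts(frames):
    """Hypothetical helper: fill nulls from later frames, drop conflicting columns."""
    merged = frames[0].copy()
    conflicted = set()
    for frame in frames[1:]:
        shared_rows = merged.index.intersection(frame.index)
        shared_cols = merged.columns.intersection(frame.columns)
        for col in shared_cols:
            left = merged.loc[shared_rows, col]
            right = frame.loc[shared_rows, col]
            # Only rows where both sides have a value can be in conflict.
            both = left.notna() & right.notna()
            if (left[both] != right[both]).any():
                conflicted.add(col)
        # Keep existing values and fill gaps from the next frame.
        merged = merged.combine_first(frame)
    return merged.drop(columns=sorted(conflicted))


var1 = pd.DataFrame(
    {"gene_id": ["ENSG1", "ENSG2"], "mean_counts": [1.0, 2.0]}, index=["g1", "g2"]
)
var2 = pd.DataFrame(
    {"gene_id": ["ENSG2", "ENSG3"], "mean_counts": [5.0, 6.0]}, index=["g2", "g3"]
)
merged = merge_no_conflicts([var1, var2])
print(merged)
```

Here gene_id agrees on the shared variable g2 and is kept for all three variables, while mean_counts differs on g2 and is dropped, matching the intent that identifiers survive and per-dataset summary statistics do not.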

The example case here is a bit funny, because there is no intersection of the variables, so values would always be kept. Is this a real case, or an illustration? It seems like this would be bad for things like principal components (.varm["PCs"]), where you would end up with variable loadings from two different decompositions in one array.

aeisenbarth commented 7 months ago

We have the same issue; however, we cannot mitigate it by swapping the order of filtering, since we don't use filter_genes. In our case, AnnData files come from different sources and are read in and concatenated. For common var_names, some var rows can have NaN while others have a value (always the same value).

We would need a merge option that ignores null values, but prefers the first (or any) non-null value. None of the available merge options can achieve that.

The only workaround so far is to do the var merging manually:

from functools import reduce

import anndata
import numpy as np
import pandas as pd

adata1 = anndata.AnnData(obs=pd.DataFrame(index=["obs1"]), var=pd.DataFrame({"col1": [1.0, np.nan]}, index=["var1", "var2"]))
adata2 = anndata.AnnData(obs=pd.DataFrame(index=["obs2"]), var=pd.DataFrame({"col1": [2.0, 3.0]}, index=["var2", "var3"]))
adatas = [adata1, adata2]
adata_concatenated = anndata.concat(adatas, join="outer", merge="first")
# assert np.all(~np.isnan(adata_concatenated.var)) # Fails

# Update missing/NaN values in the first dataframe with values from a subsequent one.
var_first_non_nan = reduce(lambda df1, df2: df1.combine_first(df2), [a.var for a in adatas])
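# combine_first sorts the index, so restore the concatenated object's var order before assigning.
adata_concatenated.var = var_first_non_nan.reindex(adata_concatenated.var_names)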
assert np.all(~np.isnan(adata_concatenated.var))

Is there interest in a pull request? In our case, we don't have conflicts that require any more advanced merge option, but adding this feature as a variant of merge="first" opens the question of adding variants for the others as well, without an actual need.