Open wflynny opened 3 years ago
Looks like the best way to do so is to not filter out any genes prior to ad.concat, then concat with any non-None value of merge, then filter out genes.
Yes, this is probably the best way to do this at the moment.
I believe we would need to introduce a merge argument value for what you want, which I think is broadly reasonable. This operation is a bit complicated though. The behavior would be like passing compat="no_conflicts" to xarray.concat.
Other corollaries: our "first" is like their "override", "same" is like "broadcast_equals", and "unique" is what "no_conflicts" would be if it operated on the whole array/dataframe.
The way I understand the logic here: For entries that are present in the intersection of the variable, all values must match. If they don't we drop the column. This should keep info like ensembl ids and genomic ranges. It would drop things like summary statistics over different datasets.
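The logic described above can be sketched in plain pandas. This is only an illustration, not anndata's implementation: merge_no_conflicts is a hypothetical helper, the toy gene IDs are made up, and it assumes (as xarray's "no_conflicts" does) that a null value never counts as a conflict.

```python
import pandas as pd


def merge_no_conflicts(dfs):
    """Hypothetical helper: keep a column only if no index entry has
    two different non-null values across the input dataframes."""
    # Union of all indices, preserving first-seen order.
    union = dfs[0].index
    for df in dfs[1:]:
        union = union.union(df.index, sort=False)

    # All column names, in first-seen order.
    all_columns = list(dict.fromkeys(c for df in dfs for c in df.columns))

    merged = pd.DataFrame(index=union)
    for col in all_columns:
        parts = [df[col].reindex(union) for df in dfs if col in df.columns]
        # One column per source dataframe, aligned on the union index.
        stacked = pd.concat(parts, axis=1, keys=range(len(parts)))
        # A row with more than one distinct non-null value is a conflict.
        if (stacked.nunique(axis=1, dropna=True) > 1).any():
            continue  # drop the whole column
        # Otherwise take the first non-null value in each row.
        merged[col] = stacked.bfill(axis=1).iloc[:, 0]
    return merged


var_a = pd.DataFrame(
    {"gene_id": ["id1", "id2"], "mean_expr": [1.0, 2.0]}, index=["g1", "g2"]
)
var_b = pd.DataFrame(
    {"gene_id": ["id2", "id3"], "mean_expr": [5.0, 6.0]}, index=["g2", "g3"]
)

merged = merge_no_conflicts([var_a, var_b])
```

Here gene_id agrees on the shared entry g2 and is kept for all three genes, while mean_expr differs on g2 and is dropped, matching the "keep ids, drop per-dataset summary statistics" expectation.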
The example case here is a bit funny, because there is no intersection of the variables, so values would always be kept. Is this a real case, or an illustration? It seems like this would be bad for things like principal components (.varm["PCs"]), where you would end up with variable loadings from two different decompositions in one array.
We have the same issue, but we cannot mitigate it by swapping the order of filtering, since we don't use filter_genes. In our case, AnnData files come from different sources and are read in and concatenated. For common var_names, some var rows have NaN and others have a value (always the same value).
We would need a merge option that ignores null values, but prefers the first (or any) non-null value. None of the available merge options can achieve that.
The only workaround so far is to do the var merging manually:
from functools import reduce

import anndata
import numpy as np
import pandas as pd

adata1 = anndata.AnnData(obs=pd.DataFrame(index=["obs1"]), var=pd.DataFrame({"col1": [1.0, np.nan]}, index=["var1", "var2"]))
adata2 = anndata.AnnData(obs=pd.DataFrame(index=["obs2"]), var=pd.DataFrame({"col1": [2.0, 3.0]}, index=["var2", "var3"]))
adatas = [adata1, adata2]
adata_concatenated = anndata.concat(adatas, join="outer", merge="first")
# assert np.all(~np.isnan(adata_concatenated.var)) # Fails
# Update missing/NaN values in the first dataframe with values from a subsequent one.
var_first_non_nan = reduce(lambda df1, df2: df1.combine_first(df2), [a.var for a in adatas])
adata_concatenated.var = var_first_non_nan
assert np.all(~np.isnan(adata_concatenated.var))
Is there interest in a pull request? In our case, we don't have conflicts that require any more advanced merge option, but adding this feature as a variant of merge="first" opens the question of adding variants for the others as well, without an actual need.
This might be a misunderstanding on my part, but I would expect ad.concat(ads, merge="first", join="outer") to populate the alternative axis with the first non-NaN/null value it finds. E.g. combining two dataframes in this way would fully populate the combined .var. However, it looks like only the values of the first AnnData object are taken (even if they are NaN after the reindexing). Here's a short example:
The objects have 100 genes each with no overlap. When concatenating them, only values from a.var are present in the concatenated object's .var.
Ultimately I'm looking for the best way to outer concat objects and preserve the annotations in .var. Looks like the best way to do so is to not filter out any genes prior to ad.concat, then concat with any non-None value of merge, then filter out genes.
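The difference between "take the first object's values after reindexing" and "take the first non-null value" can be mimicked with plain pandas. This is a rough analogy under the assumption that the behaviour is as described in this post. It does not call anndata, and the tiny var tables and gene IDs are made up for illustration.

```python
import pandas as pd

# Two .var-like tables with no overlapping genes.
var_a = pd.DataFrame({"gene_id": ["idA1", "idA2"]}, index=["a1", "a2"])
var_b = pd.DataFrame({"gene_id": ["idB1", "idB2"]}, index=["b1", "b2"])

# Outer join of the variable names.
union = var_a.index.union(var_b.index, sort=False)

# Analogue of the reported behaviour: reindex, then keep only the
# first table's values, so genes unique to var_b end up as NaN.
first_only = var_a.reindex(union)

# Analogue of the expected behaviour: fall back to later tables
# wherever the first one is null.
first_non_null = var_a.reindex(union).combine_first(var_b.reindex(union))
```

In this toy case first_only has missing gene_id values for b1 and b2, while first_non_null is fully populated.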