My recommendation is to use .layers and not .raw. .raw was necessary in the past because the highly variable genes algorithm removed the non-variable genes. Now that genes are flagged instead of removed, there is no need for .raw. The information you mention can be stored in layers and plotted if necessary.
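For example, a minimal sketch (assuming adata already holds log-normalized data; the cutoffs are just placeholders):

import scanpy as sc

# flag highly variable genes in adata.var['highly_variable'] without removing any genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
print(adata.shape)                          # dimensions are unchanged
print(adata.var['highly_variable'].sum())   # number of flagged genes
# downstream tools can then restrict themselves to the flagged genes
sc.tl.pca(adata, use_highly_variable=True)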
So, if I do:
adata_1.layers.raw_2 = adata_2.copy()
I will be saving the second expression matrix (with gene and cell names), not just the bare numeric matrix, right?
Or is there a safer way to do this?
I was under the impression that .layers had to agree with .X in terms of dimensionality?
As @ktpolanski says, .layers shares the same dimensionality as .X.
@cartal The layers behave like a dictionary, so the right syntax is (assuming adata1.X and adata2.X have the same cell IDs and genes):
adata1.layers['raw_counts'] = adata2.X.copy()
A practical example is:
import scanpy as sc

adata = sc.read_10x_h5('10x_file.h5')
adata.layers['counts'] = adata.X.copy()   # keep the raw counts alongside .X
sc.pp.log1p(adata)
adata.layers['log1p'] = adata.X.copy()    # keep the log-transformed values as well
sc.pp.regress_out(adata, 'n_counts')      # assumes 'n_counts' is already in adata.obs
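If I remember right, most plotting functions can then read from a layer directly, e.g. ('GeneA' and 'louvain' are placeholder names):

sc.pl.violin(adata, keys='GeneA', groupby='louvain', layer='counts', use_raw=False)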
Wouldn't this result in a large dense array?
Layers can be sparse
I'm referring to the regress_out() call, which comes after no gene subsetting of any kind has been done.
The whole point of .raw is that it allows for an alternate gene space to exist. Meanwhile, .layers have to agree with .X in terms of dimensionality. So if you want to do some regression, like in the prior example, you're stuck with a lower dimensionality in .X. Which loops back to my original inquiry.
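That alternate gene space workflow looks roughly like this (a sketch, assuming adata.var['highly_variable'] and adata.obs['n_counts'] have already been computed):

# freeze the full gene space before subsetting
adata.raw = adata
# subset .X to the highly variable genes only
adata = adata[:, adata.var['highly_variable']].copy()
sc.pp.regress_out(adata, 'n_counts')
sc.pp.scale(adata, max_value=10)
# adata.raw still holds all genes for rank_genes_groups() and plotting
sc.tl.pca(adata, svd_solver='arpack')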
Yes, a separate slot for storing untouched counts that has a different number of genes to .X would be very useful. Currently I have to keep two separate scanpy objects for raw and filtered/transformed data, and move cell attributes (like cluster assignment) between them.
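In practice that syncing ends up being something like this after every clustering step (a rough sketch; adata_raw, adata_filtered and the 'louvain' column are just example names):

# copy the cluster assignment from the processed object back to the raw one,
# matching cells by barcode
adata_raw.obs['louvain'] = adata_filtered.obs['louvain'].reindex(adata_raw.obs_names)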
@ktpolanski Indeed, if you have a lot of genes, regress_out will densify the matrix and consume a lot of memory. If this is a concern, you need to use .raw.
Currently, we are having some discussions to address multi-modality, which means having the same cells and .obs but different variables, e.g. genes, open DNA regions, antibody capture. This should also address the issue that @vitkl reports of having to sync between two anndata objects. See https://github.com/theislab/anndata/issues/237
So if I understand correctly, you would like to keep a copy of the count data, then filter the genes, and store both in the same anndata object. I would love to understand why this is necessary. At the moment I see two reasons to filter genes:
Case 1: Quality control, i.e. removing genes that are barely detected and mainly contribute noise.
Case 2: Restricting to highly variable genes for downstream embedding.
For case 1, I would argue it's fine to filter the genes out of both the counts and the processed data, as you don't want to keep noise in your dataset slowing down your computations. For case 2, we have the option to label genes as "highly variable" via adata.var['highly_variable'] to determine which genes are used downstream for embedding.
Thus, in these cases adata.raw is never needed, as the dimensions of the raw counts and the processed adata.X should always be the same.
If I am missing an important use case, please let me know, so that we don't miss it in internal discussions on the future of .raw and .layers.
So, just to repeat this again, hopefully clearly this time. It would be good to be able to store, without any sort of weird backwards hacks or unlabelled .obsm hoops, the following stuff in an object:
1. Completely raw, unfiltered counts.
2. The log(CPM/100 + 1) form. The equivalent of what currently lives in .raw. This is an all-around useful form of the data, as we all agree in the discussion.
3. The scaled/regressed form of the highly variable genes used for PCA and downstream analysis.
Three different dimensionalities, while the object currently only supports two. Layers do nothing to help here. I hope I've made my case clear by now.
Thanks for clarifying the use case for the completely unfiltered object. I would however argue that the object does support all three cases at the moment.
Case 1: Put in .raw
Case 2: Put in adata.layers['lognormalized'] (or equivalent post-QC layer)
Case 3: Annotate the gene set to use for dimensionality reduction in adata.var['highly_variable'], as is currently the default.
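Putting that mapping together in one object would look roughly like this (a sketch; the normalization parameters are placeholders):

adata.raw = adata                                  # case 1: untouched counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.layers['lognormalized'] = adata.X.copy()     # case 2: log-normalized values
sc.pp.highly_variable_genes(adata)                 # case 3: flag genes, keep full dimensionality
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, use_highly_variable=True)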
The only case that is not taken into account in that separation, as far as I can see, is if you want to run sc.pp.regress_out() only on a subset of cells.
The object does not support all three cases at once. If you don't subset genes prior to regression, you end up with a gigantic dense array. As mentioned, the only way you get to skip case 3 is if you're cool with having the HVG data merely centred (not scaled to unit variance - just centred) by the PCA function innards.
So as I understand it, the issue is the difference in size between a dense matrix subsetted to HVGs (what you would have with an additional .raw, for example) and a dense matrix that is only gene-filtered but not subsetted to HVGs (what you would get by using sc.pp.scale and sc.pp.regress_out() in a layer). Is that correct?
A simple way round this I could imagine would be to create a new layer which is 0 for all genes where adata.var['highly_variable'] == False, and scaled/regressed out for all other genes. That could be a sparse matrix. I wonder what the size difference between that sparse matrix representation and the dense subsetted matrix would be. We could probably create a helper function for that if that's worth it.
It seems more logical to just introduce the power to create more .raw-like fields for independent storage of whatever space/dimensionality.
That said, your thing could work too - have the primary log(CPM/100 + 1) representation around, and then some fake-sparse form of the regressed/scaled data in another layer. This way the dimensions agree, at least.
Sharing of .obs and .var between data matrices is one of the design principles behind anndata. .raw was designed to be the exception, as far as I understand (and one I thought might no longer be necessary with HVG flags). Making anndata more flexible in the dimensions of the data objects would mean compromising on speed and storage efficiency. But maybe others who are more involved with restructuring anndata should comment on that.
You make some solid points.
I guess a good placeholder solution is to put the raw counts data into .raw, store the filtered log(CPM/100 + 1) data in a layer, create a helper object to do any regression/scaling on, and then store the result of that in a sparse .X. In an ideal world, this would be handled by internal scanpy functions, but that sounds like effort for you all, and I can quite easily wing it from here. Thanks for entertaining the thought for a bit.
Your elaboration on the use cases for everything was really helpful! Thanks a lot. We need to understand this stuff as well as we can to make anndata the best it can be.
At the moment .raw is designed for log-normalized, not corrected (scaled, regressed) data. As it's the default for sc.tl.rank_genes_groups(), it's not ideal as counts storage. May need to think about that a bit more.
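For reference, that default means something like this pulls its expression values from .raw whenever it is set ('louvain' is just a placeholder grouping):

sc.tl.rank_genes_groups(adata, groupby='louvain', use_raw=True)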
The helper function should definitely be doable. If you have some time (and are doing it anyway), it would be helpful to have a draft of what this might look like. We could then test the efficiency of that compared to dense large matrices. What do you think @ivirshup @flying-sheep ?
Should be easy enough to ingest a layer argument, like your UMAP function. The current way would be to juggle .X depending on what tool needs what, and that would also work, albeit a bit ugly.
I reckon .layers is the way forward, as was initially suggested.
Hey y'all, so I got around to trying to implement the "sparse scaled" solution, like so:
import numpy as np
import scipy.sparse
import scanpy as sc

sc.pp.log1p(adata)
adata.layers['lognorm'] = adata.X.copy()
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
#create a "fake-sparse" form of the scaled data in .X, so that dimensionality can remain unchanged
#(and lognorm data can live in a layer)
bdata = adata[:, adata.var['highly_variable']].copy()
sc.pp.scale(bdata, max_value=10)
#construct matrix as lil, which is row based. the HVG index is column based, so two transpositions are in order
X = scipy.sparse.lil_matrix(adata.shape[::-1])
X[np.arange(adata.shape[1])[adata.var['highly_variable']], :] = bdata.X.T
adata.X = X.tocsc().T
sc.tl.pca(adata, svd_solver='arpack')
It works, but then I save the object and it's 5 GB compared to the conventional scaled object's 2 GB. So this doesn't seem like an ideal solution. Thoughts?
I believe this has been resolved with the introduction of layers and, later, MuData.
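For example, a MuData object wraps several AnnData objects that share cells but keep separate feature spaces (a minimal sketch; adata_rna and adata_atac are assumed to be already-constructed AnnData objects):

import mudata

# each modality keeps its own .var, while the cells can be shared across modalities
mdata = mudata.MuData({'rna': adata_rna, 'atac': adata_atac})
print(mdata)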
Hello,
It's sometimes quite handy to be able to store multiple forms of the data - completely unfiltered, untouched counts for various processing needs, the log(CPM/100 + 1) form for marker testing and plotting, and whatever scaled/regressed form of the HVGs lives in .X. I have a hazy memory of the Seurat object holding all of this around the time I made the jump to 0.3.x Scanpy. Would it be possible to allow for more .raw-style fields in AnnData to store these sorts of differing gene space representations? Or is this already possible in some manner?