My recommendation is to use .layers and not .raw. .raw was necessary in the past because the highly variable genes algorithm removed the non-variable genes. Now that genes are flagged instead of removed, there is no need for .raw. The information you mention can be stored in layers and plotted if necessary.
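For example, a minimal sketch (assuming adata already holds log-normalized data; the cutoffs are just placeholders):

import scanpy as sc

# flag highly variable genes in adata.var['highly_variable'] without removing any genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
print(adata.shape)                          # dimensions are unchanged
print(adata.var['highly_variable'].sum())   # number of flagged genes
# downstream tools can then restrict themselves to the flagged genes
sc.tl.pca(adata, use_highly_variable=True)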
So, if I do:
adata_1.layers.raw_2 = adata_2.copy()
I will be saving the second expression matrix (with gene and cell names), not just the bare numeric matrix, right?
Or is there a safer way to do this?
I was under the impression that .layers had to agree with .X in terms of dimensionality?
As @ktpolanski says, .layers shares the same dimensionality as .X.
@cartal The layers behave like a dictionary, so the right syntax is (assuming adata1.X and adata2.X have the same cell IDs and genes):
adata1.layers['raw_counts'] = adata2.X.copy()
A practical example is:
import scanpy as sc

adata = sc.read_10x_h5('10x_file.h5')
adata.layers['counts'] = adata.X.copy()   # keep the raw counts alongside .X
sc.pp.log1p(adata)
adata.layers['log1p'] = adata.X.copy()    # keep the log-transformed values as well
sc.pp.regress_out(adata, 'n_counts')      # assumes 'n_counts' is already in adata.obs
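If I remember right, most plotting functions can then read from a layer directly, e.g. ('GeneA' and 'louvain' are placeholder names):

sc.pl.violin(adata, keys='GeneA', groupby='louvain', layer='counts', use_raw=False)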
Wouldn't this result in a large dense array?
Layers can be sparse
I'm referring to the regress_out() call, which comes after no gene subsetting of any kind has been done.
The whole point of .raw is that it allows for an alternate gene space to exist. Meanwhile, .layers have to agree with .X in terms of dimensionality. So if you want to do some regression, like in the prior example, you're stuck with a lower dimensionality in .X. Which loops back to my original inquiry.
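That alternate gene space workflow looks roughly like this (a sketch, assuming adata.var['highly_variable'] and adata.obs['n_counts'] have already been computed):

# freeze the full gene space before subsetting
adata.raw = adata
# subset .X to the highly variable genes only
adata = adata[:, adata.var['highly_variable']].copy()
sc.pp.regress_out(adata, 'n_counts')
sc.pp.scale(adata, max_value=10)
# adata.raw still holds all genes for rank_genes_groups() and plotting
sc.tl.pca(adata, svd_solver='arpack')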
Yes, a separate slot for storing untouched counts that has a different number of genes to .X would be very useful. Currently I have to keep two separate scanpy objects for raw and filtered/transformed data, and move cell attributes (like cluster assignment) between them.
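In practice that syncing ends up being something like this after every clustering step (a rough sketch; adata_raw, adata_filtered and the 'louvain' column are just example names):

# copy the cluster assignment from the processed object back to the raw one,
# matching cells by barcode
adata_raw.obs['louvain'] = adata_filtered.obs['louvain'].reindex(adata_raw.obs_names)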
@ktpolanski Indeed, if you have a lot of genes, regress_out will densify the matrix and consume a lot of memory. If this is a concern, you need to use .raw.
Currently, we are having some discussions to address multi-modality, which means having the same cells and .obs but different variables, e.g. genes, open DNA regions, antibody capture. This should also address the issue that @vitkl reports of having to sync between two anndata objects. See https://github.com/theislab/anndata/issues/237
So if I understand correctly, you would like to keep a copy of the count data, then filter the genes, and store both in the same anndata object. I would love to understand why this is necessary. At the moment I see two reasons to filter genes:
Case 1: Quality control, i.e. removing genes that are barely detected and mainly contribute noise.
Case 2: Restricting to highly variable genes for downstream embedding.
For case 1, I would argue it's fine to filter the genes out of both the counts and the processed data, as you don't want to keep noise in your dataset slowing down your computations. For case 2, we have the option to label genes as "highly variable" via adata.var['highly_variable'] to determine which genes are used downstream for embedding.
Thus, in these cases adata.raw is never needed, as the dimensions of the raw counts and the processed adata.X should always be the same.
If I am missing an important use case, please let me know, so that we don't miss it in internal discussions on the future of .raw and .layers.
So, just to repeat this again, hopefully clearly this time. It would be good to be able to store, without any sort of weird backwards hacks or unlabelled .obsm hoops, the following stuff in an object:
1. Completely raw, unfiltered counts.
2. The log(CPM/100 + 1) form. The equivalent of what currently lives in .raw. This is an all-around useful form of the data, as we all agree in the discussion.
3. The scaled/regressed form of the highly variable genes used for PCA and downstream analysis.
Three different dimensionalities, while the object currently only supports two. Layers do nothing to help here. I hope I've made my case clear by now.
Thanks for clarifying the use case for the completely unfiltered object. I would however argue that the object does support all three cases at the moment.
Case 1: Put in .raw
Case 2: Put in adata.layers['lognormalized'] (or equivalent post-QC layer)
Case 3: Annotate the gene set to use for dimensionality reduction in adata.var['highly_variable'], as is currently the default.
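Putting that mapping together in one object would look roughly like this (a sketch; the normalization parameters are placeholders):

adata.raw = adata                                  # case 1: untouched counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.layers['lognormalized'] = adata.X.copy()     # case 2: log-normalized values
sc.pp.highly_variable_genes(adata)                 # case 3: flag genes, keep full dimensionality
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, use_highly_variable=True)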
The only case that is not taken into account in that separation, as far as I can see, is if you want to run sc.pp.regress_out() only on a subset of cells.
The object does not support all three cases at once. If you don't subset genes prior to regression, you end up with a gigantic dense array. As mentioned, the only way you get to skip case 3 is if you're cool with having the HVG data merely centred (not scaled to unit variance - just centred) by the PCA function innards.
So as I understand it, the issue is the difference in size between a dense matrix subsetted to HVGs (what you would have with an additional .raw, for example) and a dense matrix that is only gene-filtered but not subsetted to HVGs (what you would get by using sc.pp.scale and sc.pp.regress_out() in a layer). Is that correct?
A simple way round this I could imagine would be to create a new layer which is 0 for all genes where adata.var['highly_variable'] == False, and scaled/regressed out for all other genes. That could be a sparse matrix. I wonder what the size difference between that sparse matrix representation and the dense subsetted matrix would be. We could probably create a helper function for that if that's worth it.
It seems more logical to just introduce the power to create more .raw-like fields for independent storage of whatever space/dimensionality.
That said, your thing could work too - have the primary log(CPM/100 + 1) representation around, and then some fake-sparse form of the regressed/scaled data in another layer. This way the dimensions agree, at least.
Sharing of .obs and .var between data matrices is one of the design principles behind anndata. .raw was designed to be the exception, as far as I understand (and one I thought might no longer be necessary with HVG flags). Making anndata more flexible in the dimensions of the data objects would mean compromising on speed and storage efficiency. But maybe others who are more involved with restructuring anndata should comment on that.
You make some solid points.
I guess a good placeholder solution is to put the raw counts data into .raw, store the filtered log(CPM/100 + 1) data in a layer, create a helper object to do any regression/scaling on, and then store the result of that in a sparse .X. In an ideal world, this would be handled by internal scanpy functions, but that sounds like effort for you all, and I can quite easily wing it from here. Thanks for entertaining the thought for a bit.
Your elaboration on the use cases for everything was really helpful! Thanks a lot. We need to understand this stuff as well as we can to make anndata the best it can be.
At the moment .raw is designed for log-normalized, not corrected (scaled, regressed) data. As it's the default for sc.tl.rank_genes_groups(), it's not ideal as counts storage. May need to think about that a bit more.
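For reference, that default means something like this pulls its expression values from .raw whenever it is set ('louvain' is just a placeholder grouping):

sc.tl.rank_genes_groups(adata, groupby='louvain', use_raw=True)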
The helper function should definitely be doable. If you have some time (and are doing it anyway), it would be helpful to have a draft of what this might look like. We could then test the efficiency of that compared to dense large matrices. What do you think @ivirshup @flying-sheep ?
Should be easy enough to ingest a layer argument, like your UMAP function. The current way would be to juggle .X depending on what tool needs what, and that would also work, albeit a bit ugly.
I reckon .layers is the way forward, as was initially suggested.
Hey y'all, so I got around to trying to implement the "sparse scaled" solution, like so:
import numpy as np
import scipy.sparse
import scanpy as sc

sc.pp.log1p(adata)
adata.layers['lognorm'] = adata.X.copy()
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
#create a "fake-sparse" form of the scaled data in .X, so that dimensionality can remain unchanged
#(and lognorm data can live in a layer)
bdata = adata[:, adata.var['highly_variable']].copy()
sc.pp.scale(bdata, max_value=10)
#construct matrix as lil, which is row based. the HVG index is column based, so two transpositions are in order
X = scipy.sparse.lil_matrix(adata.shape[::-1])
X[np.arange(adata.shape[1])[adata.var['highly_variable']], :] = bdata.X.T
adata.X = X.tocsc().T
sc.tl.pca(adata, svd_solver='arpack')
It works, but then I save the object and it's 5 GB compared to the conventional scaled object's 2 GB. So this doesn't seem like an ideal solution. Thoughts?
I believe this has been resolved with the introduction of layers and, later, MuData.
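For example, a MuData object wraps several AnnData objects that share cells but keep separate feature spaces (a minimal sketch; adata_rna and adata_atac are assumed to be already-constructed AnnData objects):

import mudata

# each modality keeps its own .var, while the cells can be shared across modalities
mdata = mudata.MuData({'rna': adata_rna, 'atac': adata_atac})
print(mdata)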
Hello,
It's sometimes quite handy to be able to store multiple forms of the data - completely unfiltered, untouched counts for various processing needs, the log(CPM/100 + 1) form for marker testing and plotting, and whatever scaled/regressed form of the HVGs lives in .X. I have a hazy memory of the Seurat object holding all of this around the time I made the jump to 0.3.x Scanpy. Would it be possible to allow for more .raw-style fields in AnnData to store these sorts of differing gene space representations? Or is this already possible in some manner?