scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Changing the use_raw behavior without breaking things #1798

Open gokceneraslan opened 3 years ago

gokceneraslan commented 3 years ago

It is common to store raw (i.e. unnormalized) counts of all measured genes under adata.raw, while keeping normalized and unnormalized expression of a subset of genes (e.g. only protein-coding genes, or all genes except ribosomal and mitochondrial ones) in adata.X and adata.layers['counts'] respectively. This, however, causes a lot of trouble in plotting, since visualizing raw counts is not a great idea due to their dynamic range. It is super annoying to pass use_raw=False to a lot of functions. Furthermore, weird rank_genes_groups outputs caused by raw counts might go unnoticed because of this (it has happened to me many times).

I think there was a discussion somewhere about switching to use_raw=False by default in all functions, but this could break things, since it's a significant behavior change. This might be reasonable for Scanpy 2.0, but not in 1.x, I assume.

My suggestion is to have a use_raw option under sc.settings (i.e. the global ScanpyConfig instance) that is None by default and can be set to False (e.g. sc.settings.use_raw = False), which would then affect all functions with a use_raw argument. This way we don't break the current behavior but still have a reasonable way to turn this off :)
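
Roughly, a sketch of what I mean (the global setting is hypothetical; today you would have to thread use_raw=False through every call):

import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()  # small bundled dataset with .raw populated

# Proposed (hypothetical) global switch:
# sc.settings.use_raw = False

# Current workaround: pass use_raw=False to each function explicitly.
sc.tl.rank_genes_groups(adata, "bulk_labels", use_raw=False)
sc.pl.umap(adata, color=adata.var_names[0], use_raw=False)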

Let me know what you think.

LuckyMD commented 3 years ago

I like the thought... for exactly the reason you brought up, I recommended storing log-normalized data in adata.raw in my best-practices workflow. That way DE analysis and plotting are done on that data type rather than on raw counts. I have been working with adata.layers['counts'] for count data and don't keep filtered-out cells/genes (they are easy to recreate anyway).

gokceneraslan commented 3 years ago

Yes, but if users need the raw counts of all genes, they shouldn't have to deal with "unnormalizing" things (which is non-trivial for beginners, though not for you 😄). So it's better to adapt scanpy to easier workflows than to force workflows around scanpy's limitations.

LuckyMD commented 3 years ago

I actually meant recreate the counts by reloading the data object ;). I wonder how important it is to keep genes that are filtered out due to being expressed in too few cells anyway. I haven't come across the need to query these... and we would normally regard this as background noise anyway, no?

gokceneraslan commented 3 years ago

I actually meant recreate the counts by reloading the data object ;)

I guess I think about this because I want the h5ad file to include absolutely everything, so that it can simply be used as a single file to distribute the "full dataset" (with annotations, noisy genes, raw and normalized expression, cell/gene representations, etc.). Imagine you upload a single h5ad file to GEO when you publish something and you're done, without having to think about how far users can "go back" from the h5ad file. Otherwise, yeah, it's possible to either unnormalize things or load the original data file.

we would normally regard this as background noise anyway, no?

This depends on how the filtering is done, I think. Some people keep only protein-coding genes in adata.X, which makes adata.raw even more important since all non-coding gene expression goes to adata.raw. Or mito/ribo genes are filtered out sometimes, which might be needed later on e.g. to redo QC etc.

ivirshup commented 3 years ago

As an alternative, I'd be up for deprecating raw altogether, as I think it causes more problems than it solves. I was talking about this recently with @falexwolf, who has come to a similar conclusion. This could be done on the anndata side, and would just warn whenever raw is set. If no raw is present, then none of the weird behavior should come up.

I wonder how important it is to keep genes that are filtered out due to being expressed in too few cells anyway.

Might be important for integration? But hopefully this could be solved by just knowing which annotation was used, so you can safely assume the missing values are 0. Also, what level of filtering are you doing here? I tend to go with min_cells=1.

I think we do need a more general solution for having a "feature-selected" subset of the data, but I think this can be done with a mask argument, e.g. sc.pp.pca(adata, mask="highly_variable") (I believe we've talked about this before). This does run into memory usage problems if you want to do a densifying transform on the data, though I have doubts about whether that can be a good representation of the data. This could technically be solved by using a block sparse matrix type, but I'm not sure if any practically usable implementations of this are currently available.
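
For illustration, a rough manual equivalent of the proposed mask argument (the mask= parameter is hypothetical here; everything else is existing API):

import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()  # ships with a boolean "highly_variable" var column

# Proposed: sc.pp.pca(adata, mask="highly_variable")
# Manual equivalent today: fit PCA on the masked subset, store it on the full object.
sub = adata[:, adata.var["highly_variable"]].copy()
sc.pp.pca(sub, n_comps=30)
adata.obsm["X_pca_hvg"] = sub.obsm["X_pca"]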

LuckyMD commented 3 years ago

I want the h5ad file to include absolutely everything, so that it can simply be used as a single file to distribute the "full dataset"

This makes more sense now. In that case, however, I would say that having just raw counts in adata.raw.X is fine, no? In the end you are distributing a data file. You can have your version of the normalized data in a layer... and you would be distributing your analysis code as well, so it's always clear how people should use the deposited data file, no?

Might be important for integration?

Integration works better with HVGs typically, so I don't think these super lowly expressed genes are so relevant here... I would often go with min_cells=20 or even 50 for larger datasets. In the end I reason that this value will be approximately related to the size of the smallest unique cellular identity you expect to find.

This does run into memory usage problems if you want to do a densifying transform on the data

Don't understand this entirely... and not sure what a block sparse matrix type is... but can't you subset sparse matrices based on masks? Should be fairly easy to just skip indices that are not in the mask... although I can imagine it might be slower than doing this on dense matrices.

Based on the above arguments, the main issue I currently see is the case @gokceneraslan mentioned, about MT genes or non-coding genes being stored in .raw. In this case you might need these genes also during an analysis pipeline (and not just for data storage), so you would like to have them in a separate "raw" container that is otherwise not touched. This clashes with the way raw is used in the current scanpy pipeline. I think we could deprecate the way .raw is used at the moment and use a .layer for this instead (maybe a designated "raw" layer?), but then introduce a new .frozenraw or something like that, where just the raw data is stored and which is essentially read-only after assignment?

I would be a bit hesitant to not have a replacement for .raw as a version of the data that is used for DE analysis but is not .X. This distinction is quite useful, as it is becoming more common to have one version of the data for embedding-based analysis and another for molecular analysis.

ivirshup commented 3 years ago

@gokceneraslan

I want the h5ad file to include absolutely everything, so that it can simply be used as a single file to distribute the "full dataset"

As a point about this, I don't think raw completely solves the problem. There are two reasons for this:

Only a different set of variables

Raw only differs from the main object by its variables. But we just as often want to remove observations (doublet detection, for example). To account for this, I think it makes sense to just have two different AnnData objects.

absolutely everything

I don't think we can really expect to have everything. There are always going to be analyses that require going back to the BAM file.

If "single file" is the issue, we could definitely allow something like:

import anndata as ad
import h5py

# Hypothetical API sketch: two AnnData objects stored side by side in one
# HDF5 file (anndata does not currently accept an h5py.Group here).
with h5py.File("analysis.h5", "r") as f:
    processed = ad.read_h5ad(f["processed"])
    raw = ad.read_h5ad(f["raw"])

@LuckyMD

Integration works better with HVGs typically

I'm thinking of the case where I have a few datasets saved as h5ad that I want to integrate. What if a highly variable gene in one dataset just isn't present in another? Is it because it wasn't found in that dataset at all, or because it was only present in a few cells? If it was only present in a few cells, how can I be sure a particular cell type wasn't just poorly represented in that dataset?

I feel like it's helpful to have all the measured genes present, so that when you do gather your datasets together you can select features from the full set.

This does run into memory usage problems if you want to do a densifying transform on the data

Don't understand this entirely...

I was thinking about what happens when you do something like sc.pp.scale, after which you don't have any 0s in your expression matrix anymore, so it has to be stored as a dense matrix. I believe this is why raw was originally introduced, since the normalization workflow back then was feature selection -> scale. It was wasteful to store the entire set of variables as a dense matrix, especially since you're only using a small subset of the features.
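
To make the memory point concrete, here is a small sketch (shapes and density are illustrative):

import numpy as np
import scipy.sparse as sp

# Scaling centers each gene by subtracting a (generally nonzero) mean from
# every entry, so the zeros disappear and the matrix must be stored densely.
X = sp.random(10_000, 2_000, density=0.05, format="csr", random_state=0)
dense = X.toarray()
scaled = (dense - dense.mean(axis=0)) / dense.std(axis=0)

print(f"sparse nonzeros: {X.data.nbytes / 1e6:.0f} MB")  # ~8 MB
print(f"dense, scaled:   {scaled.nbytes / 1e6:.0f} MB")  # ~160 MB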

and not sure what a block sparse matrix type is

Block sparse matrices are a good storage structure when you've got "blocks" of dense values in your matrix. For example, this is what the sparsity structure might look like in a random sparse matrix:

⡠⠄⠀⠨⡯⢀⠀⠐⢡⠀⠠⢀⠠⢂⠀⠐⠀⠐⠐⣢⠀⢂⠀⠈⠒⣂⠂⠀
⠆⢌⠁⡁⠈⠀⡀⠖⠂⠀⠁⠂⠀⠉⠐⡀⠀⠀⠈⠠⠄⠉⢀⡀⠀⠀⠀⠂
⠑⠀⠠⠀⠃⠀⠀⠅⢀⠠⠄⡀⠅⠂⢀⠪⠀⠦⢀⠀⢃⠈⢀⠌⠚⠀⠀⠃
⠁⠂⡃⠈⠀⢀⠀⠙⢀⠥⠀⠀⠄⡁⠀⠠⠈⠀⠈⠃⠂⠠⣀⠀⠈⣁⠁⠆
⡀⠐⠐⠠⠀⠐⢐⡄⣂⠀⠀⠘⠀⠀⠀⠠⠂⠀⡀⠨⠁⠀⠀⠀⠁⠁⠣⠤
⠀⡐⢀⢢⠀⠁⠔⠀⠁⠀⠃⠀⢀⢀⠐⠃⠄⠀⡇⠊⠄⠀⡈⢀⠀⠀⣀⠆
⠀⢐⣤⡄⠠⠂⠃⡈⠘⠀⠀⠀⡂⠰⢄⠊⡂⠀⠐⠂⠀⠄⠀⠀⢱⠩⠈⢀
⢁⠀⠑⠚⠁⠂⠂⠐⠁⠀⠀⢀⠠⠀⠐⠈⠈⡨⠀⠂⠀⡈⠈⠁⡐⣀⢁⠂
⠀⠀⠀⠁⠀⠠⠅⠁⡠⠇⢐⠀⠀⠖⢉⣀⠀⢀⠀⠠⡀⠀⡀⢰⠁⠂⢉⠂
⠀⠀⠀⠂⠠⢠⡁⡄⡌⠀⠀⠠⢅⠀⠄⠀⢕⢐⠀⠄⡂⢀⠂⠀⠂⠈⡸⠂
⠀⠀⠀⢐⡂⠀⢀⠐⠀⠰⡀⠑⡀⠀⠠⠀⠐⢀⠈⠆⠤⠄⢀⠀⣀⠢⡀⠀
⠂⢀⢪⢘⠀⢀⠩⠅⢄⠄⠠⠠⠐⠀⠀⢀⠠⠂⠀⠁⡘⠀⠀⠐⠢⡐⠀⠀
⢀⠌⡘⠘⠂⠄⢀⠀⢠⠔⠈⢀⠈⠀⠀⠠⡀⡂⠄⢀⠀⠀⠀⠁⠔⢈⢰⠀
⠁⠐⡀⡠⠀⠐⠠⠈⠀⢀⠀⠘⠂⠀⠀⠀⠐⠰⠄⡡⠠⡀⠀⠀⠂⠠⠁⠐

While this is one with blocks along the diagonal:

⠿⣧⣤⣤⣤⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠿⠿⠿⠿⣧⣤⣤⣤⣤⣤⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠛⠛⠛⠛⠛⠛⣤⣤⣤⣤⣤⣤⣤⣤⣤⣤⡄⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⣧⣤⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠛⢻⣶⣶⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⠛⢻⣶⡆⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⣿⣿

When you have blocks of dense values, you can just store those dense blocks as regular arrays along with offsets.
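
(For reference, scipy does ship a block-sparse type, BSR, though its operation support is limited, which may be the "practically usable" caveat above. A minimal sketch:)

import numpy as np
from scipy.sparse import bsr_matrix

# BSR stores dense sub-blocks plus their offsets: `data` holds the blocks,
# `indices` the block-columns, and `indptr` the block-row boundaries.
blocks = np.array([np.full((2, 2), 1.0), np.full((2, 2), 2.0), np.full((2, 2), 3.0)])
indices = np.array([0, 1, 2])    # block-column of each block -> block diagonal
indptr = np.array([0, 1, 2, 3])  # one block per block-row
A = bsr_matrix((blocks, indices, indptr), shape=(6, 6))
print(A.toarray())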

but can't you subset sparse matrices based on masks? Should be fairly easy to just skip indices that are not in the mask

Yes, this should be fine. The issue I was thinking of arises more when you want to do something like scaling your expression.

Or mito/ribo genes are filtered out sometimes, which might be needed later on e.g. to redo qc etc.

In this case you might need these genes also during an analysis pipeline (and not just for data storage), so you would like to have them in a separate "raw" container that is otherwise not touched.

If you don't want them to be used as features for any analyses on X, they could be stored in obsm. If you want to use them for some analyses (like DE), then they can just be masked out for the others.
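
A sketch of the obsm route on a toy object (all names here are illustrative):

import numpy as np
import pandas as pd
import anndata as ad

# Toy object standing in for a real dataset:
adata = ad.AnnData(
    X=np.random.poisson(1.0, size=(5, 4)).astype(np.float32),
    var=pd.DataFrame(index=["CD3D", "MT-CO1", "MT-ND1", "ACTB"]),
)

# Park mitochondrial genes in .obsm: they stay with the object (e.g. for QC)
# but are never picked up as features by PCA/neighbors/etc., which read .X.
mito = adata.var_names.str.startswith("MT-")
adata.obsm["X_mito"] = pd.DataFrame(
    np.asarray(adata[:, mito].X), index=adata.obs_names, columns=adata.var_names[mito]
)
adata = adata[:, ~mito].copy()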

I would be a bit hesitant to not have a replacement for .raw

I think layers satisfies this. It just doesn't allow you to have a different set of variables for DE (that is, not just a subset) than the rest of the object has. But having a different set of variables is exactly what makes raw difficult to work with.
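
For example, a layer-based stand-in for .raw could look like this (a sketch; the "for_de" layer name is arbitrary, and it assumes .raw's var set contains the filtered one):

import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()  # bundled example with .raw populated

# Same var set as .X, so none of the shape mismatches that .raw brings along:
adata.layers["for_de"] = adata.raw[:, adata.var_names].X.copy()

# DE on the designated layer; embeddings and plotting keep using .X:
sc.tl.rank_genes_groups(adata, "bulk_labels", layer="for_de", use_raw=False)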

introduce a new .frozenraw or something like that, where just the raw data is stored and which is essentially read-only after assignment?

I'd note that .raw is already supposed to be read-only.

LuckyMD commented 3 years ago

Thanks for the explanations @ivirshup! This makes quite a bit more sense to me now (the block sparse matrix stuff).

If I understand the .raw removal alternative correctly, then you would want to add masks to every operation in scanpy that is not DE and work with .layers? I assume that, e.g., MT or ribo genes are mainly removed for cellular-representation analysis. Some people will also want to remove them from DE analysis to get a set of results that is easier to interpret and carries less multiple-testing burden. It seems to me that adding masking like this would be quite a large endeavour, no?

What if a highly variable gene in one dataset just isn't present in another? Is it because it wasn't found in that dataset at all, or because it was only present in a few cells? If it was only present in a few cells, how can I be sure a particular cell type wasn't just poorly represented in that dataset?

I don't see this as such a big issue. If you assume anything filtered out was removed because it was predominantly 0, then it would not have been included in the HVG set of that dataset anyway. So you can assume it would not be in the HVG intersection for that dataset and if you add it, then a 0 for each cell would probably not be that problematic. And whether this was due to a particular cell type being poorly represented can be answered by the gene set that you do have for these cells. Typically there is sufficient gene-gene covariance that you still keep this signal somehow.

ivirshup commented 3 years ago

If I understand the .raw removal alternative correctly, then you would want to add masks to every operation in scanpy that is not DE and work with .layers?

Pretty much every function where you would want to use highly_variable.

It seems to me that adding masking like this would be quite a large endeavour, no?

I think it would be a similarly sized endeavor to adding highly_variable, except we can reuse the highly_variable code where it's already been implemented. I would expect this to be less effort than supporting raw, which is a constant maintenance burden, especially for anndata.

I think this logic could be added to the _get_obs_rep and _set_obs_rep functions.
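
Roughly like this (a hypothetical sketch; only _get_obs_rep itself is an existing scanpy internal):

from scanpy.get import _get_obs_rep

def _get_masked_obs_rep(adata, mask=None, **kwargs):
    """Hypothetical wrapper: fetch a data representation, optionally var-masked."""
    X = _get_obs_rep(adata, **kwargs)
    if mask is None:
        return X
    if isinstance(mask, str):  # e.g. mask="highly_variable" -> look up in .var
        mask = adata.var[mask].to_numpy()
    return X[:, mask]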


If you assume anything filtered out was removed because it was predominantly 0

I'm not sure I like having this assumption. Especially when a collaborator asks "what about gene X", but it just wasn't in the table I received. Maybe it's an annotation issue, maybe it wasn't expressed, or maybe it wasn't expressed globally at a high enough level – but could have been expressed in the cells of interest.

you can assume it would not be in the HVG intersection for that dataset and if you add it,

Is intersection the way to go? If you have cell types which are only present in some datasets, wouldn't you want to take the union?

Typically there is sufficient gene-gene covariance that you still keep this signal somehow.

I would agree that it is unlikely that this would have a huge effect on analyses like PCA or UMAP. When it comes time to do differential expression or show expression on an embedding, then it starts to be an issue.

ivirshup commented 3 years ago

One more point in favor of the mask argument: it would be useful in plotting, allowing things like plotting expression with some clusters masked out (#759).