gokceneraslan opened this issue 3 years ago
I like the thought... for exactly the reason you brought up, I recommended storing log-normalized data in `adata.raw` in my best practices workflow. That way, DE analysis and plotting are done on that data type rather than on raw counts. I have been working with `adata.layers['counts']` for count data and don't keep filtered-out cells/genes (they're easy to recreate anyway).
Yes, but if the user needs the raw counts of all genes, they shouldn't have to deal with "unnormalizing" things (which is non-trivial for beginners, but not for you 😄). So it's better to adapt scanpy to easier workflows, not the other way around because of scanpy's limitations.
I actually meant recreate the counts by reloading the data object ;). I wonder how important it is to keep genes that are filtered out due to being expressed in too few cells anyway. I haven't come across the need to query these... and we would normally regard this as background noise anyway, no?
> I actually meant recreate the counts by reloading the data object ;)
I guess I think about this because I want the h5ad file to include absolutely everything, so that it can simply be used as a single file to distribute the "full dataset" (with annotations, noisy genes, raw and normalized expression, cell/gene representations, etc.). Imagine you upload a single h5ad file to GEO when you publish something and you're done, without having to think about how far users can "go back" from that h5ad file. Otherwise, yeah, it's possible to either unnormalize things or load the original data file.
> we would normally regard this as background noise anyway, no?

This depends on how the filtering is done, I think. Some people keep only protein-coding genes in `adata.X`, which makes `adata.raw` even more important since all non-coding gene expression goes to `adata.raw`. Or mito/ribo genes are filtered out sometimes, which might be needed later on, e.g. to redo QC.
As an alternative, I'd be up for deprecating `raw` altogether, as I think it causes more problems than it solves. I was talking about this recently with @falexwolf, who has come to a similar conclusion. This could be done on the `anndata` side, just warning whenever `raw` is set. If no `raw` is present, then none of the weird behavior should come up.
> I wonder how important it is to keep genes that are filtered out due to being expressed in too few cells anyway.

Might be important for integration? But hopefully this could be solvable by just knowing what annotation was used, so you can safely assume the missing values are 0. Also, what level of filtering are you doing here? I tend to go with `min_cells=1`.
I think we do need a more general solution for having a "feature-selected" subset of the data, but I think this can be done with a `mask` argument, e.g. `sc.pp.pca(adata, mask="highly_variable")` (I believe we've talked about this before). This does run into memory usage problems if you want to do a densifying transform on the data, though I have doubts about whether that can be a good representation of the data anyway. That could technically be solved by using a block sparse matrix type, but I'm not sure any practically usable implementations of this are currently available.
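To make the idea concrete, here is roughly what such a `mask` argument could do internally, emulated by hand with current scanpy (just a sketch: the `mask=` keyword itself doesn't exist, and `adata` is assumed to be an existing object with `highly_variable` already annotated):

```python
import scanpy as sc

# Run PCA only on the masked-in genes, but keep the full AnnData intact
# by copying the embedding back onto the original object.
hv = adata[:, adata.var["highly_variable"]].copy()
sc.pp.pca(hv, n_comps=50)
adata.obsm["X_pca"] = hv.obsm["X_pca"]
```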
> I want the h5ad file to include absolutely everything, so that it can simply be used as a single file to distribute the "full dataset"

This makes more sense now. In that case, however, I would say that having just raw counts in `adata.raw.X` is fine, no? In the end you are distributing a data file. You can have your version of the normalized data in a layer... and you would be distributing your analysis code as well, so it's always clear how people should use this data file that is being deposited, no?
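For illustration, a minimal sketch of that convention with toy data (the file name and normalization choices here are arbitrary):

```python
import numpy as np
import anndata as ad
import scanpy as sc

# toy data: raw counts for all measured genes
counts = np.random.poisson(1.0, size=(100, 2000)).astype(np.float32)
adata = ad.AnnData(counts)

adata.raw = adata                             # raw counts of everything, for distribution
sc.pp.normalize_total(adata, target_sum=1e4)  # the normalized version lives in .X
sc.pp.log1p(adata)
adata.write_h5ad("dataset_for_geo.h5ad")      # one file: counts + normalized data + annotations
```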
> Might be important for integration?

Integration works better with HVGs typically, so I don't think these super lowly expressed genes are so relevant here... I would often go with `min_cells=20` or even `50` for larger datasets. In the end I reason that this value will be approximately related to the size of the smallest unique cellular identity you expect to find.
> This does run into memory usage problems if you want to do a densifying transform on the data

Don't understand this entirely... and not sure what a block sparse matrix type is... but can't you subset sparse matrices based on masks? It should be fairly easy to just skip indices that are not in the mask... although I can imagine it might be slower than doing this on dense matrices.
Based on the above arguments, the main issue I currently see is the case @gokceneraslan mentioned, of MT genes or non-coding genes being stored in `.raw`. In this case you might need these genes also during an analysis pipeline (and not just for data storage), so you would like to have them in a separate "raw" container that is otherwise not touched. This clashes with the way `raw` is used in the current scanpy pipeline. I think we could deprecate the way `.raw` is used at the moment and use a layer in `.layers` for this instead (maybe a designated "raw" layer?), but then introduce a new `.frozenraw` or something like that where just the raw data is stored and it's essentially read-only after assignment?
I would be a bit hesitant not to have a replacement for `.raw` as a version of the data that is used for DE analysis but isn't `.X`. This distinction is quite useful, as it is becoming more frequent that you have one version of the data for embedding-based analysis and one for molecular analysis.
@gokceneraslan
> I want the h5ad file to include absolutely everything, so that it can simply be used as a single file to distribute the "full dataset"

As a point about this, I don't think `raw` completely solves that problem, for two reasons:

1. `raw` only differs from the main object by variables, but we just as often want to remove observations (doublet detection, for example). To account for this, I think it makes sense to just have two different AnnData objects.
2. I don't think we can really expect to have everything. There are always going to be analyses that require going back to the BAM files.
If "single file" is the issue, we could definitely allow something like:
```python
import h5py
import anndata as ad

# hypothetical: two AnnData objects stored side by side in one HDF5 file
with h5py.File("analysis.h5") as f:
    processed = ad.read_h5ad(f["processed"])
    raw = ad.read_h5ad(f["raw"])
```
@LuckyMD
> Integration works better with HVGs typically

I'm thinking of the case where I have a few datasets saved as `h5ad` that I want to integrate. What if a highly variable gene in one dataset just isn't present in another? Is it because it wasn't found in that dataset at all, or because it was only present in a few cells? If it was only present in a few cells, how can I be sure a particular cell type wasn't just poorly represented in that dataset?

I feel like it's helpful to have all the measured genes present, so that when you do gather your datasets together you can select features from the full set.
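As an aside, this is also where an outer join in `anndata.concat` helps when gathering datasets: the union of genes is kept, and genes missing from a dataset are filled with zeros for its cells, so feature selection can happen on the combined object afterwards (a sketch, assuming `adatas` is a list of per-dataset AnnData objects that kept all measured genes):

```python
import anndata as ad

# Outer join keeps the union of genes across datasets; genes absent from one
# dataset are filled with 0 for its cells. Feature selection can then be done
# on `combined` with the full gene set available.
combined = ad.concat(adatas, join="outer", label="dataset", fill_value=0)
```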
> This does run into memory usage problems if you want to do a densifying transform on the data
>
> Don't understand this entirely...

I was thinking about what happens if you do something like `sc.pp.scale`, where you don't have any 0s in your expression matrix anymore, so it has to be stored as a dense matrix. I believe this is why `raw` was even introduced originally, since the normalization workflow then was feature selection -> scale. It was wasteful to store the entire set of variables as a dense matrix, especially since you're only using a small subset of the features.
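For concreteness, that original workflow looked roughly like this (a sketch, assuming `adata` already holds log-normalized data):

```python
import scanpy as sc

sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata.raw = adata                                       # all genes kept, still sparse
adata = adata[:, adata.var["highly_variable"]].copy()
sc.pp.scale(adata)                                      # densifies only the selected subset
```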
> and not sure what a block sparse matrix type is

Block sparse matrices are a good storage structure when you've got "blocks" of dense values in your matrix. For example, this is what the sparsity structure might look like in a random sparse matrix:
⡠⠄⠀⠨⡯⢀⠀⠐⢡⠀⠠⢀⠠⢂⠀⠐⠀⠐⠐⣢⠀⢂⠀⠈⠒⣂⠂⠀
⠆⢌⠁⡁⠈⠀⡀⠖⠂⠀⠁⠂⠀⠉⠐⡀⠀⠀⠈⠠⠄⠉⢀⡀⠀⠀⠀⠂
⠑⠀⠠⠀⠃⠀⠀⠅⢀⠠⠄⡀⠅⠂⢀⠪⠀⠦⢀⠀⢃⠈⢀⠌⠚⠀⠀⠃
⠁⠂⡃⠈⠀⢀⠀⠙⢀⠥⠀⠀⠄⡁⠀⠠⠈⠀⠈⠃⠂⠠⣀⠀⠈⣁⠁⠆
⡀⠐⠐⠠⠀⠐⢐⡄⣂⠀⠀⠘⠀⠀⠀⠠⠂⠀⡀⠨⠁⠀⠀⠀⠁⠁⠣⠤
⠀⡐⢀⢢⠀⠁⠔⠀⠁⠀⠃⠀⢀⢀⠐⠃⠄⠀⡇⠊⠄⠀⡈⢀⠀⠀⣀⠆
⠀⢐⣤⡄⠠⠂⠃⡈⠘⠀⠀⠀⡂⠰⢄⠊⡂⠀⠐⠂⠀⠄⠀⠀⢱⠩⠈⢀
⢁⠀⠑⠚⠁⠂⠂⠐⠁⠀⠀⢀⠠⠀⠐⠈⠈⡨⠀⠂⠀⡈⠈⠁⡐⣀⢁⠂
⠀⠀⠀⠁⠀⠠⠅⠁⡠⠇⢐⠀⠀⠖⢉⣀⠀⢀⠀⠠⡀⠀⡀⢰⠁⠂⢉⠂
⠀⠀⠀⠂⠠⢠⡁⡄⡌⠀⠀⠠⢅⠀⠄⠀⢕⢐⠀⠄⡂⢀⠂⠀⠂⠈⡸⠂
⠀⠀⠀⢐⡂⠀⢀⠐⠀⠰⡀⠑⡀⠀⠠⠀⠐⢀⠈⠆⠤⠄⢀⠀⣀⠢⡀⠀
⠂⢀⢪⢘⠀⢀⠩⠅⢄⠄⠠⠠⠐⠀⠀⢀⠠⠂⠀⠁⡘⠀⠀⠐⠢⡐⠀⠀
⢀⠌⡘⠘⠂⠄⢀⠀⢠⠔⠈⢀⠈⠀⠀⠠⡀⡂⠄⢀⠀⠀⠀⠁⠔⢈⢰⠀
⠁⠐⡀⡠⠀⠐⠠⠈⠀⢀⠀⠘⠂⠀⠀⠀⠐⠰⠄⡡⠠⡀⠀⠀⠂⠠⠁⠐
While this is one with blocks along the diagonal:
⠿⣧⣤⣤⣤⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠿⠿⠿⠿⣧⣤⣤⣤⣤⣤⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠛⠛⠛⠛⠛⠛⣤⣤⣤⣤⣤⣤⣤⣤⣤⣤⡄⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⣧⣤⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠛⢻⣶⣶⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⠛⢻⣶⡆⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⣿⣿
When you have blocks of dense values, you can just store those dense blocks as regular arrays along with offsets.
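For what it's worth, scipy's `bsr_matrix` (Block Sparse Row) is one existing format along these lines, though it requires a fixed block size, so whether it's practical for expression data is a separate question; a small illustration:

```python
import numpy as np
from scipy import sparse

# A block-diagonal matrix built from three dense 2x2 blocks.
dense = np.kron(np.eye(3), np.ones((2, 2)))
bsr = sparse.bsr_matrix(dense, blocksize=(2, 2))

print(bsr.data.shape)            # (3, 2, 2): the dense blocks themselves
print(bsr.indices, bsr.indptr)   # block-level column indices and row offsets
```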
> but can't you subset sparse matrices based on masks? It should be fairly easy to just skip indices that are not in the mask

Yes, this should be fine. The issue I was thinking of is more when you want to do something like `scale`-ing your expression.
> Or mito/ribo genes are filtered out sometimes, which might be needed later on, e.g. to redo QC.
>
> In this case you might need these genes also during an analysis pipeline (and not just for data storage), so you would like to have them in a separate "raw" container that is otherwise not touched.

If you don't want them to be used as features for any analyses on `X`, they could be stored in `obsm`. If you want to use them for some analyses (like DE), then they can just be masked out for the others.
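A minimal sketch of the `obsm` option (assuming an existing `adata` with human gene symbols, where mitochondrial genes start with "MT-"):

```python
# Move mitochondrial genes out of the feature space, but keep their
# per-cell values around for QC or plotting later.
mito = adata.var_names.str.startswith("MT-")
adata.obsm["mito_counts"] = adata[:, mito].X.copy()
adata.uns["mito_gene_names"] = adata.var_names[mito].tolist()
adata = adata[:, ~mito].copy()
```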
> I would be a bit hesitant not to have a replacement for `.raw`

I think `layers` satisfies this. It just doesn't allow you to have a different set of variables (that is, not just a subset) for DE than the rest of the object has. But having a different set of variables is exactly what makes `raw` difficult to work with.
> introduce a new `.frozenraw` or something like that where just the raw data is stored and it's essentially read-only after assignment?

I'd note that `.raw` is already supposed to be read-only.
Thanks for the explanations @ivirshup! This makes quite a bit more sense to me now (the block sparse matrix stuff).
If I understand the `.raw` removal alternative correctly, then you would want to add masks to every operation in scanpy that is not DE, and work with `.layers`? I assume that, e.g., MT or ribo genes are mainly removed for cellular representation analysis. Some people will also want to remove them from DE analysis to get a set of results that is easy to interpret and carries a smaller multiple-testing burden. It seems to me that adding masking like this would be quite a large endeavour, no?
> What if a highly variable gene in one dataset just isn't present in another? Is it because it wasn't found in that dataset at all, or because it was only present in a few cells? If it was only present in a few cells, how can I be sure a particular cell type wasn't just poorly represented in that dataset?

I don't see this as such a big issue. If you assume anything filtered out was removed because it was predominantly 0, then it would not have been included in the HVG set of that dataset anyway. So you can assume it would not be in the HVG intersection for that dataset, and if you add it back, a 0 for each cell would probably not be that problematic. And whether this was due to a particular cell type being poorly represented can be answered by the gene set that you do have for these cells. Typically there is sufficient gene-gene covariance that you still keep this signal somehow.
> If I understand the `.raw` removal alternative correctly, then you would want to add masks to every operation in scanpy that is not DE, and work with `.layers`?

Pretty much every function where you would want to use `highly_variable`.
> It seems to me that adding masking like this would be quite a large endeavour, no?

I think it's a similarly sized endeavor to adding `highly_variable`, except we can reuse the `highly_variable` code where it's already been implemented. I would expect this to be less effort than supporting `raw`, which is a constant maintenance burden, especially for `anndata`.
I think this logic could be added to the `_get_obs_rep` and `_set_obs_rep` functions.
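For example, the masking could hook in right after the representation is fetched, something like this (purely a sketch: `_get_obs_rep` is an internal scanpy helper, and the `mask` handling and wrapper name here are made up):

```python
import numpy as np
from scanpy.get import _get_obs_rep  # internal scanpy helper

def _get_masked_rep(adata, mask=None, **kwargs):
    """Hypothetical: fetch .X or a layer via _get_obs_rep, then drop masked-out genes."""
    X = _get_obs_rep(adata, **kwargs)
    if mask is not None:
        if isinstance(mask, str):                          # e.g. mask="highly_variable"
            mask = np.asarray(adata.var[mask], dtype=bool)
        X = X[:, mask]
    return X
```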
> If you assume anything filtered out was removed because it was predominantly 0
I'm not sure I like having this assumption. Especially when a collaborator asks "what about gene X", but it just wasn't in the table I received. Maybe it's an annotation issue, maybe it wasn't expressed, or maybe it wasn't expressed globally at a high enough level – but could have been expressed in the cells of interest.
> you can assume it would not be in the HVG intersection for that dataset, and if you add it back
Is intersection the way to go? If you have cell types which are only present in some datasets, wouldn't you want to take the union?
> Typically there is sufficient gene-gene covariance that you still keep this signal somehow.
I would agree that it is unlikely that this would have a huge effect on analyses like PCA or UMAP. When it comes time to do differential expression or show expression on an embedding, then it starts to be an issue.
One more point for the mask argument: it would also be useful in plotting, to allow things like plotting expression with some clusters masked out (#759).
It is common to store raw counts (= unnormalized) of all measured genes under `adata.raw`, while having normalized and unnormalized expression of a subset of genes (might be only protein-coding genes, or all genes except ribosomal and mitochondrial, etc.) at `adata.X` and `adata.layers['counts']`, respectively. This, however, gives rise to a lot of trouble in plotting, since visualizing raw counts is not a great idea due to the dynamic range. It is super annoying to pass `use_raw=False` to a lot of functions. Furthermore, weird `rank_genes_groups` outputs caused by raw counts might go unnoticed because of this (happened to me many times).

I think there was a discussion somewhere about switching to `use_raw=False` by default in all functions, but this may potentially break things since it's a significant behavior change. This might be reasonable for Scanpy 2.0, but not in 1.x, I assume.

My suggestion is to have a `use_raw` option under `sc.settings` (i.e. the global `ScanpyConfig` instance) which is `None` by default and can be set to `False` (e.g. `sc.settings.use_raw = False`), which would then affect all the functions with a `use_raw` argument. This way we don't break the behavior but still have a reasonable way to turn this thing off :)

Let me know what you think.
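A sketch of how such a setting could be resolved inside a function (hypothetical: `sc.settings.use_raw` is only the proposal above, and `_resolve_use_raw` is a made-up helper name):

```python
import scanpy as sc

def _resolve_use_raw(adata, use_raw=None):
    """Hypothetical helper: decide whether a function should use adata.raw."""
    if use_raw is not None:                                  # explicit argument wins
        return use_raw
    global_default = getattr(sc.settings, "use_raw", None)  # the proposed setting
    if global_default is not None:
        return global_default
    return adata.raw is not None                             # current default behavior
```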