Open grst opened 5 years ago
Thank you for the suggestions!
However, the details of "adding a layers attribute" will be convoluted, I guarantee. I would completely stay away from it. The .raw attribute really is just for freezing a second version of the simple data matrix. It is useful in the canonical single-cell workflow, and it is also useful in other cases where you deal with an interpretable, meaningful representation of the data that you want to keep, plus another, typically more compressed one that you want to do computations with. But I would have been happier if there had been a solution to the canonical workflow without any .raw attribute whatsoever.
The .layers attribute really is just there because of RNA velocity (La Manno et al.), loompy and our wish to also be able to use this type of information. @VolkerBergen then built scvelo as a follow-up to Gioele's work, which introduces a few conceptual novelties and presents a package that harmonizes with Scanpy and AnnData. In these workflows, we typically do not set .raw at all. We also don't actually subset to highly variable genes; we merely annotate the data matrix using sc.pp.highly_variable_genes. In that case, you really don't need to have layers in .raw.
Overall, AnnData has reached a level of complexity that it wasn't meant to have in the beginning. It was meant to directly extend the functionality of pd.DataFrame to an annotated data matrix, as there was nothing like that in the Python ecosystem.
So, without extremely strong arguments, I want to hold off on making Raw more complicated.
PS: Just adding a "normal" AnnData object to another AnnData object will be very complicated when it comes to views and backing... Also, you would need to "delete" (how is this even possible?) more than just the data-storing attributes. There is definitely no easy way of doing that. Adding a .shape attribute is easy and I'm happy if you make a PR.
I obviously thought of making Raw a base class for AnnData, but as it came later and would have been more work with only little savings in code redundancy (not even critical code), I stayed away from it.
What do you think?
Hi @falexwolf,
thanks for explaining all that!
I wasn't aware of the way highly_variable_genes works; I had still been using the deprecated filter_genes_dispersion. I agree that simply annotating the genes instead of filtering them out is the correct way to do it. This also instantly solves my problem, as I can just use layers to store the different versions of the data matrix.
P.S.: I'll propose the (trivial) PR for the .shape attribute.
P.P.S.: With the annotation functionality of highly_variable_genes, I could even imagine .raw being removed completely in a future version, with a custom layer (e.g. .layers['raw']) used for plotting instead.
Hi,
I just noticed this conversation, and I hope you don't mind if I butt in. I like the idea of potentially doing completely without .raw. I imagine it may be quite difficult to change the code in this way (and backward compatibility would not really be possible... scanpy 2.0?), but the .raw setup with potentially different matrix dimensions than the AnnData.X "layer" has caused me quite some frustration. I think it has also led to quite a bit of code complexity that could be avoided.
Thank you for your considerations! Let's keep it on the list of future considerations!
@LuckyMD @falexwolf Regarding future considerations, I think that the .raw model can be used for multi-modal data (e.g. CITE-seq), which I expect will become more common.
The idea behind .raw is that obs is shared with adata.X but not var, which is the basis for multi-modal data.
Thus, instead of removing .raw, I think that it can be repurposed.
That's a really good point actually, @fidelram. I wasn't thinking about multi-modal data with different .var at all. For multiple joint profiling data we would actually need an extension of .raw rather than the opposite. I feel like that would change the idea behind what .raw is though... it would need to be more general and not just a default for statistical testing and plotting of expression values. Essentially we'd need .layers with non-shared .var.
Indeed. As I see it, the basic structure is: layers share obs and var, multi_modals share obs only. I am hoping that most of the code for slicing and saving from .raw can be re-used for multi-modal.
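The rule stated above (layers share obs and var, modalities share obs only) could be sketched roughly as follows. Modality and MultiModal are hypothetical names invented for this illustration; they are not part of anndata, and the sketch ignores slicing, saving and backed mode.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Modality:
    """Hypothetical block: one matrix plus layers sharing its obs AND var."""
    X: np.ndarray
    var_names: list
    layers: dict = field(default_factory=dict)

    def add_layer(self, name, matrix):
        # A layer must match X on both axes.
        if matrix.shape != self.X.shape:
            raise ValueError(
                f"layer {name!r} has shape {matrix.shape}, expected {self.X.shape}"
            )
        self.layers[name] = matrix


@dataclass
class MultiModal:
    """Hypothetical container: modalities share obs only."""
    obs_names: list
    modalities: dict = field(default_factory=dict)

    def add_modality(self, name, modality):
        # Only the observation axis has to agree; var may differ freely.
        if modality.X.shape[0] != len(self.obs_names):
            raise ValueError(f"modality {name!r} does not match the shared obs axis")
        if modality.X.shape[1] != len(modality.var_names):
            raise ValueError(f"modality {name!r} has inconsistent var_names")
        self.modalities[name] = modality


cells = ["c0", "c1", "c2"]
rna = Modality(np.zeros((3, 5)), [f"gene{i}" for i in range(5)])
rna.add_layer("counts", np.ones((3, 5)))
protein = Modality(np.zeros((3, 2)), ["CD3", "CD4"])

data = MultiModal(cells)
data.add_modality("rna", rna)
data.add_modality("protein", protein)
```

The two shape checks are the whole point of the sketch: a layer is rejected unless both axes match, while a modality is accepted as long as the number of cells matches.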
I tried to store multiple layers in the .raw adata (I want to keep both raw UMI counts and normalized, log-transformed counts for plotting). That doesn't seem to be supported at the moment.
1) This seems pretty straightforward to implement. Add a _layers attribute, add a @property and update the __getitem__ function here: https://github.com/theislab/anndata/blob/2ab6e8edee811990048c66e9afdc507faecb7bb6/anndata/base.py#L517
I'm just not sure how I treat the case when the anndata is backed.
I could try to put a PR together.
2) Why not just use a normal adata object for raw and simply remove the obs, obsm and uns attributes when setting it? That way you wouldn't have to maintain a separate class with independent logic, and things like layers or adata.raw.shape would simply work.
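To make proposal 1) concrete, here is a rough sketch of what a Raw-like container with a _layers attribute, a property and an updated __getitem__ might look like. RawSketch is a hypothetical class written for this illustration; it is not anndata's Raw, it only supports slice indexing on both axes, and it does not address the backed case raised above.

```python
import numpy as np


class RawSketch:
    """Hypothetical Raw-like container; not anndata's actual class."""

    def __init__(self, X, var_names, layers=None):
        self._X = X
        self._var_names = list(var_names)
        self._layers = dict(layers or {})
        # Every layer must have the same shape as X.
        for name, layer in self._layers.items():
            if layer.shape != X.shape:
                raise ValueError(f"layer {name!r} does not match X")

    @property
    def X(self):
        return self._X

    @property
    def layers(self):
        return self._layers

    @property
    def shape(self):
        return self._X.shape

    def __getitem__(self, index):
        # Apply the same (obs, var) slices to X and to every layer so that
        # they stay aligned. Only slices are handled in this sketch.
        obs_idx, var_idx = index
        return RawSketch(
            self._X[obs_idx, var_idx],
            self._var_names[var_idx],
            {name: layer[obs_idx, var_idx] for name, layer in self._layers.items()},
        )


counts = np.arange(12).reshape(3, 4)
raw = RawSketch(np.log1p(counts), ["a", "b", "c", "d"], layers={"counts": counts})
sub = raw[:2, 1:3]
```

Here sub keeps a "counts" layer that was sliced together with X, and sub.shape is available through the property.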