scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

layers for `.raw` #96

Open grst opened 5 years ago

grst commented 5 years ago

I tried to store multiple layers in the .raw adata (I want to keep both raw UMI counts and normalized, log-transformed counts for plotting). That doesn't seem to be supported at the moment.

1) This seems pretty straightforward to implement. Add a _layers attribute, add a @property and update the __getitem__ function here: https://github.com/theislab/anndata/blob/2ab6e8edee811990048c66e9afdc507faecb7bb6/anndata/base.py#L517

I'm just not sure how to handle the case where the AnnData is backed.

I could try to put a PR together.

2) Why not just use a normal adata object for raw and simply remove the obs, obsm and uns attributes when setting it? That way you wouldn't have to maintain a separate class with independent logic, and things like layers or adata.raw.shape would simply work.

falexwolf commented 5 years ago

Thank you for the suggestions!

However, the details of "adding a layers attribute" will be convoluted, I guarantee. I would completely stay away from it. .raw really is just for freezing a second version of the simple data matrix. It is useful in the canonical single-cell workflow, and it is also useful in other cases where you deal with an interpretable, meaningful representation of the data that you want to keep around and another one, typically a more compressed one, that you want to do computations with. But I would have been happier if there had been a solution to the canonical workflow without any .raw attribute whatsoever.

The .layers attribute really is just there because of RNA velocity (La Manno et al.), loompy and our wish to also be able to use this type of information. @VolkerBergen then built scvelo as a follow-up to Gioele's work, which introduces a few conceptual novelties and presents a package that harmonizes with Scanpy and AnnData. In these workflows, we typically do not set .raw at all. We also don't actually subset to highly variable genes; we merely annotate the data matrix using sc.pp.highly_variable_genes. In that case, you really don't need to have layers in .raw.

Overall, AnnData has reached a state of complexity that it wasn't meant to have in the beginning. It was meant to directly extend the functionality of pd.DataFrame to an annotated data matrix, as there was nothing like that in the Python ecosystem.

So, without extremely strong arguments, I want to hold off from making Raw more complicated.

PS: Just adding a "normal" AnnData object to another AnnData object will be very complicated when it comes to views and backing... Also you would need to "delete" (how is this even possible) more than just the data-storing attributes. There is definitely no easy solution doing that. Adding a .shape attribute is easy and I'm happy if you make a PR.

I obviously thought of making Raw a base class for AnnData, but as it came later and would have been more work with only little savings in code redundancy (not even critical code), I stayed away from it.

What do you think?

grst commented 5 years ago

Hi @falexwolf,

thanks for explaining all that!

I wasn't aware of how highly_variable_genes works; I had still been using the deprecated filter_genes_dispersion. I agree that simply annotating the genes instead of filtering them out is the correct way to do it. This also instantly solves my problem, as I can just use layers to store the different versions of the data matrix.

P.S.: I'll propose the (trivial) PR for the .shape attribute.

P.P.S.: With the annotation functionality of highly_variable_genes, I could even imagine .raw being removed completely in a future version, with a custom layer (e.g. .layers['raw']) used for plotting instead.

LuckyMD commented 5 years ago

Hi, I just noticed this conversation, and I hope you don't mind if I butt in. I like the idea of potentially doing completely without .raw. I imagine it may be quite difficult to change the code in this way (and backward compatibility would not really be possible... scanpy 2.0?), but the .raw setup with potentially different matrix dimensions than the AnnData.X "layer" has caused me quite some frustration. I think it has also led to quite a bit of code complexity that could be avoided.

falexwolf commented 5 years ago

Thank you for your considerations! Let's keep it on the list of future considerations!

fidelram commented 5 years ago

@LuckyMD @falexwolf About future considerations I think that the .raw model can be used for multi-modal data (e.g. CITE-seq) which, I expect, will become more common.

The idea behind .raw is that obs is shared with adata.X but not var, which is the basis for multi-modal data.

Thus, instead of removing .raw, I think that it can be repurposed.

LuckyMD commented 5 years ago

That's a really good point actually @fidelram. I wasn't thinking about multi-modal data with different .var at all. For multiple joint profiling data we would actually need an extension of .raw rather than the opposite. I feel like that would change the idea behind what .raw is though... it would need to be more general and not just a default for statistical testing and plotting of expression values. Essentially we'd need .layers with non-shared .var.

fidelram commented 5 years ago

Indeed. As I see it, the basic structure is: layers share obs and var, multi_modals share obs only. I am hoping that most of the code for slicing and saving from .raw can be re-used for multi-modal.
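The structure described above could be sketched roughly as follows. `Modality` and `MultiModalSketch` are hypothetical names invented for this illustration; they are not part of anndata, and the sketch ignores views, backing and saving entirely:

```python
from dataclasses import dataclass, field

import numpy as np
import pandas as pd


@dataclass
class Modality:
    """One data matrix with its own variable annotation."""
    X: np.ndarray
    var: pd.DataFrame


@dataclass
class MultiModalSketch:
    """Hypothetical container: obs is shared, var is per modality."""
    obs: pd.DataFrame
    modalities: dict[str, Modality] = field(default_factory=dict)

    def add(self, name: str, X: np.ndarray, var: pd.DataFrame) -> None:
        # Rows must match the shared obs; columns must match this modality's var.
        if X.shape != (len(self.obs), len(var)):
            raise ValueError(f"{name}: expected shape {(len(self.obs), len(var))}")
        self.modalities[name] = Modality(X, var)

    def subset_obs(self, idx) -> "MultiModalSketch":
        # Slicing observations applies to every modality; var is left untouched.
        idx = np.asarray(idx)
        out = MultiModalSketch(obs=self.obs.iloc[idx])
        for name, m in self.modalities.items():
            out.modalities[name] = Modality(m.X[idx], m.var)
        return out


# Usage: a CITE-seq-like toy example with 4 cells, 6 genes and 2 proteins.
obs = pd.DataFrame(index=[f"cell{i}" for i in range(4)])
mm = MultiModalSketch(obs=obs)
mm.add("rna", np.zeros((4, 6)), pd.DataFrame(index=[f"gene{j}" for j in range(6)]))
mm.add("protein", np.ones((4, 2)), pd.DataFrame(index=["CD3", "CD19"]))
sub = mm.subset_obs([0, 2])
```

Layers would sit one level lower in this picture: several matrices inside a single modality that share both `obs` and that modality's `var`.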
