Open LuckyMD opened 5 years ago
I just noticed this is not implemented. Maybe we could just mirror the sc.pl.scatter() setup for sc.pl.violin(). If this is a good idea, I could quickly implement this...
Not only violin but most plotting options do not consider what you have in .var. I think this guarantees consistency in that what you see, for example in a violin plot, is always per obs. I think that taking the transpose, as you did, should be the right solution.
So you are saying this is a feature and not a bug 😄. Does that mean you think one should not be able to plot .var covariates by default?
Of course! It would be wild if the plotting function internally transposed the AnnData object whenever one of the provided keys exists in .var. sc.pl.violin(adata.T, 'key') is 100% the right thing to do.
I think the docs are a bit improvable though:
keys : str or list of str
    Keys for accessing variables of .var_names or fields of .obs.
The mention of var_names here means that you can select one or more genes to plot. How can we phrase that better? Maybe we should also add an example that uses transposing.
That's the internal structure that is in the sc.pl.scatter() function. Could just replicate that.
ouch, it’s pretty error prone to just guess! What if a column is in both .var and .obs? People will never figure out what they need to do in order to get what they want.
I don’t like replicating that or that it ever went into any function. Explicit is better than implicit.
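To make the ambiguity concrete, here is a minimal sketch of the kind of implicit lookup being criticized. This is not the actual sc.pl.scatter() code: guess_axis and the column names are hypothetical, and plain lists stand in for the column names of .obs and .var.

```python
def guess_axis(key, obs_columns, var_columns):
    """Implicit lookup: try .obs first, then silently fall back to .var.

    Hypothetical helper for illustration only, not scanpy's implementation.
    """
    if key in obs_columns:
        return "obs"
    if key in var_columns:
        return "var"
    raise KeyError(key)


# Stand-ins for adata.obs.columns and adata.var.columns (made-up names).
obs_columns = ["n_counts", "louvain"]
var_columns = ["n_counts", "dropout_per_gene"]

# Key that exists only in .var: the guess happens to be right.
axis_unique = guess_axis("dropout_per_gene", obs_columns, var_columns)

# Key present in both: .obs always wins, so the per-gene column is unreachable.
axis_shared = guess_axis("n_counts", obs_columns, var_columns)
```

With a shared name like n_counts, the caller has no way to ask for the .var column, and nothing tells them that the .obs one was used instead.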
We could throw a nice error if the column isn’t in .obs but is in .var instead, like
You specified column “dropout_per_gene” which is not in .obs, but in .var. Did you mean to call sc.pl.violin(adata.T, ...)?
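A rough sketch of what such a check could look like, under stated assumptions: check_obs_keys is a hypothetical helper, not an existing scanpy function, and its list arguments stand in for adata.obs.columns, adata.var_names and adata.var.columns. The wording of the fallback error is also made up.

```python
def check_obs_keys(keys, obs_columns, var_names, var_columns):
    """Raise a KeyError with a hint when a key only exists in .var.

    Hypothetical helper; the arguments stand in for adata.obs.columns,
    adata.var_names and adata.var.columns.
    """
    for key in keys:
        if key in obs_columns or key in var_names:
            continue  # valid: a field of .obs or a variable name (e.g. a gene)
        if key in var_columns:
            raise KeyError(
                f"You specified column {key!r} which is not in .obs, but in .var. "
                "Did you mean to call sc.pl.violin(adata.T, ...)?"
            )
        raise KeyError(f"Could not find key {key!r} in .var_names or .obs.")


# Made-up column and gene names for the example.
obs_columns = ["louvain"]
var_names = ["GeneA", "GeneB"]
var_columns = ["dropout_per_gene"]

# Keys that are .obs fields or gene names pass silently.
check_obs_keys(["louvain", "GeneA"], obs_columns, var_names, var_columns)

# A key that only exists in .var triggers the hint.
try:
    check_obs_keys(["dropout_per_gene"], obs_columns, var_names, var_columns)
except KeyError as err:
    message = err.args[0]
```

The function never guesses which axis was meant. It only tells the user where the column actually lives and what call would plot it.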
@falexwolf and I discussed this at the time, and came to the conclusion we should be able to assume that the .var columns will not be ambiguously named. And as it defaults to .obs, it works the same as it would have done before anyway. I see the issue though, as scanpy functions do write to .obs and .var. On the one hand we would have to check that unambiguous naming is always upheld, which can be difficult with scanpy growing, and on the other hand users may not be aware of what columns there are already in .obs when they name .var columns.
While I like the workaround with adata.T, it feels strange to tell people to transpose their data to overcome a technical restriction in the plotting function. Explicit is the better solution there.
you wanted to say “implicit” right?
and I disagree, transposing here is exactly the right thing to do: when you want to use your genes as observations in a plot you should transpose your AnnData so that they are the dimension made for observations!
also, there’s no “technical restriction”. it’s about API design; we could also introduce a switch_var_and_obs = False parameter, but i feel .T is simpler.
No, I meant an argument like over='cells' that would default to .obs covariates. So explicitly telling the function. Implicit would be nice, but you pointed out that it would require guessing, which has its own issues. I'm on the fence about your solution of doing neither and requiring the user to use adata.T.
I do see your rationale behind 'variables' and 'observations' though. I'm just not entirely sure that is clear to the user in the same way it is clear to the developer. As a user I see cells and genes in my dataset and may not be aware that one of them is treated as the variables that describe the other. Then the question is: do you want to be as user-friendly as possible (I'll call it 'the R way') or stick with consistent conventions that may not be clear to everyone ('the numpy way'?). Both can cause frustrations and both have benefits.
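For comparison, the explicit over argument suggested above might look roughly like this. It is only a sketch: over is the proposed parameter, not something scanpy provides, select_annotations is a hypothetical name, and plain dicts stand in for the .obs and .var DataFrames.

```python
def select_annotations(key, obs, var, over="cells"):
    """Look up `key` on the axis the caller names explicitly.

    Hypothetical sketch: `over` is the proposed argument, not an existing
    scanpy parameter; `obs` and `var` are dicts standing in for the
    .obs and .var DataFrames.
    """
    if over == "cells":
        table, name = obs, ".obs"
    elif over == "genes":
        table, name = var, ".var"
    else:
        raise ValueError(f"over must be 'cells' or 'genes', got {over!r}")
    if key not in table:
        raise KeyError(f"{key!r} not found in {name}")
    return table[key]


# Made-up values: the same column name exists on both axes.
obs = {"n_counts": [1200, 950, 1010]}  # one value per cell
var = {"n_counts": [310, 2850]}        # one value per gene

per_cell = select_annotations("n_counts", obs, var)                # defaults to .obs
per_gene = select_annotations("n_counts", obs, var, over="genes")  # explicit .var
```

Because the caller names the axis, a column that exists in both .obs and .var stays reachable on either side, at the cost of one more parameter on the plotting function.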
what if i told you that we can have our cake and eat it too? as said before:
We could throw a nice error if the column isn’t in .obs but is in .var instead, like
You specified column “dropout_per_gene” which is not in .obs, but in .var. Did you mean to call sc.pl.violin(adata.T, ...)?
Near zero frustration, because people can just do what the error tells them.
I like @flying-sheep's very last solution. To enable this for truly large-scale data and AnnData objects that are backed on disk, we need a much more efficient transposition implementation, which will probably need to return a view. That's problematic as it will break backwards compat (.T returns a copy these days). But it's good as it will allow adding fields to .var.
@LuckyMD: At the time, when you mentioned that you wanted to plot over genes in scatter, I was fine with having the scatter wrapper and assuming no ambiguity in obs and var keys. Now, I'd advocate for @flying-sheep's solution. Of course, we'll maintain the feature in pl.scatter when refactoring its code (a lot of it became redundant after fidel introduced the completely rewritten scatter plots).
Okay... so shall I make a PR for this? Or do you quickly want to change it @flying-sheep?
Sure, go ahead!
Hey! I thought plotting .var columns in sc.pl.violin() worked previously (and the error message seems to suggest this as well). I am doing the following:
and I get this error:
The whole thing works for:
So it's clearly just not taking .var columns for sc.pl.violin(). I've also reproduced this with adata = sc.datasets.blobs().