Small tweaks to gene expression figure

jjc2718 commented 3 years ago

Switching to scatter plot style (X/O) markers instead of red dots to distinguish genes that overlap between gene sets (see greenelab/mpmp-manuscript#41).

jjc2718 commented 3 years ago

Thanks for the feedback!

So it looks like there is a free signup for Biorender. Though I remember you said it wasn't free, so is this just a trial? Also, can you put these into a ppt? I assume you'd make the figure in Biorender and export it to a png to then add to slides?

To use the figures in a publication, they require you to have the paid version. Might be worth trying the free version first, though, if you're interested - I think you can still use the free figures in presentations and posters, etc. (and yes, like you said, you should just be able to export to a .png).

I think the "X"s are a nice way to denote overlap; however, one thing that is a little hard is that you have True O's that are orange in the same location as red X's. I wonder if you want to have False be grey, and then the Trues can be a blue/cool color and the overlap Trues red. I guess it depends on what your takeaway from this slide is. If it's mainly True vs. False, then perhaps this isn't too necessary. But if you want to distinguish between the True vs. True overlap, then I think having more distinct colors would be helpful. Or maybe larger markers.

This makes sense! I don't know if I'm too worried about the overlap - I mostly want the X's to be visible in the random and most mutated plots. The main point is to show that all of the genes that are significantly predictable in those gene sets are also in the Vogelstein gene set, other than NSD1 in the random dataset. I don't care as much about being able to distinguish in the rightmost plot showing the Vogelstein genes.

I'll play around with some other ways of distinguishing between points, though, and see if any of them look better.

Can you remind me: these overlap genes are those genes that 1) were found to be most predictive (i.e. data type X was able to predict well whether or not this gene is mutated) AND 2) were curated in Vogelstein et al. as being associated with cancer?

Yeah, exactly. Before we added this info to the figure, the issue was that most of the genes showing up as well-predicted in the other gene sets (top and random) were also captured in the Vogelstein gene set, and we wanted a simple visual way to make that obvious.
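For reference, the overlap rule described above could be sketched roughly like this. This is only an illustration of the idea, not the code in the repo: the function name `assign_markers`, the placeholder gene names, and the toy significance flags are all hypothetical.

```python
def assign_markers(results, reference_genes):
    """Pick a marker per gene for one gene set's plot.

    results: dict mapping gene name -> True if the gene is significantly
        predictable, False otherwise (toy stand-in for real results).
    reference_genes: set of genes in the reference gene set
        (here standing in for the Vogelstein et al. genes).

    Returns a dict mapping gene name -> "X" when the gene is both
    significant and in the reference set, "o" otherwise.
    """
    markers = {}
    for gene, significant in results.items():
        overlaps = significant and gene in reference_genes
        markers[gene] = "X" if overlaps else "o"
    return markers


# Hypothetical toy inputs; placeholder names, not real results.
reference = {"GENE_A", "GENE_C"}
random_set_results = {"GENE_A": True, "GENE_B": True, "GENE_C": False}

markers = assign_markers(random_set_results, reference)
print(markers)
```

In this toy example only `GENE_A` gets an "X": `GENE_B` is significant but not in the reference set, and `GENE_C` is in the reference set but not significant. The resulting strings could then be handed to whatever plotting call draws the points.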

I think I'm having a little trouble understanding the Venn diagrams at the bottom - what is the difference between the two? Does the top one include all genes in the training set, vs. the bottom, which only includes the Vogelstein et al. genes or most mutated genes in the training data set?

I assume you mean the Venn diagrams in 02_classify_mutations/plot_expression_gene_sets.ipynb? The top one is all the genes in all of the gene sets (showing that there's not too much overlap in general), and the bottom one is all the genes where the model trained on true labels significantly outperforms the model trained on shuffled labels (showing that for the genes where our models perform well, there is a lot of overlap - or in other words, the Vogelstein et al. genes capture most of the relevant signal from the other two gene sets as well).

This is basically the same thing that the markers in the volcano plots show, just a different way of visualizing it.
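As a rough sketch of the difference between the two diagrams, the counts behind each one could be computed along these lines. This is an assumption-laden illustration rather than the notebook's actual code: `venn_region_counts`, `restrict_to_significant`, the gene set contents, and the `significant` set are all made up for the example.

```python
def venn_region_counts(gene_sets):
    """Count genes in each exclusive region of a Venn diagram.

    gene_sets: dict mapping gene set name -> set of gene names.
    Returns a dict mapping frozenset of gene set names -> number of genes
    that belong to exactly those gene sets.
    """
    all_genes = set().union(*gene_sets.values())
    counts = {}
    for gene in all_genes:
        members = frozenset(
            name for name, genes in gene_sets.items() if gene in genes
        )
        counts[members] = counts.get(members, 0) + 1
    return counts


def restrict_to_significant(gene_sets, significant):
    """Keep only the genes where the true-label model beats the shuffled one."""
    return {name: genes & significant for name, genes in gene_sets.items()}


# Hypothetical toy gene sets with placeholder gene names.
gene_sets = {
    "vogelstein": {"A", "B", "C"},
    "most_mutated": {"B", "D", "E"},
    "random": {"C", "F", "G"},
}
# Hypothetical set of genes with significant performance.
significant = {"A", "B", "C", "F"}

# Top diagram: all genes in all gene sets.
all_counts = venn_region_counts(gene_sets)
# Bottom diagram: only the significantly predictable genes.
sig_counts = venn_region_counts(restrict_to_significant(gene_sets, significant))
```

With these toy inputs, most genes in the first count sit in a single gene set, while in the second count nearly every remaining gene also belongs to the `vogelstein` set, which is the pattern the bottom diagram is meant to show.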

jjc2718 commented 3 years ago

After thinking about it a bit I think I'm going to stick with the X vs. O for now, since I think it gets across the message that we want, but we may end up changing this in the future.

greenelab / mpmp

Small tweaks to gene expression figure #67