Extract pathway overdispersion data on a per cell basis

chrisley90 commented 2 years ago

Hey team,

First of all, thanks for creating and maintaining such a useful package for analysis of scRNA data. I hesitated to bring this up as an issue, because I feel like it is 100% my own shortcomings as a user more so than any issue with the package.

I've used testPathwayOverdispersion() to identify differentially expressed genesets and the results are great, but my question is: how do I extract these data (the adjusted p-values, Z-scores, etc.) on a per-cell basis so I can generate publication-quality figures with them? The results are very clear in the webapp, and I can see the genesets associated with each particular aspect, but I guess I'm hoping to find some sort of data frame with the stats all spelled out. I've tried combing through the closed issues and tutorials and haven't been able to answer this question, so I'm hoping you can help me out. If it's something very obvious that I overlooked, my apologies.

Thanks again for everything and I appreciate you taking the time to help!

Chris

evanbiederstedt commented 2 years ago

Hi @chrisley90

Thanks for the kind words.

I actually think this relates to this feature request: https://github.com/kharchenkolab/pagoda2/issues/128

I created some basic buttons for exporting CSVs as part of that ticket. The button looks like the old floppy disks which they exhibit in museums nowadays.

Please see the screenshots. I think this is what you want, right?

If something else, let me know...with the caveat that we're not trying to spend time implementing new features for Pagoda2 as we have another frontend application in the works.

Thanks, Evan

chrisley90 commented 2 years ago

Hey @evanbiederstedt ,

Thanks for the quick response. Fortunately, or unfortunately, I am intimately familiar with ye olde floppy disk.

Two parts to my response, and maybe the first answers my question. When I export the data I get the pathway name, the corrected score and the count. What exactly are those values representative of? I'm guessing "corrected score" is a corrected Z-score? I see in the function itself that adjusted p-values are generated, but I can't seem to find those anywhere.

What I want is the ability to pull the raw data used to generate the aspect heatmap in the middle column of the app, as well as any statistics associated with the aspects themselves, or the genesets within the aspects so when we go to publish we have those numbers to back up our claims.

In my head each cell has an aspect score and this is what is used to generate the heatmap, and then any adj. p values for the pathway enrichment/aspect generation. Is this something that I could access from the pagoda environment with the correct commands?

Completely understand if you're not looking to implement something like this, but if it's something I could handle on my end with a little work that would be great.

Thanks again for your help, sorry to tear you away from your other projects.

Chris

evanbiederstedt commented 2 years ago

Hi Chris

I should caveat all of this by saying that I know why biologists/computational biologists love doing GO pathway analysis, but I think it really should only be used for exploratory data analysis. I'm skeptical when it's written in papers that "we found evidence of this mechanism because of the hit we see via this GO pathway". Today, this feels like a procedure everyone knows is wrong, but we do it anyway because it's easy to do and there is software to do it.

> What exactly are those values representative of? I'm guessing corrected Z-score for corrected score? I see in the function itself that adjusted pvalues are generated but I can't seem to find those anywhere.

This is a fair question. If you go through the walkthrough, the step calculating DE, i.e.

r$getDifferentialGenes(type='PCA', verbose=TRUE, clusterType='community')

caches those results and makes them accessible via the frontend application.

You'll see the adjusted p-values calculated in that function

https://github.com/kharchenkolab/pagoda2/blob/main/R/Pagoda2.R#L885-L900
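For anyone unsure what "adjusted p-values" means in this context, here is a minimal base-R sketch of the general idea. It is not pagoda2's implementation (the linked source is the reference for that). It assumes a two-sided test against a standard normal null and a Benjamini-Hochberg correction, and the Z-scores and gene set names are made up for illustration.

```r
# Hypothetical Z-scores for five gene sets (illustrative values only)
z <- c(geneset1 = 4.5, geneset2 = 2.1, geneset3 = 0.3,
       geneset4 = -3.2, geneset5 = 1.0)

# Two-sided p-values under a standard normal null (an assumption of this sketch)
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Benjamini-Hochberg adjustment using base R's stats::p.adjust
p_adj <- p.adjust(p, method = "BH")

# Collect everything in one table, one row per gene set
stats_table <- data.frame(z = z, p = p, p_adj = p_adj)
print(stats_table)
```

If the values you export turn out to be corrected Z-scores rather than p-values, the same `pnorm()` step converts them back to the p-value scale.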

RE: interactive DE calculations I think it's calculated entirely on the frontend with JS actually: https://github.com/kharchenkolab/pagoda2/blob/b8edcbf1139cfa818753cacf9e420a014b1aa4c3/inst/rookServerDocs/js/lightDeWorker.js#L47-L231

> What I want is the ability to pull the raw data used to generate the aspect heatmap in the middle column of the app, as well as any statistics associated with the aspects themselves, or the genesets within the aspects so when we go to publish we have those numbers to back up our claims.

This makes sense, and it's totally possible to do this all in R. What you'll need to do is check how the Pagoda2 object is parsed into a data structure for the web interaction, using make.p2.app(). https://github.com/kharchenkolab/pagoda2/blob/b8edcbf1139cfa818753cacf9e420a014b1aa4c3/R/pipelineHelpers.R#L347-L365

> In my head each cell has an aspect score and this is what is used to generate the heatmap, and then any adj. p values for the pathway enrichment/aspect generation. Is this something that I could access from the pagoda environment with the correct commands?

You'll need to do it via R, using the pagoda2 object.

> Completely understand if you're not looking to implement something like this, but if it's something I could handle on my end with a little work that would be great.

Yes, I think this is possible. (I don't think I'll have time to write something up to help out, though.)

What you are describing is writing R code to take the Pagoda2 object, access the calculated values, and then possibly do a few more DE calculations between clusters. You'll be able to access the values you're interested in, and then plot these.

I hope this is helpful. Thanks, Evan

chrisley90 commented 2 years ago

Hey Evan,

Thanks for the help. Agreed on the GO term pathway analysis; if we didn't have the biology to back up our claims I would be much more skeptical. As it is, though, these are, like you said, useful in pointing in a particular direction.

Appreciate the direction and clarification. I'll see if I can't figure it out on my own, just knowing that it is possible is a helpful starting point. I'll mark this as closed, and should I solve the problem I'll update it here for anyone who may need that information in the future.

Chris

chrisley90 commented 2 years ago

I'm not sure if this will be useful to anyone, but in the interest of closure, I found the information in these two places. First, the aspect scores can be extracted from the following location (assuming your pagoda2 object is named r): r$misc$pathwayOD$xv

And other pathway overdispersion analysis information can be found using: r$misc$pathwayODInfo
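As a rough sketch of how those two locations could be turned into tables for figures, something like the following might work. To keep it runnable without a dataset, it builds a mock `r` as a plain list; with a real object you would skip that part. The layout of `xv` (aspects in rows, cells in columns) and the columns of `pathwayODInfo` are assumptions here, so check them with `str()` on your own object first. The file names are arbitrary.

```r
# Mock stand-in for a Pagoda2 object that has already been through
# testPathwayOverdispersion(); with a real object, use your own `r` instead.
set.seed(1)
r <- list(misc = list(
  pathwayOD = list(
    # ASSUMPTION: xv is a numeric matrix with aspects in rows, cells in columns
    xv = matrix(rnorm(6), nrow = 2,
                dimnames = list(c("aspect1", "aspect2"),
                                c("cellA", "cellB", "cellC")))
  ),
  # ASSUMPTION: pathwayODInfo is a data frame; these columns are hypothetical
  pathwayODInfo = data.frame(name = c("geneset1", "geneset2"),
                             z = c(5.2, 3.1))
))

# Inspect the real structure before relying on any of the assumptions above
aspect_scores <- r$misc$pathwayOD$xv
str(aspect_scores)
str(r$misc$pathwayODInfo)

# Reshape to one row per cell and one column per aspect
per_cell <- as.data.frame(t(aspect_scores))
per_cell <- cbind(cell = rownames(per_cell), per_cell)
rownames(per_cell) <- NULL

# Write both tables out (to a temp directory here; pick your own path)
scores_file <- file.path(tempdir(), "aspect_scores_per_cell.csv")
info_file <- file.path(tempdir(), "pathway_od_info.csv")
write.csv(per_cell, scores_file, row.names = FALSE)
write.csv(r$misc$pathwayODInfo, info_file, row.names = FALSE)
```

The per-cell table can then be joined to cluster labels or embedding coordinates by cell name for plotting.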

Once again thanks Evan for pointing me in the right direction, it would have taken infinitely longer without your help!

kharchenkolab / pagoda2

Extract pathway overdispersion data on a per cell basis #136