Questions on integration and interpretation

JABioinf commented 5 months ago

Thank you for developing Pando, it's a really nice and convenient tool for multiomic data analysis!

Following up on the questions in #48 and #52 about input data for multiple samples: which assays do you recommend using in Pando with an integrated object of multiple multiome libraries from different samples generated by Signac/Seurat pipelines (e.g. rlsi for ATAC, and sctransform -> anchor-based integration for RNA, etc.)? Can Pando take the "SCT" or "integrated" assays, for instance, or do you recommend always starting from the log-normalized "RNA" and "Peaks" assays? Will Pando be affected by sequencing depth differences between samples? Are there any preprocessing steps you recommend in that case?

I also had some questions about the output of the coef() function once the GRN is obtained: how is the "corr" column computed? Is it the correlation between "tf" and "target" gene expression? If so, how do you interpret the regulatory action (activation/repression) when the signs of "corr" and "estimate" differ? Should we exclude those cases from the network? Notably, for the UMAP layout in get_network_graph() you provide the "weighted" option, where the sign of the correlation seems to be used to decide the direction instead of the estimated coefficient.

Thank you for your help!

joschif commented 4 months ago

Hi @JABioinf, I currently recommend using log-normalized or sctransform-normalized data for RNA and TF-IDF-normalized data for ATAC. I would not recommend using integrated 'corrected' counts, since they might bias the result and it's hard to predict how the integration changed them. That said, Pando might indeed be susceptible to differences in depth, so proper normalization is crucial. We are thinking about adding ways to better control for this in the models, but haven't done so yet.
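To illustrate why normalization matters when depth differs, here is a minimal sketch of log-normalization (divide each count by the cell's total, multiply by a scale factor, then take log1p), which is the idea behind Seurat's LogNormalize method. This is a toy Python illustration, not Pando or Seurat code; the function name `log_normalize` and the example counts are hypothetical.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Depth-normalize one cell's counts, then apply log1p.

    Hypothetical helper for illustration only.
    """
    total = sum(counts)
    if total == 0:
        # An empty cell has nothing to normalize.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]


# Two made-up cells with the same composition but 10x different depth.
shallow_cell = [1, 3, 6]
deep_cell = [10, 30, 60]

norm_shallow = log_normalize(shallow_cell)
norm_deep = log_normalize(deep_cell)

# After normalization the two cells are indistinguishable,
# so depth alone no longer drives differences between them.
print(norm_shallow)
print(norm_deep)
```

Note that this only removes the global scaling effect of depth; it does not address dropout or other depth-related effects.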

For your second question, the correlation is simply the global correlation between target and TF. This can diverge from the actual coefficient in the linear model, because the latter includes interactions with peak accessibility and other TFs. In general, I would interpret the coefficient, rather than the global corr, as the indicator of regulatory interaction. The plotting functions use the correlation mostly because weighting by coexpression results in quite pretty layouts, since the GRN graph can be quite sparse.
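A toy example of how the sign of a global correlation can differ from the sign of a model coefficient: below, a made-up target is defined as `-1 * tf_a + 3 * tf_b`, where `tf_a` and `tf_b` are strongly co-expressed. This is a plain two-predictor least-squares fit in Python, not Pando's actual model (which also involves peak accessibility); all names and values are hypothetical.

```python
import math


def centered(xs):
    m = sum(xs) / len(xs)
    return [x - m for x in xs]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pearson(a, b):
    ca, cb = centered(a), centered(b)
    return dot(ca, cb) / math.sqrt(dot(ca, ca) * dot(cb, cb))


def ols_two_predictors(x1, x2, y):
    """Least-squares slopes for y ~ x1 + x2 (intercept handled by centering)."""
    c1, c2, cy = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
    s1y, s2y = dot(c1, cy), dot(c2, cy)
    det = s11 * s22 - s12 * s12  # non-zero as long as x1 and x2 are not collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Made-up expression values: tf_b closely tracks tf_a.
tf_a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
tf_b = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
# By construction, tf_a represses and tf_b activates the target.
target = [-a + 3 * b for a, b in zip(tf_a, tf_b)]

corr_a = pearson(tf_a, target)                            # marginal correlation: positive
coef_a, coef_b = ols_two_predictors(tf_a, tf_b, target)   # coef_a is negative (about -1)

print(f"corr(tf_a, target) = {corr_a:.3f}")
print(f"coef tf_a = {coef_a:.3f}, coef tf_b = {coef_b:.3f}")
```

Here the positive correlation of `tf_a` with the target comes entirely from its co-expression with `tf_b`, while the coefficient recovers the repressive effect that was built into the data.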

Cheers, Jonas

yojetsharma commented 4 months ago

A follow-up question: is there a way to obtain a differential gene regulatory network between control and patient samples (merged and analysed)?

joschif commented 4 months ago

We did something similar in our paper (Fig. 2), where we obtained differential networks between brain regions. Basically, you infer a network based on the entire dataset or just the control and then prune it based on differential accessibility (and/or expression). The procedure is not implemented in Pando, though, so you would have to do that yourself based on the graph that Pando produces.
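A rough sketch of what such a pruning step could look like, assuming the network has been exported as a table of edges and that sets of differentially accessible regions and differentially expressed genes were computed separately. This is plain Python, not Pando code; the function `prune_network`, the edge fields (`tf`, `target`, `region`, `estimate`) and all values are hypothetical placeholders, and this variant requires both criteria when a gene set is given.

```python
def prune_network(edges, diff_regions, diff_genes=None):
    """Keep edges whose region is differentially accessible and, if
    diff_genes is given, whose target is also differentially expressed.

    Hypothetical helper; edges are dicts with 'tf', 'target', 'region'
    and 'estimate' keys.
    """
    kept = []
    for edge in edges:
        if edge["region"] not in diff_regions:
            continue
        if diff_genes is not None and edge["target"] not in diff_genes:
            continue
        kept.append(edge)
    return kept


# Made-up edges standing in for an exported network table.
edges = [
    {"tf": "TF1", "target": "GENE_A", "region": "chr1-100-200", "estimate": 0.8},
    {"tf": "TF1", "target": "GENE_B", "region": "chr1-500-600", "estimate": -0.3},
    {"tf": "TF2", "target": "GENE_C", "region": "chr2-100-200", "estimate": 0.5},
]

# Made-up results of separate differential accessibility / expression tests.
diff_regions = {"chr1-100-200", "chr2-100-200"}
diff_genes = {"GENE_A"}

by_accessibility = prune_network(edges, diff_regions)
by_both = prune_network(edges, diff_regions, diff_genes)

print([e["target"] for e in by_accessibility])
print([e["target"] for e in by_both])
```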

quadbio / Pando

Questions on integration and interpretation #55