Geuvadis example - Githubissues

aldosc commented 1 year ago

Hi Lingfei,

I have been reviewing the papers you have published on findr. I really like the concept of using causal anchors to derive causal networks from human genetics and transcriptomics data. I wonder if it is possible to adapt the example provided for the Geuvadis dataset (https://doi.org/10.1007/978-1-4939-8882-2_4) to other scenarios. Is there any specific requirement or format for the geuvadis$dgt matrix? Based on the publication, my understanding is that this matrix contains information on cis-eQTLs and individuals. Is this correct? Is it possible to use other types of data to derive this matrix? For example, could data from TCGA or CCLE be used to create a matrix of mutations (1=True, 0=False) by patients/cells, and could findr then derive causal networks by integrating that matrix with gene expression?
Looking forward to hearing from you.

lingfeiwang commented 1 year ago

Hi aldosc,

Thank you for your questions.

You can adapt the example easily to other datasets. The dgt matrix simply follows the same format as dt and has shape (n_sample,n_gene). Here each column $i$ should be a causal anchor or instrumental variable (https://en.wikipedia.org/wiki/Instrumental_variables_estimation) for gene $i$ in dt. Ideally we want it to be independent of the other columns in dgt (e.g. no linkage disequilibrium or significant Pearson correlation), and to have no direct effect other than changing gene $i$ in dt. Although these assumptions are not always satisfied in practice, Findr's statistical tests are robust to some violations. For example, dgt can be cis-eQTLs, mutations, or manually assigned perturbations. When there are multiple causal anchors available, our experience was to use the one with the strongest effect on gene $i$. If there is no significant causal anchor, you should place this gene in dt2 instead.

Therefore, you can surely use binary mutation data for dgt. For each gene, I would first select its causal anchors as the nearby or intended mutations that have strong effects on this gene's expression. Then, you can filter these causal anchors to have no strong correlation with any causal anchor of any other gene. After that, you may select the strongest causal anchor for each gene as dgt, the corresponding genes as dt, and the genes without causal anchors as dt2. These steps require thresholds that are dependent on your dataset. More stringent thresholds provide better selection at the cost of fewer causal anchors. You may try several thresholds to find the balance between network quality and size.
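The three selection steps above can be sketched as follows. This is only an illustration under stated assumptions, not code from Findr: the function names, the dictionary-based input layout, the use of absolute Pearson correlation as the measure of "effect", and both threshold values are hypothetical and would need to be adapted to your dataset.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def select_anchors(candidates, expression, min_effect=0.5, max_anchor_corr=0.8):
    """Pick one causal anchor per gene (hypothetical helper, not part of Findr).

    candidates: gene -> {mutation name -> 0/1 vector over samples}
    expression: gene -> expression vector over the same samples
    Returns (anchors, dt_genes, dt2_genes), where anchors maps gene -> mutation name.
    """
    # Step 1: keep candidate mutations with a strong effect on their own gene.
    # ASSUMPTION: "effect" is measured as |Pearson correlation| with expression.
    strong = {}
    for gene, muts in candidates.items():
        for name, vec in muts.items():
            effect = abs(pearson(vec, expression[gene]))
            if effect >= min_effect:
                strong.setdefault(gene, {})[name] = (effect, vec)

    # Step 2: drop candidates strongly correlated with any candidate of another gene.
    filtered = {}
    for gene, muts in strong.items():
        for name, (effect, vec) in muts.items():
            clash = any(
                abs(pearson(vec, other_vec)) >= max_anchor_corr
                for other_gene, other in strong.items()
                if other_gene != gene
                for _, other_vec in other.values()
            )
            if not clash:
                filtered.setdefault(gene, {})[name] = (effect, vec)

    # Step 3: the strongest surviving candidate per gene becomes its column in dgt.
    anchors = {
        gene: max(muts.items(), key=lambda kv: kv[1][0])[0]
        for gene, muts in filtered.items()
    }
    dt_genes = [g for g in expression if g in anchors]       # genes with an anchor
    dt2_genes = [g for g in expression if g not in anchors]  # genes without one
    return anchors, dt_genes, dt2_genes


# Toy data: 3 genes, 8 samples, binary mutation calls.
expression = {
    "A": [1, 1, 1, 1, 5, 5, 5, 5],
    "B": [2, 2, 6, 6, 2, 2, 6, 6],
    "C": [3, 1, 4, 1, 5, 9, 2, 6],
}
candidates = {
    "A": {"mutA1": [0, 0, 0, 0, 1, 1, 1, 1], "mutA2": [0, 0, 0, 1, 1, 1, 1, 1]},
    "B": {"mutB1": [0, 0, 1, 1, 0, 0, 1, 1]},
    "C": {"mutC1": [0, 1, 0, 1, 0, 1, 0, 1]},
}
anchors, dt_genes, dt2_genes = select_anchors(candidates, expression)
print(anchors, dt_genes, dt2_genes)
```

Raising `min_effect` or lowering `max_anchor_corr` makes the selection more stringent and leaves fewer genes in dt, which is the trade-off between network quality and size mentioned above.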

Feel free to post any additional questions.

Lingfei

aldosc commented 1 year ago

Hi Lingfei,

Thank you for your reply. I have one more question. What about the case where there is no causal anchor matrix, for example when dt and dt2 are the same matrix? Is the resulting network then a form of co-expression network based on correlation?

Thanks,

Aldo

lingfeiwang commented 1 year ago

Hi Aldo,

Sure. Without causal anchors, we can still infer the gene network, but the inferred causal relations are more error-prone. We have a follow-up study on what to do and what to avoid at https://www.frontiersin.org/articles/10.3389/fgene.2019.01196/full. It starts with a co-expression network from findr and ends with a sparse Bayesian network from lassopv, a lasso-based variable selection method.

Lingfei

aldosc commented 1 year ago

Hi Lingfei,

Sorry for the late reply. Thanks for sharing the information and the papers. I just wanted to follow up on the last part of your reply, specifically the sparse Bayesian network. What's the best way to link the results of findr with lassopv? Is there any tutorial or more detailed documentation? What's the difference between the network obtained by applying the netr_one_greedy function in findr and the one obtained from lassopv?

Best,

Aldo

lingfeiwang commented 1 year ago

Hi Aldo,

No problem.

The findr steps up to netr_one_greedy have examples, such as https://github.com/lingfeiwang/findr-bin/blob/master/EXAMPLES. For other interfaces, see https://raw.githubusercontent.com/lingfeiwang/findr/master/doc.pdf. These steps give a gene ordering, which is equivalent to a full DAG with n*(n-1)/2 edges for n genes.
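To illustrate why a gene ordering corresponds to a full DAG with n*(n-1)/2 edges, here is a minimal sketch. The gene names and the `full_dag_from_order` helper are made up for illustration; nothing here calls Findr.

```python
from itertools import combinations


def full_dag_from_order(order):
    """Return every edge (earlier gene, later gene) implied by an ordering."""
    # combinations() preserves input order, so each pair points "downstream".
    return list(combinations(order, 2))


order = ["G3", "G1", "G4", "G2"]  # hypothetical ordering, earliest gene first
edges = full_dag_from_order(order)
n = len(order)
print(len(edges), n * (n - 1) // 2)
```

Every gene has an edge to each gene ranked after it, so no edge can point backwards and the graph is acyclic by construction.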

The second step is lassopv, which computes the p-value for each edge. The model is $x_i=\sum_{j\in \mathrm{Pa}_i}\beta_{ji}x_j$ with L1 regularization on $\beta$, where $x_i$ is gene $i$'s expression. There is an independent example in the lassopv manual: https://cran.r-project.org/web/packages/lassopv/lassopv.pdf. You can then obtain a sparse DAG with a p-value cutoff, or similarly a q-value cutoff, etc.
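For reference, the L1-regularized fit behind the model above can be written in the standard lasso form. The sample index $s$, the sample count $N$, the $1/(2N)$ scaling and the symbol $\lambda$ are conventional choices assumed here, not taken from the lassopv documentation:

$$\hat\beta_{\cdot i}=\arg\min_{\beta_{\cdot i}}\ \frac{1}{2N}\sum_{s=1}^{N}\Big(x_i^{(s)}-\sum_{j\in\mathrm{Pa}_i}\beta_{ji}\,x_j^{(s)}\Big)^2+\lambda\sum_{j\in\mathrm{Pa}_i}|\beta_{ji}|$$

A larger $\lambda$ forces more of the $\beta_{ji}$ to exactly zero, which is what makes the resulting network sparse.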

We did not produce a full tutorial because the lassopv step is really simple.

Hope that answers your questions.

Lingfei

aldosc commented 1 year ago

Hi Lingfei,

Thank you for your reply. In the lassopv documentation, the example is: pv=lassopv(x,y). If I'm trying to apply lassopv to the findr output, what do x and y correspond to?

Best,

Aldo

lingfeiwang commented 1 year ago

Sure. For the example above, y is the vector $x_i$, i.e. gene $i$'s expression, and x is the matrix of all $x_j$ where $j\in\mathrm{Pa}_i$. You can do it separately for each gene $i$. $\mathrm{Pa}_i$ is derived from the gene ordering given by Findr.

Just let me know if anything is unclear.

Lingfei

aldosc commented 1 year ago

Hi Lingfei,

Thanks for your reply, and sorry if the answers to my questions are really obvious, but I still don't get it.

So, x should correspond to the output matrix from findr. For example, if I run findr.pij_rank on the expression data from Geuvadis (geuvadis$dt2, provided in findr), I'll end up with a 3000 x 3000 matrix. This will be my x for lassopv, correct? Then, how should I construct the y vector, or what goes into it, to run lassopv for this example?

Thanks!

Aldo

lingfeiwang commented 1 year ago

Hi Aldo,

Please look at the example in https://github.com/lingfeiwang/findr-R/blob/master/findr/man/findr-package.Rd. Within Findr, you can replace the first step with findr.pij_rank if there is no genotype information. The second step within Findr gives a ranking of genes. Then, for each gene, you need to run the above lassopv step separately. y is the expression vector of this gene. x is the expression matrix of all genes ranked before it. These are the typical inputs for lasso regression.

lassopv does not use the output of findr.pij_rank.
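As a rough sketch of how x and y could be assembled for each gene from the expression data and the ranking, consider the following. The `lasso_inputs` helper, the gene names and the dictionary layout are hypothetical; the code only builds the inputs and does not call lassopv (which is an R package) or Findr.

```python
def lasso_inputs(expression, order):
    """Yield (gene, parents, x, y) for every gene that has candidate parents.

    expression: gene -> list of expression values, one per sample
    order:      genes sorted by rank, earliest first
    x is a list of sample rows with one column per gene ranked before `gene`;
    y is the expression vector of `gene` itself.
    """
    for pos, gene in enumerate(order):
        parents = order[:pos]
        if not parents:
            continue  # the top-ranked gene has no candidate regulators
        y = expression[gene]
        x = [[expression[p][s] for p in parents] for s in range(len(y))]
        yield gene, parents, x, y


# Toy data: 3 genes, 3 samples; the ranking is made up for illustration.
expression = {
    "G1": [1.0, 2.0, 3.0],
    "G2": [4.0, 5.0, 6.0],
    "G3": [7.0, 8.0, 9.0],
}
order = ["G2", "G3", "G1"]
results = list(lasso_inputs(expression, order))
for gene, parents, x, y in results:
    print(gene, parents, len(x), len(x[0]))
```

Each (x, y) pair would then be passed to a separate lasso fit, such as pv=lassopv(x,y) in R, giving one p-value per candidate parent of that gene.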

Lingfei

aldosc commented 1 year ago

Hi Lingfei,

Thank you so much for your reply. I think I got it. All the information required to run lassopv comes from the gene expression matrix (x and y). What findr provides to lassopv is the gene ordering/ranking, which determines which genes go into x for each y. I really appreciate all your support.

Thanks!

lingfeiwang commented 1 year ago

Glad it helped!

lingfeiwang / findr

Geuvadis example #1