Understanding the inputs

gouthamatla commented 8 months ago

Hi,

Thanks for this really great paper. I would like to run the network expansion, module detection and trait correlations on my own set of genes. Looking at the script Script_1_SEED.r, the input seems clear: it is a list of genes for a given disease. I also noticed that the same gene is repeated for a given EFO term. E.g.,

> all.gene.gwas=readRDS("tables_expansion/all.gene.gwas_filter_GP.rds")
> head(all.gene.gwas[all.gene.gwas[, "disease"] == "EFO_0000400", ])
     gene              padj         disease      
[1,] "ENSG00000130208" "51.6449392" "EFO_0000400"
[2,] "ENSG00000108175" "62.8567219" "EFO_0000400"
[3,] "ENSG00000108175" "71.6188431" "EFO_0000400"
[4,] "ENSG00000108175" "73.3210683" "EFO_0000400"
[5,] "ENSG00000101076" "87.0205224" "EFO_0000400"
[6,] "ENSG00000136158" "62.093848"  "EFO_0000400"

Presumably this is because the same gene is linked to multiple traits within the EFO_0000400 umbrella? What are these padj values? Are they some sort of Open Targets score?

However, when I looked at the input for the IBD section, it is a list of genes for each IBD target gene. This is slightly confusing.

> all.gene.gwas=readRDS("tables_IBD/set1/all_gene_gwas_FILTER_set1.rds")
> head(all.gene.gwas)
     gene              padj disease
[1,] "ENSG00000085978" "10" "ADCY7"
[2,] "ENSG00000112182" "10" "ADCY7"
[3,] "ENSG00000187796" "10" "ADCY7"
[4,] "ENSG00000176920" "10" "ADCY7"
[5,] "ENSG00000109758" "10" "ADCY7"
[6,] "ENSG00000101076" "10" "ADCY7"

I am wondering if there is better documentation on how to start from a given set of gene-disease pairs, e.g.:


gene,padj,disease
ENSG00000135100,10,Dis1
ENSG00000139515,10,Dis1
ENSG00000117707,10,Dis1
ENSG00000016082,10,Dis1
ENSG00000135100,10,Dis2
ENSG00000117707,10,Dis2

Is there a way to get pleiotropic gene modules without using the EFO terms? I have well-known GWAS IDs that I would like to use.

ibarrioh commented 8 months ago

Hi, concerning the first question (the same gene appearing several times for one disease), it is exactly what you guessed. This can also happen if you download data from different repositories, where you can have more than one gene->disease association with different scores (e.g. from different studies). The first function in the network method makes sure that, prior to propagation, all redundancies are resolved and the best-scoring entry is kept. You have to be careful when defining what "best" means and modify the code accordingly: if the score is a p value, for example, you can just -log transform it.

The column padj corresponds to the numeric value that measures the gene->disease association. In your first example this is the score the genetic portal provides; in the second, as there is no score measuring this, I imputed a value of 10 for every row.

Concerning your last question, you can calculate pleiotropic modules using whatever genes and diseases you want. Of course, the gene IDs have to be the same for your input and the interactome, which can also be modified. I hope you find these lines useful, best, Inigo
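
To make the answer above concrete, here is a minimal R sketch of preparing a custom input table. It is not taken from the repository scripts. It assumes that a higher padj means a stronger association, it adds one duplicate row (score 4) purely for illustration, and `interactome.genes` is a hypothetical stand-in for the node IDs of whatever network is used.

```r
# Custom gene-disease pairs in the gene,padj,disease layout from the question.
# Dis1/Dis2 are placeholder labels; the row with score 4 is an added duplicate.
pairs <- read.csv(text = "gene,padj,disease
ENSG00000135100,10,Dis1
ENSG00000139515,10,Dis1
ENSG00000117707,10,Dis1
ENSG00000016082,10,Dis1
ENSG00000135100,4,Dis1
ENSG00000135100,10,Dis2
ENSG00000117707,10,Dis2", stringsAsFactors = FALSE)

# If padj held p values (smaller = stronger), transform first so larger = stronger:
# pairs$padj <- -log10(pairs$padj)

# Resolve redundancies: keep the highest score for each gene-disease pair.
dedup <- aggregate(padj ~ gene + disease, data = pairs, FUN = max)

# Hypothetical interactome node IDs; in practice take them from the network itself.
interactome.genes <- c("ENSG00000135100", "ENSG00000139515", "ENSG00000117707")

# Report input genes that are absent from the interactome (ID mismatch check).
missing <- setdiff(dedup$gene, interactome.genes)

# Same three-column character matrix layout as the .rds files shown above.
all.gene.gwas <- cbind(gene = dedup$gene,
                       padj = as.character(dedup$padj),
                       disease = dedup$disease)
```

The resulting matrix could then be written with `saveRDS()` and read back in place of the shipped tables; whether the downstream scripts need further adjustments is not covered here.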

gouthamatla commented 8 months ago

Thanks, it definitely answers my questions. Just one thing: in the IBD example input, we have genes like ADCY7 and TYK2 instead of a disease ID. I am wondering if that was intentional?

ibarrioh commented 8 months ago

Yes, the IBD analysis is slightly different from the rest of the paper. Briefly, we are measuring the amount of signal a given starting hit gets when it is left out (the gene in the disease column is the one that was left out in each iteration).
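
The leave-one-out layout described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the paper: the seed list is hypothetical (IDs reused from the examples above), the score of 10 mirrors the imputed value, and the left-out gene is labelled by its Ensembl ID here, whereas the shipped IBD table uses gene symbols such as ADCY7.

```r
# Hypothetical seed set for one disease.
seeds <- c("ENSG00000085978", "ENSG00000112182", "ENSG00000187796")

# For each seed, build one block in which the remaining seeds are the inputs
# and the disease column carries the ID of the seed that was left out.
loo <- do.call(rbind, lapply(seeds, function(left.out) {
  cbind(gene = setdiff(seeds, left.out),
        padj = "10",          # imputed score, as in the IBD example
        disease = left.out)   # label of the held-out gene
}))
```

Each "disease" in this table is then one iteration, and the propagated score of the held-out gene in that iteration indicates how much signal it recovers from the other seeds.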

gouthamatla commented 8 months ago

Thanks. Very helpful.

ibarrioh / Network_expansion