Can users specify REF/ALT allele column names for topr?

jielab commented 9 months ago

Hi, there:

the topr package is really nice.

I have a lot of GWAS which have columns such as SNP A1 A2 BETA P. However, it seems that topr demands the input file to have REF and ALT column name instead. So, i had to manually change my large number of huge GWAS files.

Is there a way to allow uses to customize REF/ALT alleles, such as "--ref-column A1 --alt-column A2"?

Thanks!

Jie

totajuliusd commented 9 months ago

Hi Jie,

Currently there are no arguments in topr to do that. But you could use the dplyr::rename function to rename these columns when you provide them as input, like for example:

yourGWASfile %>% rename(REF=A1,ALT=A2)

Are you using the effectplot and/or the get_snpset functions? So for example, if you have two GWAS files named gwas1 and gwas 2, you can do:

effectplot(list(gwas1 %> %>% rename(REF=A1,ALT=A2), gwas2 %>% rename(REF=A1,ALT=A2)))

or/and

get_snpset(gwas1 %>% rename(REF=A1,ALT=A2), gwas2 %>% rename(REF=A1,ALT=A2))

Hope that helps, Tota

jielab commented 9 months ago

Dear Tota:

Thank you very much for your help!

I am trying to plot the biggest height GWAS (Nature 2022) for 5 different ancestries, posted at https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files, as shown in the following screenshot.

Since topr seems most fit to plot no more than 4 GWAS (2 on top and 2 at bottom), I used my following code to plot 4.

Below is the plot generated by topr.

Here are my questions:

As you see, this plot looks very messy. I know this is due to the fact that there are too many significnat associations for height. But is there a way to make this plot look better, through topr?
The smallest P-value of my data is 4.940656e-324, there is no P-value of 0. why the output says "60 zero-value p-values found in the input dataset and removed". More importanly, why the output plot shows the most significant P-value up to 40?

Thank you very much & best regards, Jie

jielab commented 9 months ago

Dear Tota:

You suggested me to use the following code: effectplot(list(gwas1 %> %>% rename(REF=A1,ALT=A2), gwas2 %>% rename(REF=A1,ALT=A2))) Is there a way that I only write rename(REF=A1,ALT=A2) once, instead of multiple times?

Also, another question, I now want to use the locuszoom() function from your topr package. For me to specify a "locus", should I do something like below? locuszoom(list(gwas1 %>% subset(CHR==1 & POS >1000 & POS <2000), gwas2 %>% rename(CHR==1 & POS >1000 & POS <2000)))

Thank you very much & best regards, Jie

totajuliusd commented 9 months ago

Hi again,

At the moment topr does not seem to be able to plot p-values below 1e-305 and that is something I need to look into.

I do not understand why your y-axis only goes up to 40. I tried to do the same using the following code:

Read the data into the variables datEUR,datAA,datEAS and datSAS as follows: datEUR <- read.delim("GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EUR", header=T)

do the same for AA,EAS and AS
Create a list containing the data. Filter out NA's, as well as zero, very large and small p-values (P<1e-05 and P>1e-305)
```
datlist <- list("EUR" = datEUR, "AA"=datAA, "EAS"= datEAS, "SAS"=datSAS)
datfilt =list()
```

for(name in names(datlist)){ datfilt[[name]] <- datlist[[name]] %>% na.omit() %>% filter(P<1e-05, P!=0,P>1e-305) }


3. And plot:

manhattan(datfilt, ntop=2, alpha=c(0.3,1,1,1), legend_labels=names(datfilt), color=c("gray","darkblue",get_topr_colors()[2], get_topr_colors()[3]), size=c(0.5,1,1,1), annotate=c(1e-305,1e-50,1e-50, 1e-12), angle=90, nudge_y=c(0,50,40,60), nudge_x=c(0,0,0,-7000000))


This should look like this:

![preview_giant4](https://github.com/totajuliusd/topr/assets/106526694/969b7392-844f-472b-9dc6-d489d65301d7)

Personally, I would exclude EUR from the plot and plot it separately since it has so many significant hits.

manhattan(datfilt[-1], ntop=1, legend_labels=names(datfilt[-1]), annotate=c(1e-50,1e-50, 1e-10), angle=90, nudge_y=c(30,20,30), nudge_x=c(0,0,-30000000))



![giant_preview3](https://github.com/totajuliusd/topr/assets/106526694/a5e20352-88eb-4201-a641-2bd017988d20)

totajuliusd commented 9 months ago

Regarding your second question, I would use the the same for loop syntax as in the example above to rename columns in multiple dataframes e.g.:

for(name in names(datlist)){
   datfilt[[name]] <- datlist[[name]] %>% na.omit() %>% rename(REF=A1,ALT=A2)
}

To be able to use the locuszoom function you need to have R2 already pre-calculated in the data.frame, otherwise it wont work.

jielab commented 9 months ago

Dear Tota:

Thank you so much!

I now used the following code and it works!

dir="D:/data/gwas/main/"
races=c("EUR", "AFR", "EAS", "SAS")
for (race in races) {
    datmp <- read.table(paste0(dir, 'height.', race, '.gz'), header=TRUE)[,-c(1,7,11)] %>% 
        mutate(P=ifelse(P<1e-50, 1e-50, P)) %>% filter(P<1e-05, P!=0) # %>% na.omit() 
        names(datmp) <- c("SNP", "CHR", "POS", "REF", "ALT", "BETA", "SE", "P")
    print(paste(race, nrow(datmp), ncol(datmp)))
    eval(parse(text=paste0('dat_', race, '<- datmp')))
}
datlist <- list(dat_EUR, dat_AFR, dat_EAS, dat_SAS)
topr::manhattan(datlist, ntop=2, alpha=c(0.3,1,1,1), legend_labels=races, size=c(0.5,1,1,1))

Below is the plot that I got.

I tried to generate the similar plot as that in the UK Biobank Nature paper (also on Height) (https://www.nature.com/articles/s41586-018-0579-z/figures/4). I also capped P-value at 1e-50, as that was done in the UK Biobank paper.

As you see, the plot that I generated using topr is not quite as good-looking as the one in the UK Biobank paper. Can you please help to improve my above script so that users could use topr to generate publication quality manhattan plots?

For the above Height GWAS example data, can you please also give an example locuszoom plot, assuming that I want to plot the ABO gene region?

Hopefully my questions and suggestions could improve your amazing topr a little bit further.

Thank you so much again & Best regards, Jie

totajuliusd commented 9 months ago

Dear Jie,

The figure you refer to shows giant height data from 2014, which has much fewer significant hits than the 2022 data, and they are only displaying chromsome 2 (not the whole genome).

To get a similar plot in topr, you can get the giant data from 2014 and then plot just one chromosome:

manhattan(list(giant.height.2014, gwascat.snps), 
          chr="chr2", 
          color=c("#165CAA","red"),shape=c(19,8), size=c(0.45,1.1), 
          legend_labels=c("GIANT meta-analysis (2014)","GWAS catalogue" ), 
          legend_position = "top", sign_thresh_linetype = "solid", sign_thresh_size = 0.2)

gwas_giant2014

totajuliusd commented 9 months ago

You cannot do a locuszoom plot unless you have precalculated marker correlations (R2). You can however, do a regionplot for ABO as follows:

regionplot(dat, gene="ABO")

jielab commented 8 months ago

Dear Tota:

If you scroll down a little bit, you will see that the above Figure did also display the UKB data, published in 2018.

If regionplot() could not display LD, its usefulness will be limited, because people have been used to seeing a regional plot with LD information. These days, people pre-calculated LD and posted them on web. For example, we could download the pre-calculated LD from here https://github.com/getian107/PRScs. Now, with pre-calculated LD files like these ones, how could I feed them to topr's locusplot() or regionplot() ?

Also, for get_lead_snps(), it would be good to incorporate LD threshold as well, just like the PLINK clump() command. Otherwise, it simply get_top_snp from each window size (such as 1Mb).

Thank you & best regards, Jie

totajuliusd commented 8 months ago

Dear Jie,

It sounds like locuszoom (http://locuszoom.org/) would suit your needs better. topr is not meant to be another locuszoom tool, it includes a "lightweight" locuszoom like function for those who have the pre-calculated pair-wise correlation scores for their markers.

The purpose of the regionplot function is to display smaller genetic regions and marker positions in relation to underlying and nearby genes. And as for the get_lead_snps() function, it is just supposed to be as simple as that.... only to get the top snps within the provided window size. topr is not the tool to use if you want marker correlation scores and LD information.

Bw, Tota

totajuliusd commented 8 months ago

And see the inbuilt R2_CD_UKBB dataset to see what your input file for the locuzsoom plot function should look like

> R2_CD_UKBB %>% head()
  CHROM      POS         ID           P        R2
1  chr1 67216513 rs11576518 8.03684e-20 1.0000000
2  chr1 67253446  rs1343151 3.12789e-19 0.4048777
3  chr1 67264372  rs6669582 5.65295e-19 0.4008156
4  chr1 67214307  rs2019262 1.10465e-18 0.9775277
5  chr1 67215986  rs7517847 1.30571e-18 0.9785166

In this example, rs11576518, is the lead snp, and correlation scores to neigbouring markers are in the R2 column.

jielab commented 8 months ago

Dear Tota:

Thank you very much! Usually the LD matrics has SNP paris on each row. Now, your example LD file assumes that the first SNP, or the SNP with the lowest P-value, or the SNP with R2=1, as one side of the pair, and the other SNPs in the rows as the other side of the pair.

Best regards, Jie

jielab commented 8 months ago

Dear Tota:

I know that there is locuszoom software. Although it is great, but it is NOT in R, and it could lot overlay multiple plots together as topr do.

For example, below is the user interface for locuszoom. It could do things on genome-wide scale, but it is not easy to embed this tool into my bash pipeline script, for exampe.

Best regards, Jie

totajuliusd commented 8 months ago

Dear Jie,

Ok I see, how do you normally extract information from your LD matrix file? For example if you want the correlation score between rs11576518 and rs6669582?

Bw, Tota

totajuliusd / topr

Can users specify REF/ALT allele column names for topr? #48

do the same for AA,EAS and AS