totajuliusd / topr

topr is a collection of plotting functions for visualizing and exploring genetic association results. Association results from multiple phenotypes can be viewed simultaneously, over the entire genome (Manhattan plot) or in the more detailed regional view.
Other
49 stars 13 forks source link

get_lead_snps region_size parameter #65

Closed kscott-1 closed 5 months ago

kscott-1 commented 6 months ago

Hi there, great package you've built here. I have a bit of an issue with how the region_size parameter is designed to work. The way it is set up, the lead snps result solely from blocks of the region size across the genome. I do not feel this is the best approach.

This allows for snps to be any arbitrary number of bp away from each other. If region_size=100000 with block 1 -> pos=1-100000 and block2 -> pos=100001-200000, the function will look for snps passing the sig threshold and then filter the lead marker in each block. The problem is then say you had results like these:

CHR POS PVAL
chr1 1000 0.00001
chr1 2500 0.035
chr1 50000 0.8
chr1 97000 0.000001
chr1 101000 0.00002
Now all of the sudden get_lead_snps will return this: CHR POS PVAL
chr1 97000 0.000001
chr1 101000 0.00002

If I specify a region size of 100000 bp, I would not want any markers within +/-50000 bp of either side of a resultant marker to also be output. This is a direct problem when annotating a manhattan because the annotate function directly calls the get_lead_snps function with this region_size logic.

totajuliusd commented 6 months ago

Hi and thank you for your message. I am aware of the issue you are pointing out, but for the work I have done so far, this functionality has sufficed with some workarounds.

I usually use the manhattan function to get a quick overview of the association peaks and their nearest genes. When I encounter cases like you describe, where two markers that are very close to one another get labelled, I usually solve it by increasing the region_size argument. If that doesnt solve it, I create a separate list containing the markers I want to label to have more control over what gets displayed. For example,

lead.snps <- get_lead_snps() %>% annotate_with_nearest_gene()
#remove or add snps to the lead.snps list
#and then plot
manhattan(list(CD_UKBB,  lead.snps), annotate=c(1e-100, 1e-9), color=c("darkblue","darkblue"))

Note that I set the first value in the vector I pass to the annotate argument to 1e-100, so that nothing gets annotated in the CD_UKBB datasets. I then assign the same color (darkblue) to both datasets so all datapoints look the same.

Having said that, package contributions that add and improve functionality are always more than welcome!