How to interpret significant regions that are a single base pair?

ghm17 / LOGODetect

LOGODetect is a powerful tool to identify small segments that harbor local genetic correlation between two traits/diseases.

GNU General Public License v3.0

How to interpret significant regions that are a single base pair? #10

Closed jdblischak closed 3 years ago

jdblischak commented 3 years ago

I have some results from using a theta of 0.7 where begin_pos and end_pos are the same. I had assumed this was due to the rounding in BiScan.R, and thus that the region was simply less than 1000 bp.

https://github.com/ghm17/LOGODetect/blob/8d4c75be7d5e225a2d75c5f18adb6609149768b0/Code/BiScan.R#L142-L143
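To illustrate why I could not tell these two cases apart from the output alone, here is a minimal Python sketch. It assumes positions are reported to the nearest kilobase; the `to_kb` helper and the example coordinates are hypothetical and are not taken from BiScan.R.

```python
def to_kb(pos_bp):
    # Hypothetical rounding step: report a base-pair position in kilobases.
    return round(pos_bp / 1000)

# A short region (300 bp long) and a single-SNP region (begin == end).
short_region = (1_234_100, 1_234_400)
single_snp = (1_234_250, 1_234_250)

def rounded(region):
    begin_bp, end_bp = region
    return (to_kb(begin_bp), to_kb(end_bp))

# After rounding, both regions report identical begin and end positions,
# so the output cannot show which one is truly a single base pair.
same_after_rounding = rounded(short_region) == rounded(single_snp)
```

With rounded output, only stepping through the code (or returning exact positions) reveals which case applies.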

However, when I re-ran it with the debugger, I discovered that the beginning and end of the region pointed to the same exact base pair (i.e. begin == stop). Surprisingly, one of these single base pair regions has a stat greater than 3.

Could you please help me interpret these results? Does a significant region of only a single base pair indicate 1) that this SNP alone is of particular interest, 2) a spurious result that should be ignored, or 3) a potential problem with my input data?

ghm17 commented 3 years ago

As we discussed in the paper, a larger theta implies a stronger penalty and is more likely to detect smaller signal segments. When theta is 0.7, a significant region may contain only a single base pair. The single base pair is of interest, and is probably a proper subset of a true signal region. If you wish to obtain larger regions, we recommend using a smaller theta or the data-adaptive theta approach.

In addition, when theta is small, a true signal region may be identified as multiple smaller segments. Therefore, we recommend merging identified regions that are no more than 100 kb away from each other into a single region yourself.
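The merging step could be sketched as follows. This is a minimal Python illustration, not part of LOGODetect: the `merge_regions` function, the `(chrom, begin, end)` tuple layout, and the example coordinates are assumptions, and how to combine the per-region statistics after merging is left open.

```python
def merge_regions(regions, max_gap=100_000):
    """Merge (chrom, begin, end) regions on the same chromosome
    whose gap is at most max_gap base pairs."""
    merged = []
    # Sorting by chromosome, then begin position, puts neighbors side by side.
    for chrom, begin, end in sorted(regions):
        if merged and merged[-1][0] == chrom and begin - merged[-1][2] <= max_gap:
            prev = merged[-1]
            # Extend the previous region instead of starting a new one.
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, begin, end))
    return merged

# Hypothetical regions, positions in base pairs.
regions = [
    (1, 500_000, 500_000),   # single-base-pair region
    (1, 560_000, 580_000),   # 60 kb from the region above, so merged with it
    (1, 900_000, 950_000),   # 320 kb from the previous region, kept separate
    (2, 510_000, 520_000),   # different chromosome, never merged with chr 1
]
merged = merge_regions(regions)
```

Note that this only works reliably when begin and end are exact base-pair positions; with rounded positions the gaps between regions are only approximate.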

jdblischak commented 3 years ago

> If you wish to obtain larger regions, we recommend using a smaller theta or the data-adaptive theta approach.

Agreed. In my experience, the data-adaptive theta approach chooses results with larger regions. I am interested in the smaller regions returned with theta=0.7 because I am trying to automatically plot and compare the results across thetas.

> The single base pair is of interest, and is probably a proper subset of a true signal region.

> we recommend merging identified regions that are no more than 100 kb away from each other into a single region yourself.

Given the above comments, what do you think of updating LOGODetect to return the exact base pair position for the beginning and end of the region? It seems preferable for all use cases. If the region is a single SNP, you know exactly which SNP. If multiple nearby smaller regions should be merged, knowing the exact start/stop positions of each region would make it easier to merge them. And even for the standard larger regions that LOGODetect typically identifies, I think it would be useful to know exactly which SNPs bound the region. Is there some advantage to rounding the start and stop positions of the regions that I am not considering?

ghm17 commented 3 years ago

I agree that returning the exact base pair position seems preferable. The software has been updated accordingly.