Closed jdblischak closed 3 years ago
As we discussed in the paper, a larger theta implies a stronger penalty and makes it more likely that smaller signal segments are detected. When theta is 0.7, a significant region may contain only a single base pair. That single base pair is of interest, and is probably a proper subset of a true signal region. If you wish to obtain larger regions, we recommend using a smaller theta or the data-adaptive theta approach.
In addition, when theta is small, a true signal region may be identified as multiple smaller segments. Therefore, we recommend that you merge identified regions that are no more than 100 KB apart into a single region.
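The merging step suggested above can be sketched as follows. This is a minimal illustration in Python, not part of LOGODetect: the function name `merge_nearby_regions`, the `(chrom, begin, end)` tuple layout, and the reading of "100 KB" as 100,000 bp are all assumptions made for the example.

```python
def merge_nearby_regions(regions, max_gap=100_000):
    """Merge regions on the same chromosome whose gap is at most max_gap bp.

    `regions` is an iterable of (chrom, begin, end) tuples with begin <= end.
    This layout is hypothetical; adapt it to the columns of your own output.
    """
    merged = []
    # Sorting by (chrom, begin, end) puts candidate neighbours side by side.
    for chrom, begin, end in sorted(regions):
        if merged and merged[-1][0] == chrom and begin - merged[-1][2] <= max_gap:
            # Close enough to the previous region: extend it.
            prev_chrom, prev_begin, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_begin, max(prev_end, end))
        else:
            merged.append((chrom, begin, end))
    return merged


example = [
    (1, 1_000_000, 1_000_000),  # single base pair region
    (1, 1_050_000, 1_080_000),  # 50 kb from the first region: merged
    (1, 1_300_000, 1_400_000),  # 220 kb from the merged region: kept separate
    (2, 1_090_000, 1_100_000),  # different chromosome: never merged
]
print(merge_nearby_regions(example))
```

Note that a single base pair region is handled like any other interval here, with `begin == end`.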
> If you wish to obtain larger regions, we recommend using a smaller theta or the data-adaptive theta approach.
Agreed. In my experience, the data-adaptive theta approach chooses results with larger regions. I am interested in the smaller regions returned with theta=0.7 because I am trying to automatically plot and compare the results across thetas.
> The single base pair is of interest, and is probably a proper subset of a true signal region.
> we recommend that you merge identified regions that are no more than 100 KB apart into a single region.
Given the above comments, what do you think of updating LOGODetect to return the exact base pair position for the beginning and end of the region? It seems preferable for all use cases. If the region is a single SNP, you know exactly which SNP. If multiple nearby smaller regions should be merged, knowing the exact start/stop positions of each region would make it easier to merge them. And even for the standard larger regions that LOGODetect typically identifies, I think it would be useful to know exactly which SNPs bound the region. Is there some advantage to rounding the start and stop positions of the regions that I am not considering?
I agree that returning the exact base pair position is preferable. The software has been updated accordingly.
I have some results from using a theta of 0.7 where `begin_pos` and `end_pos` are the same. I had assumed this was due to the rounding in `BiScan.R`, and thus that the region was simply less than 1000 bp.
https://github.com/ghm17/LOGODetect/blob/8d4c75be7d5e225a2d75c5f18adb6609149768b0/Code/BiScan.R#L142-L143
However, when I re-ran it with the debugger, I discovered that the beginning and end of the region pointed to the exact same base pair (i.e. `begin == stop`). Surprisingly, one of these single base pair regions has a `stat` greater than 3.
Could you please help me interpret these results? Does a significant region of only a single base pair indicate 1) that this SNP alone is of particular interest, 2) a spurious result that should be ignored, or 3) a potential problem with my input data?