knausb / vcfR

Tools to work with variant call format files

overlapping sliding window #176

Open michaeljmetzger opened 3 years ago

michaeljmetzger commented 3 years ago

This is great! Currently it appears that the windows are non-overlapping (i.e., 1-1000, 1001-2000, 2001-3000, etc.). I was wondering if you have developed any way to modify the method to use an overlapping sliding window (i.e., 1-1000, 101-1100, 201-1200, etc.). We were thinking this could allow for more precise definitions of the copy number breakpoints, while still using data from a large window size.

Thanks, Michael

knausb commented 3 years ago

Hi Michael,

I'm not sure I follow you. Do you think you could come up with an example? I think you could come up with overlapping windows by altering the winsize parameter, but I'm not sure how to combine the different runs. Also note that the more windows you have, the more computational time it will require. Because this is an analysis of heterozygous positions, it needs CNVs that are large relative to the rate of heterozygosity in your organism, so it will miss small features. If you're interested in precise identification of breakpoints, you may want to include coverage data, such as samtools pileup, with the caveat that Illumina coverage data is highly variable, so it has its challenges as well.

Good luck! Brian

michaeljmetzger commented 3 years ago

Thanks for your response. My understanding of vcfR is that it breaks the genome into non-overlapping segments (windows). The winsize parameter is the length of each of these. So for a window size of 1000 the first segment would be 1-1000 and the second would be 1001-2000. For a sliding window, there would be two parameters: window size and step size. For example, with a window size of 1000 and a step size of 100, the first segment would be 1-1000 and the second would be 101-1100. It would require a different calculation of the final coverage, as each position would be covered by multiple windows. It sounds like this has not been implemented in this program. If we can get it working, we will let you know. Thanks, Michael
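
To make the window size / step size idea concrete, here is a minimal sketch in plain Python. It is not part of vcfR and does not use its API; the function names `sliding_windows` and `per_bin_mean` are hypothetical, and averaging the overlapping windows is only one assumed way to combine them. Coordinates are 1-based and inclusive, as in the examples above.

```python
def sliding_windows(seq_length, win_size, step):
    """Return 1-based inclusive (start, end) windows along a sequence.

    The last window is truncated at seq_length. With step == win_size
    this reduces to the non-overlapping case.
    """
    if win_size < 1 or step < 1 or step > win_size:
        raise ValueError("need 1 <= step <= win_size")
    windows = []
    start = 1
    while start <= seq_length:
        end = min(start + win_size - 1, seq_length)
        windows.append((start, end))
        if end == seq_length:
            break  # the sequence end has been reached
        start += step
    return windows


def per_bin_mean(windows, values, seq_length, step):
    """Combine per-window values into one value per step-sized bin.

    Each bin gets the mean of the values of all windows that fully
    contain it (an assumed combination rule, chosen for illustration).
    """
    bins = []
    for bin_start in range(1, seq_length + 1, step):
        bin_end = min(bin_start + step - 1, seq_length)
        covering = [
            v for (s, e), v in zip(windows, values)
            if s <= bin_start and bin_end <= e
        ]
        bins.append((bin_start, bin_end, sum(covering) / len(covering)))
    return bins


# Window size 1000, step 100, on a hypothetical 1300 bp sequence.
windows = sliding_windows(1300, 1000, 100)
# Made-up per-window copy number estimates, one per window.
values = [2.0, 2.0, 3.0, 3.0]
bins = per_bin_mean(windows, values, 1300, 100)
```

Here `windows` starts with (1, 1000) and (101, 1100), and each 100 bp bin in `bins` is resolved at the step size while still drawing on the larger windows that cover it.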