gymreklab / GangSTR

A tool for profiling long STRs from short reads
GNU General Public License v2.0

What is the total processed reads per sample for a region? #87

Closed: ontalin closed this issue 4 years ago

ontalin commented 4 years ago

Hi,

I am running GangSTR on WGS BAM files with an average coverage of 30X, and for a long list of positions I received the message "WARNING: Region exceeds maximum total processed reads per sample." I had assumed that the number of reads per region is comparable to the local read depth, so the maximum number of total processed reads at any site in a 30X genome would be approximately 30. Apparently I was wrong, since 30 would not exceed the default --max-proc-read (3000). Could you help me understand how to translate between local read depth and total processed reads for a region?

nmmsv commented 4 years ago

Hi,

The number of processed reads includes all of the reads that are used to genotype a region. Since our model doesn't just use reads that directly overlap the region, this number can exceed the average coverage. For example, spanning reads are included in the processed reads, but they are not counted in the average coverage.

This warning may be caused by an issue that we previously had in multi-sample processing; it has been resolved in the latest version of the master branch. Another possible cause is that some regions of the genome are covered much more deeply than average.

You can increase this limit (default 3000) using the input option --max-proc-read <int>. Please let me know if the issue persists or if you have any other questions.

Best, Nima
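To put rough numbers on the point that processed reads can exceed the coverage, here is a back-of-the-envelope sketch. It is not GangSTR's actual counting logic: it assumes uniformly placed paired-end reads, and the read length, region length, and fragment length values are hypothetical. It counts reads by whether they, or the fragment they belong to, overlap the region.

```python
def expected_overlapping_reads(coverage, read_len, region_len):
    """Expected number of reads with at least one base inside the region.

    Assumes read starts are uniform, with density coverage / read_len per bp.
    """
    return coverage / read_len * (region_len + read_len - 1)


def expected_pair_reads(coverage, read_len, region_len, fragment_len):
    """Expected number of reads from pairs whose fragment overlaps the region.

    This also counts pairs that straddle the region without either mate
    touching it. Fragment density is coverage / (2 * read_len) per bp and
    each fragment contributes two reads.
    """
    return coverage / read_len * (region_len + fragment_len - 1)


def depth_to_reach_limit(limit, read_len, region_len, fragment_len):
    """Local depth at which the pair-aware estimate would reach `limit`."""
    return limit * read_len / (region_len + fragment_len - 1)


# Hypothetical values: 30X coverage, 150 bp reads, 60 bp region, 450 bp fragments.
direct = expected_overlapping_reads(30, 150, 60)
with_pairs = expected_pair_reads(30, 150, 60, 450)
depth_for_3000 = depth_to_reach_limit(3000, 150, 60, 450)
print(round(direct, 1), round(with_pairs, 1), round(depth_for_3000))
```

Under these assumptions, both estimates at 30X stay far below 3000, which fits the explanation above: reaching the limit points to a locally very deep region or to the multi-sample issue rather than to ordinary coverage.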

ontalin commented 4 years ago

Thank you! That was very helpful. My main concern was whether naively increasing the limit would introduce false discoveries. I will now proceed with a higher limit.

Best, Onta
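For reference, a run with a higher limit could be assembled as in the minimal sketch below. The file names and the value 10000 are placeholders chosen for illustration. The --max-proc-read option is the one named in this thread; the --bam, --ref, --regions, and --out options are assumed from GangSTR's basic usage and should be checked against the installed version.

```python
import shlex


def gangstr_command(bam, ref, regions, out_prefix, max_proc_read=3000):
    """Build a GangSTR argument list with an explicit --max-proc-read value."""
    return [
        "GangSTR",
        "--bam", bam,
        "--ref", ref,
        "--regions", regions,
        "--out", out_prefix,
        "--max-proc-read", str(max_proc_read),
    ]


# Placeholder inputs; replace with real paths before running.
cmd = gangstr_command("sample.bam", "ref.fa", "regions.bed", "sample_out",
                      max_proc_read=10000)
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps the limit an explicit, logged parameter, so it is easy to see which runs used a raised value.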

nmmsv commented 4 years ago

No problem, glad I could help. I'll close this issue, but feel free to open another or email us with other questions. Best, Nima