quinlan-lab / STRling

Detect novel (and reference) STR expansions from short-read data
MIT License

Is it okay to apply STRling on PCR-based WGS dataset? #108

Closed a7420174 closed 2 years ago

a7420174 commented 2 years ago

Hi, I'm using STRling on PCR-based WGS data and I'm not sure whether that is appropriate. I couldn't find anything about this in your documentation or paper, so I wanted to ask.

Also, if it is okay, could you recommend some filters that are useful for short tandem repeat QC on PCR-based WGS data?

Thanks, JaeHyun

hdashnow commented 2 years ago

Hi JaeHyun, I haven't tested STRling on PCR+ WGS data. Based on my experience with exomes, it will likely work, but may be less accurate, especially at high-GC loci. I would suggest excluding homopolymers, low-complexity regions (LCRs), segmental duplications, telomeres and centromeres, as described in the paper. Make sure all your controls are sequenced in the same way as your cases. Warm regards, Harriet
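To illustrate the kind of exclusion filter suggested above, here is a minimal Python sketch. It is not STRling code and not the procedure from the paper. The function names are hypothetical, the call fields (`chrom`, `left`, `right`, `repeatunit`) are assumed for illustration, and coordinates are assumed to be 0-based half-open as in BED files. In practice the excluded regions would come from annotation tracks of your choice for LCRs, segmental duplications, telomeres and centromeres.

```python
from bisect import bisect_left
from collections import defaultdict


def build_exclusion_index(regions):
    """Merge (chrom, start, end) half-open intervals into sorted, disjoint lists per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlapping or touching: extend the previous interval.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps_excluded(index, chrom, left, right):
    """Return True if [left, right) overlaps any excluded interval on chrom."""
    if chrom not in index:
        return False
    right = max(right, left + 1)  # treat a zero-length locus as a single position
    starts, ends = index[chrom]
    # Last merged interval that starts before `right`; intervals are disjoint,
    # so it is the only candidate that can still reach past `left`.
    i = bisect_left(starts, right) - 1
    return i >= 0 and ends[i] > left


def is_homopolymer(repeatunit):
    """A repeat unit made of a single distinct base, e.g. 'A'."""
    return len(set(repeatunit.upper())) == 1


def keep_call(call, index):
    """Keep a call only if it is not a homopolymer and lies outside all excluded regions."""
    if is_homopolymer(call["repeatunit"]):
        return False
    return not overlaps_excluded(index, call["chrom"], call["left"], call["right"])


# Made-up regions and calls, purely for illustration.
excluded = build_exclusion_index([
    ("chr1", 0, 10_000),
    ("chr1", 9_000, 12_000),
    ("chr2", 500, 800),
])
calls = [
    {"chrom": "chr1", "left": 11_000, "right": 11_050, "repeatunit": "CAG"},
    {"chrom": "chr1", "left": 50_000, "right": 50_020, "repeatunit": "A"},
    {"chrom": "chr1", "left": 50_000, "right": 50_030, "repeatunit": "CAG"},
    {"chrom": "chr2", "left": 900, "right": 930, "repeatunit": "AAG"},
]
kept = [c for c in calls if keep_call(c, excluded)]
```

For real data, a dedicated interval tool such as bedtools would do the same job. The same filter should be applied identically to cases and controls.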

a7420174 commented 2 years ago

Thanks for the reply, Harriet.

Then what do you think about using STRling output columns like depth as quality metrics? Can they improve the quality of STR data (e.g. depth ≥ 5)?

hdashnow commented 2 years ago

Yes, applying a depth filter would be a reasonable approach. Just so you know, for typical PCR-free WGS, after applying the suggested filters you would expect to see on the order of 20-100 significant outliers per individual. If you then restrict to those near genes, that should bring the number down even further. I'd suggest looking at the list of variants and seeing how much further filtering is needed to achieve your goals.
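The depth filter discussed above could be sketched as follows. This is an assumption-laden illustration, not part of STRling: the thread only confirms that a depth column exists, so the exact column names (`depth`, `sample`), the table layout, and the example rows are hypothetical, and the threshold of 5 is the asker's example value.

```python
import csv
import io
from collections import Counter


def filter_by_depth(rows, min_depth=5.0, depth_col="depth"):
    """Keep rows whose depth parses as a number and is at least min_depth."""
    kept = []
    for row in rows:
        try:
            depth = float(row[depth_col])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric depth (e.g. "NA"): drop the row
        if depth >= min_depth:  # NaN compares False, so it is dropped as well
            kept.append(row)
    return kept


def count_per_sample(rows, sample_col="sample"):
    """Count remaining calls per individual, to compare against what you expect."""
    return Counter(row[sample_col] for row in rows)


# Made-up tab-separated table standing in for a per-locus output file.
example_tsv = (
    "sample\tchrom\tleft\tright\trepeatunit\tdepth\n"
    "s1\tchr1\t100\t130\tCAG\t31.5\n"
    "s1\tchr2\t200\t230\tAAG\t3.0\n"
    "s2\tchr1\t100\t130\tCAG\t5\n"
    "s2\tchr3\t400\t420\tAT\tNA\n"
)
rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
passing = filter_by_depth(rows, min_depth=5.0)
per_sample = count_per_sample(passing)
```

Counting the calls that remain per individual gives a quick check of whether further filtering is needed.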

a7420174 commented 2 years ago

Oh, thank you. Your suggestions are very helpful!

a7420174 commented 2 years ago

Hi, Harriet. Could I ask you more questions?

I am comparing STRling with another tool, ExpansionHunter Denovo (EHdn). EHdn detects lots of tandem repeats near centromeric and telomeric regions, but STRling detects only a few. Is there a reason for that? Also, I called repeats using the T2T-CHM13 reference genome, but I'm not certain that my results are reliable. Have you ever called repeats using T2T-CHM13? If so, I'd appreciate it if you could share your experience.

hdashnow commented 2 years ago

Are the STRs in centromeres/telomeres or just near them? I'd be very cautious about any variant calling in centromeres/telomeres, segmental duplications or low complexity regions.

I have not tried the T2T genome yet. I can't imagine it would cause a problem. If anything, it should improve things. But I haven't assessed that specific question.

a7420174 commented 2 years ago

Umm, yes, they are mainly subtelomeric or peri/centromeric satellite repeats, and I also doubt the estimated sizes of the STRs detected by EHdn. I'm just curious about the difference in STR detection between the two tools. I thought maybe STRling applies a blacklist of regions.

hdashnow commented 2 years ago

STRling reports all regions by default. Filtering for specific regions would be up to the user.

a7420174 commented 2 years ago

Aha, okay. Then it's likely that the difference comes from the algorithms. Thanks so much!