Closed bw2 closed 2 months ago
Hi Ben,
Yes, this is a feature in LongTR that if there is a nearby (user-defined window, default 5bp) SNP or InDel, it will expand the repeat boundary to include the variation, but the original boundary is recorded in the start and end fields as you mentioned.
We decided to report GB instead of copy number because with imperfect repeats, like those with multiple motif sizes or truncated motifs, computing copy number can be subjective. For repeats with cleaner structure, what you are computing makes sense.
Let me know if you have other questions, and thank you again for using LongTR.
Best, Helia
Thanks Helia.
I'm seeing POS != START and/or POS + LEN(vcf_ref_allele) - 1 != END in ~3 to 4% of non-hom-ref calls in my samples.
Then, if I trim the short and long allele sequences to vcf_allele_sequence[START - POS : END - POS + 1]
(using python syntax for string slicing), I see ~85 to 90% of these non-hom-ref calls then become hom-ref. Let me know if this is not what you would expect.
At some loci, the variant position in the output vcf differs from the position in the input repeat regions bed. For example, the original definition of this locus
got genotyped as
with an INFO field START (10794708) that's different from POS (10794705).
I can understand this helps represent complicated mutations, but it also makes it challenging to extract the # of repeats in each allele. Not sure if this is a reasonable feature request, but it would be helpful if LongTR could directly provide this info (ie. number of repeats in allele1 and allele2). For now, I'm thinking of using the GB field for this via