Variant position in the output VCF differs from locus coordinates in the input repeat region bed file

bw2 commented 2 months ago

At some loci, the variant position in the output vcf differs from the position in the input repeat regions bed. For example, the original definition of this locus

chr1    10794708    10794716    3   3   1-10794708-10794716-GAA

got genotyped as

chr1    10794705        1-10794708-10794716-GAA      TTGGAAGAAGAA    TGGAAGAAGAA     .       .       START=10794708;END=10794716;PERIOD=3;NSKIP=0;NFILT=0;INEXACT_ALLELE=0;BPDIFFS=-1;DP=23;DSNP=0;DFLANKINDEL=0;AN=2;REFAC=1;AC=1   GT:GB:Q:PQ:DP:DSNP:DFLANKINDEL:PDP:PSNP:GLDIFF:ALLREADS:MALLREADS     0|1:0|-1:1.00:0.50:23:0:0:19.00|4.00:0|0:12.02:-1|5;0|18:-1|4;0|19

with an INFO field START (10794708) that's different from POS (10794705).

I can understand this helps represent complicated mutations, but it also makes it challenging to extract the # of repeats in each allele. Not sure if this is a reasonable feature request, but it would be helpful if LongTR could directly provide this info (ie. number of repeats in allele1 and allele2). For now, I'm thinking of using the GB field for this via

allele1_repeats = (GB1 + info[START] - info[END] + 1)/info[PERIOD]
allele2_repeats = (GB2 + info[START] - info[END] + 1)/info[PERIOD]

heliziii commented 2 months ago

Hi Ben,

Yes, this is a feature in LongTR that if there is a nearby (user-defined window, default 5bp) SNP or InDel, it will expand the repeat boundary to include the variation, but the original boundary is recorded in the start and end fields as you mentioned.

We decided to report GB instead of copy number because with imperfect repeats, like those with multiple motif sizes or truncated motifs, computing copy number can be subjective. For repeats with cleaner structure, what you are computing makes sense.

Let me know if you have other questions, and thank you again for using LongTR.

Best, Helia

bw2 commented 2 months ago

Thanks Helia.

I'm seeing POS != START and/or POS + LEN(vcf_ref_allele) - 1 != END in ~3 to 4% of non-hom-ref calls in my samples. Then, if I trim the short and long allele sequences to vcf_allele_sequence[START - POS : END - POS + 1] (using python syntax for string slicing), I see ~85 to 90% of these non-hom-ref calls then become hom-ref. Let me know if this is not what you would expect.

gymrek-lab / LongTR

Variant position in the output VCF differs from locus coordinates in the input repeat region bed file #8