mcfrith / tandem-genotypes

GNU General Public License v3.0

How to visualize reads containing expansions #20

Open gspirito opened 10 months ago

gspirito commented 10 months ago

Hello, here's my issue:

I ran tandem-genotypes on long reads (Oxford Nanopore) on a RepeatMasker locus and obtained this result:

chr11 70487135 70487173 TGC SHANK2 coding 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,2,2,3 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,2,3

Therefore there should be 13 reads with additional copies of the sequence 'TGC' compared to the reference genome. However, if I extract all reads mapping to the locus 'chr11:70487135-70487173' from the MAF file and convert it to BAM (with LAST), I cannot see any insertion with IGV, in any read mapped to that locus.
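For reference, the figure of 13 reads can be reproduced by counting the non-zero entries in the two comma-separated columns of the output line. The sketch below assumes the tandem-genotypes output layout (columns 7 and 8 hold per-read copy number changes for forward-strand and reverse-strand reads); the function name is only illustrative.

```python
# Output line copied from the post above (columns are whitespace-separated).
LINE = (
    "chr11 70487135 70487173 TGC SHANK2 coding "
    "0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,2,2,3 "
    "0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,2,3"
)


def nonzero_changes(output_line):
    """Return (forward, reverse) lists of non-zero copy number changes.

    Assumes column 7 = forward-strand reads and column 8 = reverse-strand
    reads; a column holding "." is treated as containing no reads.
    """
    fields = output_line.split()
    result = []
    for column in fields[6:8]:
        values = [] if column == "." else [int(x) for x in column.split(",")]
        result.append([v for v in values if v != 0])
    return tuple(result)


forward, reverse = nonzero_changes(LINE)
total = len(forward) + len(reverse)
print(len(forward), len(reverse), total)  # 6 7 13
```

So 6 forward-strand and 7 reverse-strand reads are reported with a positive change, and one read on each strand is reported with 3 additional copies.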

How can I visualize the STR expansions? Is there a way to know which specific reads support the expansions?

Thanks in advance,

Giovanni

mcfrith commented 10 months ago

Many thanks for your interest in tandem-genotypes. What you're doing seems correct: I don't know why it doesn't work. Maybe if you could share your intermediate files...

To know which reads support the expansions, you can use tandem-genotypes option -v.

gspirito commented 10 months ago

Thank you very much for the answer. I attach the locus I used for the analysis, the result I got from tandem-genotypes, and the MAF file containing the reads mapping to that locus:

SHANK2_locus_rpmsk.txt SAMPLE_tg_SHANK2.txt SAMPLE_MAF.txt

mcfrith commented 10 months ago

Thanks for this interesting example! In short, tandem-genotypes is "working as designed", but the design isn't looking good in this case.

It's faithfully following the "tandem-genotypes method" described here: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1667-6

This dotplot shows the alignment (red) of one read that supposedly has 3 additional copies of TGC: [image: zoomin]

To the left of the repeat (purple), there's an insertion and deletion almost adjacent to each other. tandem-genotypes is counting the insertion as a repeat expansion. It counts insertions that are slightly outside the repeat: we found it necessary to do that in general, because the precise boundaries of repeats can be fuzzy and ambiguous (for non-exact repeats).

You could use tandem-genotypes option -n20 (to only count insertions <= 20 bp outside the repeat, instead of 60).
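To make the effect of the -n window concrete, here is a minimal sketch of the rule as described in this comment: an insertion counts if it lies no more than a given number of bp outside the repeat. This is not the tandem-genotypes source code; the function name and the insertion position are hypothetical, chosen only to show how 60 and 20 can give different answers.

```python
def insertion_is_counted(ins_pos, repeat_start, repeat_end, near=60):
    """True if an insertion at reference coordinate ins_pos falls inside
    the repeat or at most `near` bp outside either boundary.

    Simplified illustration of the rule described in the comment,
    not the tool's actual implementation.
    """
    return repeat_start - near <= ins_pos <= repeat_end + near


# Repeat coordinates from the output line in the first post.
REPEAT_START, REPEAT_END = 70487135, 70487173

# Hypothetical insertion 35 bp to the left of the repeat.
ins_pos = REPEAT_START - 35

print(insertion_is_counted(ins_pos, REPEAT_START, REPEAT_END, near=60))  # True
print(insertion_is_counted(ins_pos, REPEAT_START, REPEAT_END, near=20))  # False
```

Under these assumptions, an insertion that sits between 20 and 60 bp from the repeat is counted by default but ignored with -n20.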

Maybe tandem-genotypes should be changed like this: when an insertion and deletion are so close to each other, merge them into one "in-del".
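One way the proposed merging could look is sketched below. It is only an illustration of the idea, not something tandem-genotypes does: the gap representation, the `max_distance` threshold, and the example coordinates and lengths are all assumptions.

```python
def merge_nearby_gaps(gaps, max_distance=10):
    """Merge alignment gaps that lie close together on the reference.

    gaps: iterable of (ref_pos, kind, length), kind being "ins" or "del".
    Returns a list of (start, end, net_length), where net_length is
    inserted bases minus deleted bases within each merged group.
    """
    merged = []
    for pos, kind, length in sorted(gaps):
        signed = length if kind == "ins" else -length
        # A deletion spans reference bases; an insertion spans none.
        end = pos + (length if kind == "del" else 0)
        if merged and pos - merged[-1][1] <= max_distance:
            start, prev_end, net = merged[-1]
            merged[-1] = (start, max(prev_end, end), net + signed)
        else:
            merged.append((pos, end, signed))
    return merged


# Hypothetical read: a 9 bp insertion followed 3 bp later by an 8 bp deletion.
gaps = [(100, "ins", 9), (103, "del", 8)]
print(merge_nearby_gaps(gaps))  # [(100, 111, 1)]
```

Taken alone, the 9 bp insertion would look like 3 extra copies of a 3 bp unit such as TGC. After merging, the net change is 1 bp, which is less than one repeat unit.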

gspirito commented 5 months ago

Hi, thank you for the response. Could you provide the command you used to make the plot you showed? Thank you very much.

mcfrith commented 5 months ago

Amazingly, it's still in my shell's history:

grep -B3 6f8e3f3a SAMPLE_MAF.txt | last-dotplot -a SHANK2_locus_rpmsk.txt -1 chr11:70487085-70487223 - myfig.png
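The region passed to -1 in this command equals the repeat coordinates from the first post widened by 50 bp on each side. To build the same kind of window for another locus, a small helper along these lines may be handy. The function name and the default padding are illustrative, inferred from the numbers in the command rather than stated by the author.

```python
def padded_region(chrom, start, end, pad=50):
    """Format a chrom:start-end range widened by `pad` bp on both sides."""
    return f"{chrom}:{max(0, start - pad)}-{end + pad}"


# Repeat coordinates from the tandem-genotypes output line.
print(padded_region("chr11", 70487135, 70487173))  # chr11:70487085-70487223
```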

gspirito commented 5 months ago

Thank you very much! Sorry for the delay in my message