Consensus sequences disagree with predicted numbers of repeats

MaestSi commented 3 years ago

Dear tandem-genotypes developers, I am using tandem-genotypes v1.8.3 to obtain consensus sequences for the two alleles. After running the alignment with LAST, I am running:

tandem-genotypes -v -o2 $TARGET $SAMPLE_NAME".maf.gz" > $SAMPLE_NAME"_tandem_genotypes_output_o2.txt"
tandem-genotypes-merge $FASTQ_READS $SAMPLE_NAME".par" $SAMPLE_NAME"_tandem_genotypes_output_o2.txt" > $SAMPLE_NAME"_lamassemble_consensus_sequences.fasta"

I am analysing some "difficult" samples, where one allele may show somatic mosaicism, that is, different expansion lengths in different cells. I am therefore aware that the diploid assumption may not hold in this case. In particular, I am now interested in producing a consensus sequence for the wild-type allele. In the tandem-genotypes output (-o2 option), I can see that the number of repeats for the two alleles is predicted correctly. However, neither of the two consensus sequences from lamassemble represents the wild-type allele. My questions are:

1- How are reads assigned to each allele?

2- Is it possible to retrieve which reads are used for producing consensus sequences for the two alleles?

Thanks in advance, Simone

mcfrith commented 3 years ago

1- Each read is simply assigned to the allele with nearest copy-number-change (breaking ties by choosing the shorter allele). The -v output shows each read's copy-number-change.
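The rule above can be sketched in a few lines of Python. This is only an illustration of the description given here, not the tool's actual code: the function name `assign_read` and the example numbers are made up, and "shorter allele" is taken to mean the smaller copy-number change.

```python
def assign_read(read_change, allele_changes):
    """Pick the allele whose copy-number change is nearest to the read's.

    Hypothetical helper: ties are broken in favour of the smaller
    (shorter) allele, as described in the comment above.
    """
    # Sort key: distance first, then the allele value itself for tie-breaking.
    return min(allele_changes, key=lambda allele: (abs(read_change - allele), allele))


# Made-up example: predicted alleles at 0 and +40 copies.
alleles = (0, 40)
read_changes = [-1, 0, 2, 20, 35, 80]

assignments = [assign_read(change, alleles) for change in read_changes]
print(assignments)  # [0, 0, 0, 0, 40, 40]
```

Note that the read at +20 is equidistant and goes to the shorter allele, and the read at +80 is still assigned to the +40 allele, since every read goes to whichever allele is nearest regardless of how far away it is.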

2- You can do this:

tandem-genotypes-merge seqs.fx tan-gen.txt > unmerged-sequences.fx

That will retrieve the reads for both alleles, mixed together. There doesn't seem to be a way to get the reads for each allele separately; maybe that should be fixed somehow...
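As a possible workaround, reads could be split per allele outside the tool by reapplying the nearest-allele rule. The sketch below assumes you have already extracted each read's copy-number change (for example from the -v output) into a dictionary; the parsing step is not shown, and `group_reads_by_allele` and the read names are hypothetical.

```python
from collections import defaultdict


def group_reads_by_allele(read_to_change, allele_changes):
    """Group read names by nearest allele (ties go to the shorter allele).

    read_to_change: dict of read name -> copy-number change (assumed to be
    parsed by the user from the verbose output; format not handled here).
    """
    groups = defaultdict(list)
    for name, change in read_to_change.items():
        nearest = min(allele_changes, key=lambda a: (abs(change - a), a))
        groups[nearest].append(name)
    return dict(groups)


# Made-up read names and copy-number changes.
reads = {"read1": 0, "read2": 1, "read3": 38, "read4": 95}
by_allele = group_reads_by_allele(reads, (0, 40))
print(by_allele)  # {0: ['read1', 'read2'], 40: ['read3', 'read4']}
```

The per-allele name lists could then be used to subset the read file before building each consensus, and reads far from both alleles (like `read4` here) could be inspected or excluded by hand.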

MaestSi commented 3 years ago

Dear Martin, I think this explains the issue I faced with these "difficult" samples, since outliers due to somatic mosaicism are not treated as such, and they participate in the consensus sequence as well. Thank you for the information, Simone

mcfrith commented 3 years ago

The "consensus" should be robust to "outliers"... but only up to a point.

MaestSi commented 3 years ago

Yes, I agree. Thank you for your quick answers. Simone

mcfrith / tandem-genotypes

Consensus sequences disagree with predicted numbers of repeats #13