santiago-es / Telometer

A simple regular expression based method for measuring telomere length from long-read sequencing
MIT License
1 stars 1 forks source link

Different telomere lengths obtained from barcoded and non-barcoded data #7

Closed jleehan closed 3 weeks ago

jleehan commented 3 weeks ago

Hello,

I really appreciate you creating this program! I am currently in the process of stress testing it for my own purposes. While multiplexing data from a run to test some treatment conditions, I noticed that the unbarcoded and barcoded data appear to be handled differently in the pipeline than the barcoded data. While it's understandable that some reads would be lost as a result of unclassified reads, I did not anticipate to receive both more telomere reads in total and greater telomere lengths in barcoded samples than in undifferentiated samples. When I take the bulk, unbarcoded fastq files, the barcoded fastq files, and the unclassified fastq files produced by this run and run them through the pipeline, I get 1110 telomere reads from the unbarcoded fastq, and 1170 telomere reads from the barcode01+barcode02+barcode03+unclassified (see figure and commands below; data taken from "telomere_length" column of telometer tsv output). Additionally, I get much longer telomere read lengths in barcode01 than captured at all in undifferentiated reads.

I was wondering if you could explain why this is happening? I've checked some of the individual reads and it doesn't seem like any trimming is occurring as a result of barcoding and I don't think that would account for the very significant differences I'm observing so far.

Thanks!

telometerbarcodingchanges1

for i in *fastq.gz
do 
sample=`basename $i | awk -F_ '{print $2}'`
minimap2 -ax map-ont -t 28 -N 5 /path/to/t2t-and-subtel.fa $i -YLo ${sample}.sam
samtools view -bho ${sample}.bam ${sample}.sam
samtools sort -o ${sample}_srt.bam ${sample}.bam
samtools index ${sample}_srt.bam
telometer -b ${sample}_srt.bam -o ${sample}_tl.tsv -m 200 -c r10
done
santiago-es commented 3 weeks ago

thanks for testing!

So if I understand correctly, this is data from one telomere capture experiment with 3 barcodes and you either A) did not demux the fastqs so all of the data was analyzed at once (undifferentiated200bp ?) versus demuxed (barcode01+02...unclassified) and obtained more telomeres in the latter instance than the former? If that is the case, that is indeed surprising as the number of telomeres should be exactly the same and the long telomeres in bc01 should be included in the undifferentiated data.

Reads with multiple mappings should be handled by only keeping the measurement from the alignment with the highest MAPQ score in the output data. Have you tried filtering the reads by MAPQ score or double checking that there are not multiple mappings slipping through in the undifferentiated or barcoded data (can do by checking that no read_ids are repeated).

I didn't think to test this particular case, but given that the only steps which filter reads are alignment and the terminal check, I'm guessing that there are differences in the minimap2 outputs between undifferentiated and barcoded reads, although I'm not certain why that would be the case either. You can also try eliminating the -N argument to minimap2. I originally used it because I was afraid of losing telomeres in supplementary alignments, but I am less concerned about this now and do not use this argument anymore but haven't updated the main page yet.

jleehan commented 3 weeks ago

So if I understand correctly, this is data from one telomere capture experiment with 3 barcodes and you either A) did not demux the fastqs so all of the data was analyzed at once (undifferentiated200bp ?) versus demuxed (barcode01+02...unclassified) and obtained more telomeres in the latter instance than the former?

Correct.

Have you tried filtering the reads by MAPQ score or double checking that there are not multiple mappings slipping through in the undifferentiated or barcoded data

You can also try eliminating the -N argument to minimap2

I'll try testing both these options tomorrow morning. Thank you!

jleehan commented 3 weeks ago

@santiago-es there are duplicate reads that are about if not exactly the number of reads by which the unclassified+barcoded reads exceed the undifferentiated reads. Those duplicate reads are sorted into both the barcodes and the unclassified reads so that's the number issue solved, but it doesn't solve the different lengths issue.

There are about 5 reads that have a MAPQ of 0 in the unclassified reads, but that doesn't explain the difference in the lengths of reads found in barcode01. Those reads are all unique within their barcode groups, but I'm not sure what other checks to do at this point.

santiago-es commented 3 weeks ago

that is very interesting. Is there any way you could send a very minimal BAM file that has the read_ids with very long telomeres in barcode01 that for some reason are shorter when analyzed in bulk? Are those reads with long telomeres in the barcode measured in the bulk analysis or are they skipped / missing?

jleehan commented 3 weeks ago

No need, this was apparently a copy/pasting into prism problem, not a problem with telometer. I was using the actual read lengths for the first iteration of this file instead of the telomere lengths and then switched over but apparently didn't replace some of the previous read lengths when doing this. My apologies!