santiago-es / Telometer

A simple regular expression based method for measuring telomere length from long-read sequencing
MIT License

What check is performed to ensure sequences derive from the bona-fide telomere terminus? #1

Closed by maxfieldk 1 month ago

maxfieldk commented 1 month ago

Hi, thank you for developing this incredibly useful tool! I have one question. In the text you write: "A check is then performed to ensure that identified telomeric sequences are terminal and telomeres are measured from the read terminus until the sub/telomeric boundary." What is this check?

My concern is that if a terminal piece of DNA containing the telomere end and subtelomeric region is fragmented (suppose a sequence that goes || subtelomere + 10 kb of repeat || becomes || subtelomere + 5 kb of repeat || and || 5 kb of repeat ||), the algorithm might toss out the fragment containing the real terminus (since it does not harbor any subtelomeric sequence) and, by only considering the fragment that has the subtelomeric region, erroneously settle on a shorter telomere length than is actually the case. This concern only applies to an analysis using data produced via the standard library prep protocol (not the telorette-enriched version). Thanks again!

santiago-es commented 1 month ago

Thanks for using! The relevant part of the source code for terminal telomeres is this:

    # Reject reads that align to the interior of the contig: the alignment
    # must start within the first 15 kb or the last 30 kb of the reference
    if alignment_start >= 15000 and alignment_start <= reference_genome_length - 30000:
        return None

and

    # Locate every telomere repeat match in the read sequence
    telomere_start = [m.start() for m in re.finditer(telomere_repeats_re, seq)]
    if telomere_start:
        telomere_start = telomere_start[0]
        # Discard reads whose first repeat begins >100 bp into the read
        # while >200 bp of sequence still follows it
        if telomere_start > 100 and (len(seq) - telomere_start > 200):
            return None

So first, the read has to align within either the first 15 kb or the last 30 kb of the reference contig it was aligned to (so if you're using a custom genome other than CHM13 with an odd contig configuration, you may want to beware or edit the script).

The second block of code finds the position of the first telomere repeat in the read and discards the read if that repeat begins more than 100 bp into the read while more than 200 bp of sequence still follows it, i.e. the telomere must sit at (or very near) a read terminus. You can play around with removing this second block, and what you will find is that there will suddenly be many more telomere reads in the <100 bp range, likely just fragments created in the library prep process.
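To make the logic concrete, here is a minimal standalone sketch combining both checks. The function and variable names are mine, the repeat pattern is a placeholder, and the read is assumed to have already been oriented so its terminus comes first; the actual script may differ in all of these details:

```python
import re

# Placeholder motif pattern; the real script builds telomere_repeats_re
# from its own set of telomere repeat motifs.
TELOMERE_REPEATS_RE = re.compile(r"(?:TTAGGG){3,}|(?:CCCTAA){3,}")

def passes_terminal_checks(seq, alignment_start, reference_contig_length):
    """Return True if a read plausibly derives from a chromosome terminus."""
    # Check 1: the alignment must start within the first 15 kb or the
    # last 30 kb of the reference contig.
    if 15000 <= alignment_start <= reference_contig_length - 30000:
        return False
    # Check 2: the first telomere repeat must begin within 100 bp of the
    # read start, unless fewer than ~200 bp of sequence follow it.
    starts = [m.start() for m in TELOMERE_REPEATS_RE.finditer(seq)]
    if not starts:
        return False
    first = starts[0]
    if first > 100 and (len(seq) - first > 200):
        return False
    return True
```

In this sketch a read whose repeats begin well inside the read, with plenty of sequence still following them, is rejected as a likely internal fragment, while a read whose repeat region sits within ~200 bp of its end is let through by the second condition.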

So, if I understand your example correctly, a 10 kb subtelomere + 5 kb telomere that has become a fragmented 5 kb subtelomere + 5 kb telomere should be no problem for the algorithm. However, a fragmentation pattern that results in total loss of the subtelomere (less than 200 bp of subtelomere left) will indeed lead to the read being thrown out of the analysis.

For that reason, telomere length measurement from WGS (e.g. a standard LSK114 library prep) should be done with DNA that has not been significantly sheared or fragmented. Additionally, as I mention, it's worth checking whether you need to adjust the minimum read length considered for telomere length analysis (done with the -m flag).

I think a superior approach might be to optionally filter for read length >> telomere length, but have not tested this yet. Let me know if I understood your concern correctly.
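As a rough illustration of that idea (this is not part of the released tool; the function name and ratio cutoff are hypothetical), one could post-filter measurements so that each read is several times longer than the telomere it reports:

```python
def filter_by_length_ratio(measurements, min_ratio=3.0):
    """Keep only measurements where read length >> telomere length.

    measurements: iterable of (read_length, telomere_length) pairs.
    min_ratio is a hypothetical tuning parameter, e.g. 3x.
    """
    return [
        (read_len, telo_len) for read_len, telo_len in measurements
        if telo_len > 0 and read_len / telo_len >= min_ratio
    ]
```

A read barely longer than its telomere could be a fragment whose subtelomeric anchor survived by chance, so requiring a comfortable length ratio would trade some yield for robustness against the fragmentation scenario discussed above.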

maxfieldk commented 1 month ago

Thanks for the explanation, Santiago! So in this example, I assume the subtelomere-containing fragment would be counted, and the telomere length would be mistakenly called as 5 kb instead of 10? I agree with your suggestion that filtering for reads >> telomere length would help with this problem. But do I have it right that this would not completely fix the issue, since one just has to get unlucky and have a piece of DNA fragment 2 kb from the terminus AND have the subtelomere-containing fraction be a long read? Have you examined whether telomere length correlates with sample N50? In any event, the telorette strategy would not suffer from these concerns, so I can use this when telomere length is the primary endpoint.

santiago-es commented 1 month ago

In the example, if there is 5 kb of subtelomere and 5 kb of telomere, the length will be counted as 5 kb. If the ground truth was a 10 kb telomere that became fragmented into a 5 kb telomere, the fragment would also be called as 5 kb, which in that case would be misleading given the ground truth, as you suggest.

Unlucky fragmentation events become less likely when libraries have not been intentionally fragmented, and as you obtain more telomere reads.

We did analyze read length vs telomere length for many of the runs we've done, which is how we determined that filtering for a minimum read length of 3-5 kb is advisable for human telomeres in an unfragmented library. Take a look at one example below:

[Figure: read length vs telomere length for PacBio telomere capture (left panels) and ONT WGS (right panels), one box per minimum read length cutoff]

This is from the supplementary info of our Nature Comms manuscript. The left side shows read length vs telomere length using PacBio (telomere capture) and the right side shows read length vs telomere length using ONT sequencing (WGS), with each box representing a different minimum length cutoff. (Unfortunately you can't really do a head-to-head WGS telomere measurement test here, since PB reads will not be long enough to measure full-length telomeres from most, if any, fragments in a PB WGS library.) As you can see, for PB HiFi measurements there is a strong correlation between read length and telomere length at all cutoffs, as PacBio is heavily bottlenecked by read length near the human telomere length. For ONT WGS, the correlation is generally much weaker and is almost zero at a minimum read length of 10 kb.

I'd say if you want to measure telomeres from WGS: go for as high coverage as you can get; try to get an N50 well above the average telomere length for your sample (for human you should be safe with an N50 > 25-30 kb); don't use random fragmentation methods; and try to get at least 200-250 telomere measurements. (We consistently see the inflection point in the standard error of the mean for random subsamples of telomeres from an experiment around 250, regardless of the sample we use, but even well below that it is <100 bp.)
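That inflection-point behavior can be checked on one's own data with a simple resampling sketch (the lengths below are synthetic, purely for illustration):

```python
import random
import statistics

def sem_of_subsample_means(telomere_lengths, subsample_size, n_reps=200, seed=0):
    """Estimate the standard error of the mean telomere length for random
    subsamples of a given size, via repeated resampling."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.sample(telomere_lengths, subsample_size))
        for _ in range(n_reps)
    ]
    return statistics.stdev(means)

# Synthetic example: lengths drawn around 10 kb with a 2 kb spread.
rng = random.Random(42)
lengths = [rng.gauss(10000, 2000) for _ in range(5000)]

# The SEM shrinks as the subsample grows; look for where it levels off.
for n in (50, 100, 250, 500):
    print(n, round(sem_of_subsample_means(lengths, n), 1))
```

Running this on real per-read telomere lengths instead of the synthetic list would show where additional measurements stop meaningfully tightening the estimate for a given sample.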

indapa commented 1 month ago

Is the source code in the repo? I didn't see any .py files.

santiago-es commented 1 month ago

No, the source code is not yet in the repo; there are still some changes and features I want to add before I upload the full source, which is why I haven't given it the full "1.0" yet. If you check out our manuscript, however, you will find an archived version of the source code used to produce the manuscript figures at the Zenodo link below:

https://doi.org/10.5281/zenodo.11058167