lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License
1.35k stars 310 forks source link

seqtk telo -m works partially with 8mer wasp telomere #201

Open jbh-cas opened 1 year ago

jbh-cas commented 1 year ago

Thank you for adding the telo command, it is much faster than the grep_telomere script I had put together.

I was hoping to use this with a wasp assembly to see its telomeres. As this pub mentions https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8977481 wasp telomeres can be from 5 to 9mers. Our assembly contains the 8mer TTATTGGG. Setting this with -m does not find any telos, whereas using the revcomp -m CCCAATAA finds this set but not the TTATTGGG at the bottom of the test file. I suspect it is the 6mer hash data structure. Thought I would mention it, though not sure how much effort you think is worth extending the -m usage scenarios.

best, Jim Henderson

Test text I have been using:

>Chr1_chal_sis
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
ATAACCGCCCTTCCTTATGGTATAACCAACCCCACTATAATATAACCACTCCCACTTACGGCTTGACCACTCCCACTTACAACTTAACCACTCCAACTTGCGGTTTAACAACCCCCCTGG
CGTTTAAATACCCCCACATACGATTTAACCACCACCCTATACAGTTTAACTACCCTACTTGCGGTTTAACAATCCCCCTGGCGGTAAACCACTCCCCACTTATGGTTTAACCACTCCCAC
TTTACGGTTTGACAAGCCTCATATTTAGGTTTAACCACCCTACTTTAAGGTTTAACCAACCCCCTTTTATAGTATAACTATCCTCACTTATACTATGACAACCCCGCTCATAGTATAACT
ATCCCAACCTAACACATAACCACCCGTGCTTACGATATAACCCATCCTACTTACACTATAACCACCTACTTACAGTGTAACTACCCCAAATTATGATATAACCATCCCCACTTACAGTAC
AACCCCCCTAAATCATAACAAAACCACTACAACTTTACGGTTTAACCACCCCAGCTCATGATTTAACTTACAACATAACCATCCACCATTGTATAACCATCCCCATTTAATACACAACTA
CCCTCGCTTACAGTATAACCTACACAATTACGGTACAACCACCCTACTCACAATATAACCACCCCTACTTACGGTATAACAATCCGTAACATACCCCCGTCGGGTATAATTATTCCGTTA
CGGCAATAACCCACCAGTTACAGCATAAACCCTAATTACGGCAATGCCTCCAGTACGATATAACCGCAATACGGTATAACCCCAGTAAGGTATTACCTCAATACAGCACACCCTTGGTAC
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
CGTTTAAATACCCCCACATACGATTTAACCACCACCCTATACAGTTTAACTACCCTACTTGCGGTTTAACAATCCCCCTGGCGGTAAACCACTCCCCACTTATGGTTTAACCACTCCCAC
lh3 commented 1 year ago

At least for human, (TTAGGG)n only occurs at the 3'-end of a chromosome. I assume Chr1_chal_sis is very long and you are only showing part of the sequence. If this is the case, seqtk telo wouldn't consider (TTATTGGG)n as part of a telomere because it is on the 5'-end.

lh3 commented 1 year ago

Just checked the paper. They were using short reads. The inverted motif could be an assembly artifact. It would be more convincing if we have HiFi reads.

jbh-cas commented 1 year ago

We do have a HiFi assembly made with hifiasm using PacBio long reads. And, as you say, the test text was a very truncated view of the Chr, thanks for pointing out the folly of using it as a reliable test.

I found the 8mer from the short read paper but confirmed its existence in our HiFi assembly using grep for the 8mer. I'll go back to that and get additional info.

Thank you.

lh3 commented 1 year ago

its existence in our HiFi assembly using grep for the 8mer.

Where are TTATTGGG 8-mers? Are they located on the 3'-ends of contigs, or following (CCCAATAA)n on the 5'-end like the example you have shown?

jbh-cas commented 1 year ago

At the bottom or very near bottom.

I made a 8Mbp test file named tst2 with the CCCAATAA at top TTATTGGG at bottom and revcomp version tst2_rc to flip them and these were the results:

$ seqtk telo -m CCCAATAA tst2
Chr1r_chal_sis  0   960 8403000
960 8403000
$ seqtk telo -m CCCAATAA tst2_rc
Chr1r_chal_sis  8402040 8403000 8403000
960 8403000
$ seqtk telo -m TTATTGGG tst2
0   8403000
$ seqtk telo -m TTATTGGG tst2_rc
0   8403000

$ grep TTATTGGG tst2_rc -n 
70019:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70020:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70021:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70022:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70023:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70024:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70025:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70026:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG

I used seqtk to fold at 120 chars for the file, so first TTATTGGG hit is at 8,402,280 bases in.

I'm sure you have more important things, I just thought having tried in this non-standard case that I would pass it along. I can always use a squishy regex with grep to find as I have been doing with the vertebrate 6mer.

As an aside, I do hope that telomere awareness can be used in assemblers and scaffolders as an option. We run a telomere script after every assembly or use of yahs and show the records and their telos as TOP, TOP_NEAR, BOTTOM, BOTTOM_NEAR or MIDDLE.

Thanks for taking the time.