lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License
1.35k stars 310 forks source link

Question: DNA string compressing #197

Closed jianshu93 closed 1 year ago

jianshu93 commented 1 year ago

Dear seqtk author,

I am writing to ask a question about whether we can compress DNA strings {A,G,C,T} (also N?) when reading and storing DNA sequences in memory such as kmer and minimizer. By default a character in C take 1 byte but for valid DNA sequences, there are only 4 possibilities, instead of 256 (2^8), so we can compress DNA character into 2 bit, 1/4 of a regular character. Is this already implemented in kseq.h? Since I saw a lot of kmer counting and minimizer counting tools were based on kseq.h. When there are a huge number of kmers or minimizers, memory consumption difference using 2 bit and 1 byte could be huge.

Thanks,

Jianshu

lh3 commented 1 year ago

Read the source code of kmer counters (e.g. this) to see how this is handled.