markschl / seq_io

FASTA and FASTQ parsing in Rust
MIT License

Issue when get the size of sequence length in reference genome file #14

Closed huangnengCSU closed 1 year ago

huangnengCSU commented 1 year ago

Hi developer, when I used seq_io::fasta::Reader to load a reference genome (such as GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna), the reported length of each chromosome sequence was incorrect (larger than the true sequence length). This is because in a reference FASTA file, the sequence of each chromosome is split across multiple lines, and I think seq_io::fasta::Reader includes all the LFs when calculating the sequence length.

Best, Neng

markschl commented 1 year ago

Hi! Could you maybe post some example code showing how you determined the sequence length? This would help me reproduce it. Actually, if you follow this example, the length should be correct, since the individual sequence lines should not have any CR/LF in them. In contrast, Record::seq() does contain all line endings, so the length of that slice will be larger than the actual sequence length.
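
To make the difference concrete, here is a minimal std-only sketch. It does not call seq_io at all; the function name `seq_len_without_line_endings` is hypothetical, and `raw_seq` only stands in for a multi-line slice such as the one Record::seq() returns. It counts every byte except CR and LF, which should match the true sequence length.

```rust
/// Counts the sequence bytes in a raw, possibly multi-line FASTA sequence
/// slice, skipping CR (`\r`) and LF (`\n`) bytes.
///
/// `raw_seq` is an assumption for illustration: a byte slice that still
/// contains the line endings from the file.
fn seq_len_without_line_endings(raw_seq: &[u8]) -> usize {
    raw_seq
        .iter()
        // Keep only bytes that are not part of a line ending.
        .filter(|&&b| b != b'\n' && b != b'\r')
        .count()
}
```

For a sequence stored as `ACGT\nAC\n`, the raw slice is 8 bytes long, while this function returns 6. If the lengths you see are too large by roughly one byte per line, the line endings are probably being counted.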