onecodex / needletail

Fast FASTX parsing and k-mer methods in Rust
MIT License

Likely a stupid and easy to answer problem: Support for N characters. #60

Closed stela2502 closed 2 years ago

stela2502 commented 2 years ago

I assume I am using your lib in the wrong way. I am sorry, but I failed to find any documentation for the needletail library, hence I come and bother you here...

I am currently trying to split FASTQ files based on R2 information (https://github.com/stela2502/split2samples) and am using your library in the process.

When I process my real world test data with the script compiled from my repo I get this error:

./target/debug/splitp -r testData/testData_R1.fastq.gz -f testData/testData_R2.fastq.gz -o testData/output/ -s mouse

thread 'main' panicked at 'cannot decode N into 2 bit encoding', /home/med-sal/.cargo/git/checkouts/kmers-1b5669a2e3e7d3a0/6d9c502/src/naive_impl/mod.rs:30:18
stack backtrace:
   0: rust_begin_unwind
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/panicking.rs:142:14
   2: kmers::naive_impl::prelude::encode_binary
             at /home/med-sal/.cargo/git/checkouts/kmers-1b5669a2e3e7d3a0/6d9c502/src/naive_impl/mod.rs:30:18
   3: <kmers::naive_impl::kmer::Kmer as core::convert::From<&[u8]>>::from
             at /home/med-sal/.cargo/git/checkouts/kmers-1b5669a2e3e7d3a0/6d9c502/src/naive_impl/kmer.rs:185:18
   4: splitp::fill_kmer_vec
             at ./src/main.rs:93:23
   5: splitp::main
             at ./src/main.rs:183:13
   6: core::ops::function::FnOnce::call_once
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I am a bloody newbie in Rust. I have seen that you have support for 'N' characters in the code and hence I do not understand this problem. I think this is what throws the error:

let mut reader = parse_fastx_file(&opts.reads).expect("valid path/file");
let mut kmer_vec = Vec::<u64>::with_capacity(12);

while let Some(record2) = reader.next() {
    let seqrec = record2.expect("invalid record");
    let norm_seq = seqrec.normalize(true); // false acts the same
    let kmers = norm_seq.kmers(9);
    fill_kmer_vec(kmers, &mut kmer_vec);
}

fn fill_kmer_vec<'a>(seq: needletail::kmer::Kmers<'a>, kmer_vec: &mut Vec<u64>) {
    kmer_vec.clear();
    for km in seq {
        kmer_vec.push(Kmer::from(km).into_u64());
    }
}

I am looking forward to your answer!

Keats commented 2 years ago

We only do 2-bit encoding, which does not support N: two bits give four codes, one each for A, C, G and T, so there is none left for N. If you want encoded k-mers with N, you need to use a 3-bit encoding and write the code yourself.
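To illustrate why N cannot be packed, here is a minimal sketch of a 2-bit encoder using only the standard library. The function names `encode_base` and `encode_kmer` and the A=0, C=1, G=2, T=3 mapping are assumptions made for this example; the actual crate may use a different mapping, and it panics on N where this sketch returns `None`.

```rust
/// Map one nucleotide to a 2-bit code, or None if it has no code (e.g. N).
/// The A=0, C=1, G=2, T=3 mapping is an assumption for this sketch.
fn encode_base(base: u8) -> Option<u64> {
    match base {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None, // all four 2-bit patterns are taken, so N cannot be represented
    }
}

/// Pack a k-mer of at most 32 bases into a u64, two bits per base.
/// Returns None if the k-mer is too long or contains a non-ACGT base.
fn encode_kmer(kmer: &[u8]) -> Option<u64> {
    if kmer.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &base in kmer {
        packed = (packed << 2) | encode_base(base)?;
    }
    Some(packed)
}

fn main() {
    println!("{:?}", encode_kmer(b"ACGT")); // Some(27)
    println!("{:?}", encode_kmer(b"ACNT")); // None
}
```

A 3-bit encoding would have room for a fifth symbol, at the cost of fitting only 21 bases into a `u64` instead of 32.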

stela2502 commented 2 years ago

Hi - Thank you for the answer. I actually do not need the ones with N's in them, and your error made me simply exclude them, which is a far more logical way to deal with the N k-mers anyhow.
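Excluding such k-mers can be sketched as below. This is a standalone illustration using the standard library's `windows` rather than needletail's iterator, and the helper names `is_acgt_only` and `valid_kmers` are hypothetical.

```rust
/// True if every base is an unambiguous A, C, G or T (either case).
fn is_acgt_only(kmer: &[u8]) -> bool {
    kmer.iter()
        .all(|b| matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'))
}

/// Collect the k-mers of `seq` that contain only A/C/G/T, skipping the rest.
fn valid_kmers(seq: &[u8], k: usize) -> Vec<&[u8]> {
    if k == 0 {
        return Vec::new(); // slice::windows panics on a window size of 0
    }
    seq.windows(k).filter(|w| is_acgt_only(w)).collect()
}

fn main() {
    // Every window overlapping the N is dropped; only the two ACGT windows remain.
    for kmer in valid_kmers(b"ACGTNACGT", 4) {
        println!("{}", String::from_utf8_lossy(kmer));
    }
}
```

In the loop from the question, the same check could be applied to each `km` before it is passed to `Kmer::from`, so that the 2-bit encoder never sees an N.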

Keats commented 2 years ago

That's what we do as well.