onecodex / finch-rs

A genomic minhashing implementation in Rust
https://www.onecodex.com
MIT License
93 stars 8 forks source link

u64 integer with u16 length limit results in large files having vector overflow error #63

Closed NateBaileyTempus closed 1 year ago

NateBaileyTempus commented 1 year ago

Issue when running large files through finch u64 vector is constrained by u16 length of 65536 resulting in the error:

thread 'main' panicked at 'index out of bounds: the len is 65536 but the index is 126497'

suggested fix is to increase the length of this vector u32 would do for all file types and file sizes

pub fn hist(sketch: &[KmerCount]) -> Vec<u64> { let mut counts = vec![0u64; 65536]; let mut max_count: u64 = 0; for kmer in sketch { max_count = cmp::max(max_count, u64::from(kmer.count)); counts[kmer.count as usize - 1] += 1; } counts.truncate(max_count as usize); counts }

Keats commented 1 year ago

Hey Nate, can you have a look at https://github.com/onecodex/finch-rs/pull/64?

Hope you're doing well!

NateBaileyTempus commented 1 year ago

Oh! HashMap probably a way better solution! Doing good, will check it out and let you know if it runs without error!

Keats commented 1 year ago

It will be slower than the current code though, we don't have benchmarks sadly :/

NateBaileyTempus commented 1 year ago

This new branch worked like a champ on the files that failed with the previous iteration! I had four files fail with the same error but with different vector lengths over the 65k and this time they all worked!