jeff-k / bio-seq

Bit packed and well-typed biological sequences
MIT License
21 stars 2 forks source link

Using bio-seq for sequence (de)serialisation #6

Closed J-Wall closed 1 month ago

J-Wall commented 3 months ago

I'm hoping to use bio-seq for bit-packed (de)serialisation of sequences. AFAICT this is not supported because there are no methods for converting between Seq<A: Codec> and Vec<usize> or similar. Even impl<A: Codec> TryFrom<Vec<u8>> for Seq<A> expects the u8s to be ascii characters, and not the bit-packed representation.

I'm tempted to try implementing TryFrom<Vec<usize>> and Into<Vec<usize>> for Seq<A: Codec>, which reads/returns the bitpacked representation. Or maybe more simply to/from BitVec. Would you be interested in that as a contribution (I'm not wedded to any particular API for this)?

jeff-k commented 2 months ago

Yeah, I've played with exposing the internal bitvecs at various points. Being able to get Vec<usize> is appealing but can lead to ugly issues with padding sequences that aren't perfectly aligned with 64-bits. I think exposing the bitvecs is alright, though, and doesn't set up any potential endianess issues.

I did throw in a feature gated serde derivation for Kmers. Ideally, we could also extend the serde implementation to Seq with this solution.

jeff-k commented 2 months ago

I threw in an implementation from Seq to BitVec as well as a feature-gated serde implementation. I think the best way to get at the Vec<usize> is through these BitVecs.

With serde you can save sequences with something like:

    let mut fa: bio_streams::fasta::Fasta<BufReader<MultiGzDecoder<File>>, Seq<Dna>> = bio_streams::fasta::Fasta::new(
        BufReader::new(MultiGzDecoder::new(File::open(path).unwrap())),
    );

    if let Some(record) = fa.next() {
        let data = bincode::serialize(&record.unwrap().seq)?;
        let mut file = File::create("seq.bin")?;
        file.write_all(&data)?;
    }

and enabling serde in Cargo.toml with:

bio-seq = { git = "https://github.com/jeff-k/bio-seq", features = ["serde"] }
jeff-k commented 1 month ago

Hey @J-Wall , I'm not 100% sure about this but here's an experimental way to expose and reconstruct from the raw &[usize] :

    let s = "TCAGCTAGCTACGACTGATCGATCGACTGATGCCGCGCGCGGCGCCGCGCGCGCGCGCCGCGCGCCCCGCGCGCGGCGCGCGCCGCGCGCGCGCGCGGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC";

    let seq: Seq<Dna> = s.try_into().unwrap();

    let raw: &[usize] = seq.into_raw();

    println!("{raw:?}");

    let new = Seq::<Dna>::from_raw(s.len(), raw);

    assert_eq!(new, seq);

and the printed {raw:?} is:

[12885487321340546439, 11067145724286887525, 7378697574509078101, 11068046444225730970, 25]

The best method for serialisation would be serde in most cases but I think there's some good cases for viewing raw bits.

I'm going to bump the version a tiny bit so that this is available on crates.io