Closed J-Wall closed 1 month ago
Yeah, I've played with exposing the internal bitvecs at various points. Being able to get Vec<usize>
is appealing but can lead to ugly issues with padding sequences that aren't perfectly aligned with 64-bits. I think exposing the bitvecs is alright, though, and doesn't set up any potential endianess issues.
I did throw in a feature gated serde derivation for Kmers
. Ideally, we could also extend the serde implementation to Seq
with this solution.
I threw in an implementation from Seq
to BitVec
as well as a feature-gated serde implementation. I think the best way to get at the Vec<usize>
is through these BitVecs.
With serde you can save sequences with something like:
let mut fa: bio_streams::fasta::Fasta<BufReader<MultiGzDecoder<File>>, Seq<Dna>> = bio_streams::fasta::Fasta::new(
BufReader::new(MultiGzDecoder::new(File::open(path).unwrap())),
);
if let Some(record) = fa.next() {
let data = bincode::serialize(&record.unwrap().seq)?;
let mut file = File::create("seq.bin")?;
file.write_all(&data)?;
}
and enabling serde in Cargo.toml
with:
bio-seq = { git = "https://github.com/jeff-k/bio-seq", features = ["serde"] }
Hey @J-Wall , I'm not 100% sure about this but here's an experimental way to expose and reconstruct from the raw &[usize]
:
let s = "TCAGCTAGCTACGACTGATCGATCGACTGATGCCGCGCGCGGCGCCGCGCGCGCGCGCCGCGCGCCCCGCGCGCGGCGCGCGCCGCGCGCGCGCGCGGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC";
let seq: Seq<Dna> = s.try_into().unwrap();
let raw: &[usize] = seq.into_raw();
println!("{raw:?}");
let new = Seq::<Dna>::from_raw(s.len(), raw);
assert_eq!(new, seq);
and the printed {raw:?}
is:
[12885487321340546439, 11067145724286887525, 7378697574509078101, 11068046444225730970, 25]
The best method for serialisation would be serde in most cases but I think there's some good cases for viewing raw bits.
I'm going to bump the version a tiny bit so that this is available on crates.io
I'm hoping to use bio-seq for bit-packed (de)serialisation of sequences. AFAICT this is not supported because there are no methods for converting between
Seq<A: Codec>
andVec<usize>
or similar. Evenimpl<A: Codec> TryFrom<Vec<u8>> for Seq<A>
expects theu8
s to be ascii characters, and not the bit-packed representation.I'm tempted to try implementing
TryFrom<Vec<usize>>
andInto<Vec<usize>>
forSeq<A: Codec>
, which reads/returns the bitpacked representation. Or maybe more simply to/fromBitVec
. Would you be interested in that as a contribution (I'm not wedded to any particular API for this)?