Open averagehat opened 8 years ago
Yep, I think I originally used vectors for ease of subvec'ing intervals etc., but I did a few tests and it looks better to store the sequence as a string. I've bumped the version to 4.3 and the sequence is now stored as a string.
There is also a lot of sequence checking that goes on, so if you define your fa-file with an 'unchecked' alphabet you can avoid all that, provided you're sure your sequences are correct or you can live with spaces etc.
So just round-tripping some sequences to file:
user> (def tf (init-fasta-file "/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/trinity-estscan-nucl.fasta" :iupacNucleicAcids))
#'user/tf
user> (time (with-open [r (bs-reader tf)]
              (biosequence->file (biosequence-seq r)
                                 "/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/test-out.fasta"
                                 :append false
                                 :func #(fasta-string % false))))
"Elapsed time: 4272.02027 msecs"
"/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/test-out.fasta"
user> (def tf (init-fasta-file "/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/trinity-estscan-nucl.fasta" :uncheckedDNA))
#'user/tf
user> (time (with-open [r (bs-reader tf)]
              (biosequence->file (biosequence-seq r)
                                 "/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/test-out.fasta"
                                 :append false
                                 :func #(fasta-string % false))))
"Elapsed time: 447.38576 msecs"
"/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/test-out.fasta"
user>
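That is roughly a tenfold difference (about 4272 ms versus 447 ms) on the same file. To illustrate where that kind of cost can come from, here is a minimal sketch of per-character validation versus a pass-through. It is not the library's actual checking code: `dna-alphabet`, `checked-seq` and `unchecked-seq` are hypothetical names made up for this example.

```clojure
;; Hypothetical sketch only; not clj-biosequence's implementation.

;; A toy alphabet (assumed for illustration): a set of allowed characters.
(def dna-alphabet (set "ACGTacgt"))

(defn checked-seq
  "Validates every character of s against alphabet and throws on the
  first character that is not allowed. Work grows with sequence length."
  [alphabet s]
  (doseq [c s]
    (when-not (contains? alphabet c)
      (throw (ex-info "Illegal character in sequence" {:char c}))))
  s)

(defn unchecked-seq
  "Returns s untouched: no per-character work, but spaces and other
  stray characters pass straight through."
  [s]
  s)
```

With these definitions, `(checked-seq dna-alphabet "AC GT")` throws because of the space, while `(unchecked-seq "AC GT")` returns the string as-is, which is the trade-off described above.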
Right now reading a fasta file returns a map wherein the sequence is a vector of characters rather than a string. For my purposes I need to rejoin the sequence (at which point I can dump it right into my database), like so:
This requires a lot of extra computation. I'd like a way to retrieve the sequence as a string when parsing a file (maybe this already exists and I just don't see it?).
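For illustration, rejoining a vector of characters into a string can be sketched as below. The map shape and the `:sequence` key are assumptions for this example, not necessarily the record layout the parser returns, and `sequence-string` is a hypothetical helper.

```clojure
(require '[clojure.string :as string])

;; Hypothetical record shape; the :sequence key is an assumption.
(def record {:accession "seq1"
             :sequence  [\A \C \G \T]})

(defn sequence-string
  "Joins the characters stored under :sequence into a single string."
  [rec]
  (apply str (:sequence rec)))

;; clojure.string/join with no separator does the same job.
(defn sequence-string-join
  [rec]
  (string/join (:sequence rec)))
```

Either form walks the whole vector once per record, so on a large file this is extra work on top of parsing, which is why having the parser hand back a string directly would help.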