s312569 / clj-biosequence

A Clojure library designed to make the manipulation of biological sequence data easier.

Sequence as plain string #31

Open averagehat opened 8 years ago

averagehat commented 8 years ago

Right now reading a fasta file returns a map wherein the sequence is a vector of characters rather than a string. For my purposes I need to rejoin the sequence (at which point I can dump it right into my database), like so:

(with-open [r (bs-reader fa-file)]
  (print
   (map #(assoc % :sequence (str/join (:sequence %)))
        (biosequence-seq r))))

This requires a lot of extra computation. I'd like a way to retrieve the sequence as a string directly when parsing a file (maybe this already exists and I'm not seeing it?).

s312569 commented 8 years ago

Yep, I think I originally used vectors for ease of subvec'ing intervals etc., but I did a few tests and it might be better to store the sequence as a string. I've bumped the version to 4.3 and the sequence is now stored as a string.
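For comparison, here is a minimal sketch of taking an interval from each representation using only core Clojure functions. The names `as-vector` and `as-string` are made up for illustration and are not part of clj-biosequence.

```clojure
;; Illustrative values only; not taken from the library.
(def as-vector (vec "ACGTACGT"))   ; sequence as a vector of characters
(def as-string "ACGTACGT")         ; sequence as a plain string

;; Interval from index 2 (inclusive) to 6 (exclusive) in each form.
(def vector-interval (subvec as-vector 2 6))   ; a vector of four characters
(def string-interval (subs as-string 2 6))     ; a four-character string
```

Both calls take a start and an exclusive end index, so interval code written against `subvec` should translate to `subs` fairly directly.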

There is also a lot of sequence checking that goes on, so if you define your fa-file with an 'unchecked' alphabet you can avoid all that, provided you're sure your sequences are correct or can live with spaces etc.
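As a rough illustration of why a checked alphabet costs more, the sketch below validates every character against an allowed set, while the unchecked path returns the string untouched. This is a generic example and not clj-biosequence's implementation: `allowed-chars`, `check-sequence` and `read-sequence` are hypothetical names, and throwing on an illegal character is an assumption about how a check might behave.

```clojure
(require '[clojure.string :as str])

;; Hypothetical alphabet: IUPAC nucleotide codes plus a gap character.
(def allowed-chars (set "ACGTURYSWKMBDHVN-"))

(defn check-sequence
  "Returns s upper-cased if every character is in allowed-chars,
  otherwise throws. Touches every character of the sequence."
  [s]
  (let [u (str/upper-case s)]
    (if (every? allowed-chars u)
      u
      (throw (ex-info "Illegal character in sequence" {:sequence s})))))

(defn read-sequence
  "Checks s when checked? is true; otherwise returns it as-is,
  spaces and all."
  [s checked?]
  (if checked? (check-sequence s) s))
```

With this sketch, `(read-sequence "AC GT" false)` hands back the string including the space, whereas the checked call on the same input throws.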

So just round-tripping some sequences to file:

user> (def tf (init-fasta-file "/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/trinity-estscan-nucl.fasta" :iupacNucleicAcids))

#'user/tf
user> (time (with-open [r (bs-reader tf)]
              (biosequence->file (biosequence-seq r)
                                 "/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/test-out.fasta"
                                 :append false
                                 :func #(fasta-string % false))))
"Elapsed time: 4272.02027 msecs"
"/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/test-out.fasta"

user> (def tf (init-fasta-file "/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/trinity-estscan-nucl.fasta" :uncheckedDNA))
#'user/tf
user> (time (with-open [r (bs-reader tf)]
              (biosequence->file (biosequence-seq r)
                                 "/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/test-out.fasta"
                                 :append false
                                 :func #(fasta-string % false))))
"Elapsed time: 447.38576 msecs"
"/home/jason/Dropbox/jellydb/resources/test-data/trinity-assembly/test-out.fasta"
user>