s312569 / clj-biosequence

A Clojure library designed to make the manipulation of biological sequence data easier.

Curious why you didn't use instaparse for parsing #28

Open metasoarous opened 9 years ago

metasoarous commented 9 years ago

Just a question: instaparse is spectacular, and I'm wondering why you haven't been using it. I've found that when building tree parsers, my instaparse solution was vastly simpler than a "from scratch" solution. It supports laziness, and is supposed to be pretty fast.

EDIT: I mean specifically for FASTA; obviously, for XML-based formats, XML parsers make the most sense.

s312569 commented 9 years ago

Ignorance basically :) I'll go and familiarise myself with it as you are right that from scratch is a pain in the proverbial.

averagehat commented 8 years ago

I'm interested in this, because I'm working on a ClojureScript app which requires FASTA parsing, and instaparse has a cljs fork. Basically,

(require '[instaparse.core :as insta])

; wrapping an element in <> excludes it from the parse tree
; spaces represent concatenation, everything else is like regex

; This parser allows a sequence to have arbitrary newlines in it but no other whitespace.
; Naturally, this verifies the alphabet automatically.
; A regex which does not capture the newline might be better than the sequence rule,
; because it wouldn't require joining the string later, but regexes are greedy.
(def raw-parse (insta/parser
  "file : (record <'\n'>)* record <'\n'?>
  record : id <'\n'> sequence
  id : <'>'>  #'[^\n]+'
  sequence : ('A' | 'G' | 'C' | 'T' | <'\n'>)+"))
(raw-parse 
">foo
ACTGATG
ACTGATG" )
;yields:
[:file [:record [:id "foo"] [:sequence "A" "C" "T" "G" "A" "T" "G" "A" "C" "T" "G" "A" "T" "G"]]]

Full example: https://gist.github.com/averagehat/5d6a94b0d052df8b6e1b

metasoarous commented 8 years ago

Yup; it's really pretty straightforward and lovely to use. I started writing a Newick tree parser for this library as well, but never got around to committing it. Maybe I'll look at that again when I'm not busy raising a 7-month-old.

s312569 commented 8 years ago

I remember playing around with instaparse around the time of the original comment, and it is awesome, but of course I have completely forgotten the results of that now.

Just to make sure we are on the same page (and I may be wrong here), I thought that instaparse doesn't support readers? In which case the entire file would need to be read into a string before parsing. Some of the FASTA files I routinely work on are pretty big, and currently sequences are pulled lazily from a reader and shoved into a fasta record, which is then 'cleaned' individually.

If this is true, I could use instaparse to enforce alphabets etc. when 'cleaning' individual sequences, but implementing it at the file level would break lazy streaming of sequences from a file?
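If it is, a rough sketch of the hybrid approach might look something like this (names like fasta-records and example.fasta are just illustrative, not the library's current API): stream records lazily from a reader with line-seq, then run each sequence string through a tiny instaparse grammar when 'cleaning':

(require '[clojure.java.io :as io]
         '[instaparse.core :as insta])

; per-sequence grammar: a single regex rule that enforces the alphabet
(def seq-check
  (insta/parser "sequence : #'[ACGTNacgtn]+'"))

(defn fasta-records
  "Lazily turns the lines of rdr into maps of :description and :sequence.
  Assumes well-formed input with one '>' header line per record."
  [rdr]
  (->> (line-seq rdr)
       (partition-by #(.startsWith ^String % ">"))
       (partition 2)
       (map (fn [[[header] seq-lines]]
              {:description (subs header 1)
               :sequence    (apply str seq-lines)}))))

(defn clean-record
  "Validates a single record's sequence; attaches the failure if it doesn't parse."
  [{:keys [sequence] :as record}]
  (let [result (seq-check sequence)]
    (if (insta/failure? result)
      (assoc record :error result)
      record)))

(with-open [rdr (io/reader "example.fasta")]
  (doall (map clean-record (fasta-records rdr))))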

metasoarous: contributions are more than welcome, but a 7-month-old is a handful!

metasoarous commented 8 years ago

@Engelberg - Perhaps you can comment here? I couldn't find anything in a cursory scan of the docs about parsing from a reader; is this possible? A while back I had read "Optionally produces lazy sequence of all parses", and assumed this was possible, but I realize this could still be true without operating on the input string/reader lazily.

My goal is to get said 7 mo up to speed quickly so I have help working on things. Progress on tmux has been pretty good so far. Next up, vim...

Engelberg commented 8 years ago

Instaparse supports strings and CharSequences, but not readers. Because of its sophisticated backtracking strategy and handling of ambiguity, it's not really a "read one character at a time" kind of parser. It moves forward and back through everything it has seen so far. You could do some hack to make it work from a reader, but effectively it's going to end up caching the whole thing in memory at some point anyway. Also, instaparse is fairly memory hungry, so if you're dealing with files that you can't slurp into memory as a string, you certainly won't have enough memory to parse it.

So yes, I agree with the analysis that if these are enormous files, you wouldn't want to use instaparse at the file level, but if you have some way to break it into smaller chunks (like, say, <20k), it could potentially shine there.

metasoarous commented 8 years ago

Thanks Mark; that makes sense, and jibes with my understanding of instaparse (I watched and enjoyed your talk a while back on the data-flow approach).

Parsing FASTA files is really rather trivial, so it may not be worth the effort of chunking and then processing, unless you want to get very granular about keeping information at the level of base pairs (nucleotide? amino acid? ambiguous? case?). For GenBank, though, I could see that approach being handy.

For Newick tree files, chunking should also be a decent strategy. I usually see multiple trees separated by newlines, but I'm not sure if that's a de facto or de jure standard, or in general whether a line-break in the middle of a tree specification is tolerated, as it is with FASTA. Some more research to do there...
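In the meantime, here's a toy instaparse grammar for Newick, just to see how compact it is — purely a sketch of my own (it ignores quoted labels and bracketed comments from the full spec, and isn't the parser I'd started):

; assumes one ';'-terminated tree per statement, with optional whitespace between trees
(def newick-parse
  (insta/parser
    "trees    : (tree <#'\\s*'>)+
     tree     : subtree <';'>
     subtree  : internal | leaf
     internal : <'('> subtree (<','> subtree)* <')'> label? length?
     leaf     : label? length?
     label    : #'[^,():;\\s]+'
     length   : <':'> #'[0-9.eE+-]+'"))

(newick-parse "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;")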

averagehat commented 8 years ago

One nice thing about the CFG method is that you get [nucleotide] alphabet detection and validation for free. This is particularly nice in the case of quality encodings: https://gist.github.com/averagehat/802af2bd08561f27bcda
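A minimal illustration of that idea (a toy example of mine, not the code in the gist) — with ordered choice the parse tree itself tells you which alphabet matched:

; ordered choice (/) prefers the stricter DNA rule over the protein rule
(def typed-parse
  (insta/parser
    "sequence : dna / protein
     dna      : #'[ACGTNacgtn]+'
     protein  : #'[A-Za-z*]+'"))

(typed-parse "ACGTACGT") ; => [:sequence [:dna "ACGTACGT"]]
(typed-parse "MKVLITGA") ; => [:sequence [:protein "MKVLITGA"]]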

I'm not sure what the performance overhead would be.

As for laziness, you should be able to chunk with or without newline semantics. Using the :total parameter, you can stop your parsing in the middle of a sub-tree:

(raw-parse
">EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
GTTTC
>foo" :total true)

[:file [:record [:id "EAS54_6_R1_2_1_443_348"] [:sequence "G" "T" "T" "G" "C" "T" "T" "C" "T" "G" "G" "C" "G" "T" "G" "G" "G" "T" "G" "G" "G" "G" "G" "G" "G" "G" "T" "T" "T" "C"]] [:record [:id "foo"] [:sequence]]]

You could then concat the leftovers with the next incoming chunk (or vice versa). This isn't what :total is intended for, though. Actually, from the instaparse docs, I expected to see [:instaparse/failure ">foo"] (which could be prepended to the next chunk).
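FWIW, here's a rough sketch of the chunk-and-carry idea (my own illustration — it cuts each chunk at the last '>' instead of relying on :total, and assumes '>' only ever appears at the start of a header line):

; read fixed-size chunks from a reader, cut each chunk at the last '>',
; parse the complete part with raw-parse (defined above), and carry the
; remainder into the next chunk
(defn parse-in-chunks [^java.io.Reader rdr chunk-size]
  (let [buf (char-array chunk-size)]
    (loop [carry "" trees []]
      (let [n (.read rdr buf 0 chunk-size)]
        (if (neg? n)
          (if (seq carry) (conj trees (raw-parse carry)) trees)
          (let [text (str carry (String. buf 0 n))
                cut  (.lastIndexOf text ">")]
            (if (pos? cut)
              (recur (subs text cut) (conj trees (raw-parse (subs text 0 cut))))
              (recur text trees))))))))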