gkappler / CombinedParsers.jl

Compiled parser combinators and regular expressions in pure julia
MIT License
77 stars 10 forks source link

Interop with Automa.jl #28

Open jakobnissen opened 3 years ago

jakobnissen commented 3 years ago

Hello Gregor! I'm the current maintainer of Automa.jl (by accident, I didn't write it, but ended up maintaining it, for now). You mentioned potential integration/collaboration between Automa and CombinedParsers (CP) and I think that sounds like an excellent idea!

The core limitation of Automa is that is uses FSMs, and there are many formats that just can't be parsed using FSMs - JSON, Newick, Julia code, and so on. There has been a longstanding desire among some users of Automa to create an Automa-like package for pushdown automata in order to parse context-free grammars. But it seems no-one who wants it has the skills and knowledge to implement it, so CP is probably the best bet for a good parsing package for that use case. I'll try out CP in the near future and see if it's a good fit for the BioJulia community :)

As I see it, CP and Automa are complementary: Automa is faster and more generic (in that it executes arbitrary code), but its limited to regular grammars, and its user experience is abysmal. I'm slowly pushing towards Automa v1.0, where I want to improve the interface: Make it simpler to use, have better debugging and so on.

What did you envision with integration? Perhaps CP can leverage Automa for some of its codegen? Automa could perhaps be automatically used to create smaller functions that CP can call when parsing. Like, CP could automatically find smaller regular patterns, and create mini-parsers using Automa, which it then calls itself.

While this should be possible in principle, and should make CP faster, I don't know if it's possible practically. Let's see.

If you need any information about Automa internals, let me know. For what it's worth, Automa does sort of support UTF-8, actually. It views non-ascii characters as a pattern of concatenated bytes.

Looking forward to hearing from you!

gkappler commented 3 years ago

Hi Jakob,

thanks for reaching out! I tried Automata for my purposes, but since my need really was user experience first, I decided to dive into julia compiler optimization with the CombinedParsers project.

I would love if Automa.jl could compose with CombinedParsers.jl! I have thought a bit about it and see 2 principle ways:

  1. Exactly like you wrote: Having CP parsing first (outside) using Automa on inner parts.
  2. the other way around, if you see any way for doing that?

Regarding route 1 it should be possible to determine parts of CombinedParsers that can be expressed as Automa FST and generating code with deepmap_parser transformations (which is not result transformation, but used for logging and reversed Lookbehinds). (I took a bit longer with the reply as to push cleaned up/refactored revision of that code. Update release follows soon. It had lots of hacked together dispatch cruft. Making the package PCRE compatible lead to a lot of bottom-up spaghetti development. I still feel the urge ot clean up and decompose fruitful abstractions as the package matures.)

Regarding code-gen, CP fully relies on julia @generated functions and compiler optimization. But the CP representation could also be used for code-gen of Automa. That way one could use CP syntax for user experience. I am interested in your experiences with CP and your requirements!

Regarding non-context-free grammars, CombinedParsers provides FlatMap, That piece is not optimized with @generated functions yet, so CP will fall back considerably compared to performance you are used with Automa. It would be great to find resources to optimize that part! Also, I stuck with the fastparse scala name FlatMap, that I am not perfectly happy with.

I was looking into the use of stacks in parsers, also for handling left-recursion in ENBF. Method dispatch opens up quite a number of options, and so far I went with Dictionaries instead of a stack: Both require mem allocation anyways and Dict allows lookup of states accross branches.

What do you think? What are your requirements? As time budget allows, I will gladly help!

I am looking forward to your thoughts! (I corrected my mistake about UTF8 in my manual, my apologies!)