haskell / happy

The Happy parser generator for Haskell
Other
279 stars 84 forks source link

`Fin n` for array indexes? #313

Open Ericson2314 opened 4 days ago

Ericson2314 commented 4 days ago

Happy currently has a lot of "untyped" array indexing, where nothing stops one using an index for the wrong of e.g. tokens, non-terminals, terminals.

I haven't done finite sets in Haskell for a while, so not sure the state of things these days (still KnownNat and friends I suppose?), but I wonder if we should try using them. It would be a lot less scary to refactor the code if these indexes wouldn't be confused for one another.

sgraf812 commented 4 days ago

What utility would that offer? It certainly would incur a high complexity toll on the code base.

Ericson2314 commented 3 days ago

Well have you seen this https://github.com/haskell/happy/blob/4b59fb3ed20ff92b6f2b5ed1c9df81abe007603c/lib/grammar/src/Happy/Grammar.lhs#L120-L142 ?

in particular

In hindsight, this was probably a bad idea.

Makes me chuckle. Fin would remove a major cost from this, as Fin (a + b + c) won't get mixed up with Fin a, Fin b, or Fin c.

My trying to turn Name into a mere newtype in https://github.com/haskell/happy/pull/223 and somehow getting <<loop>> also makes me think the current code is quite fragile.

Ericson2314 commented 3 days ago

Also, it gives us a way to pull some metadata out of band. I am thinking of something like

data Grammar eliminator
      = Grammar {
              productions       :: [Production eliminator],
              lookupProdNo      :: Int -> Production eliminator,
              lookupProdsOfName :: Name -> [Int],
              token_specs       :: [(Name, TokenSpec)],
              terminals         :: [Name],
              non_terminals     :: [Name],
              starts            :: [(String,Name,Name,Bool)],
              types             :: Array Int (Maybe String),
              token_names       :: Array Int String,
              first_nonterm     :: Name,
              first_term        :: Name,
              eof_term          :: Name,
              priorities        :: [(Name,Priority)]
       }

to

type Name nStarts nTerms nNonTerms = Fin (4 + nStarts + nTerms + nNonTerms)

data Production = Production
      (Fin nNonTerms)
      [Name nStarts nTerms nNonTerms]
      -- IIRC these are indices of the previous list, 
      -- so could do more dep types 
      [Word]
      Priority

data Grammar nStarts nTerms nNonTerms
      = Grammar {
              -- No longer need eliminator type parameter
              productions       :: Vector nProds (Production nStarts nTerms nNonTerms),
              lookupProdsOfName :: Map (Fin nNonTerms) (Set (Fin nProds)),
              starts            :: Vector nStarts (Name, Bool),
              priorities        :: Map (Fin nNonTerms) Priority
       }

which is the core formal language part (?)

and

data GrammarExtras nStarts nTerms nNonTerms
    = GrammarExtras {
              prod_elims        :: Vector nNonTerms Text,
              token_elims       :: Vector nTerms TokenSpec,
              start_names       :: Vector nStarts Text,
              start_names       :: Vector nStarts Text,
              types             :: Vector nTerms (Maybe Text),
              token_names       :: Vector nTerms Text,
    }

has all the extras bits for human legibility and/or codegen, with guarantees from the type parameters that everything has the right arities.

Ericson2314 commented 3 days ago

From what I remember, I don't like the overhead of dependent stuff in Haskell either, but I am not sure how else lock down the data structures without changing around the terms more.

For example, one can do zip all lists/arrays of the same arity to ditch the coherence conditions, and use "expansion points" as needed to factor out the extra info", and that works pretty well, but the various index types we need to store (because it's a graph, at the end of the day) are not helped by this.

sgraf812 commented 3 days ago

In hindsight, this was probably a bad idea.

Makes me chuckle. Fin would remove a major cost from this, as Fin (a + b + c) won't get mixed up with Fin a, Fin b, or Fin c.

But you ultimately need Fin (a + b + c) because they all index the same array.

I still don't see the utility of using Fin n specifically. Why not SpecialName, TerminalName, NonterminalName? You would still need to know how many of each sort there are in order to compute array indices, but that should work just fine. It's pretty easy to check that no Name ever generated is out of bounds at generation time; no need to prove it in the type system (which would be very awkward to do).

Ericson2314 commented 1 day ago

Fair enough, I was thinking of https://github.com/haskell/happy/pull/223 as a stepping stone already, and you are rightly pointing out there are more newtypes we can create as further stepping stones.