github / semantic

Parsing, analyzing, and comparing source code across many languages

Question: Why were the graph and diff options removed? #662

Closed Symbolk closed 2 years ago

Symbolk commented 2 years ago

Just out of curiosity: I found that a commit on Jun 12, 2020 updated the README and removed the semantic diff and graph options, as well as the dot output. This seems like a regression, so I am wondering why. Does it mean they are no longer needed, or have they been moved to other options?

Symbolk commented 2 years ago

Screenshot_20211117-223931_Chrome.jpg

patrickt commented 2 years ago

Hey there! This library is currently in flux. The graph and diff algorithms were based on syntax trees using untyped, open sums, with syntax types generalized across languages, of the sort where

data Expr a = ...
data Lit a = ...
data Stmt a = ...

type YourLanguageHere ann = Sum '[Expr, Lit, Stmt] ann

This made it very easy to write operations that fold over trees, like conversion to JSON, with algebras:

class JSONAlg f where jsonAlg :: f Object -> Object

instance JSONAlg Expr ...
instance JSONAlg Lit ...
instance JSONAlg Stmt ...

yourLanguageToJSON :: YourLanguageHere Pos -> Object
yourLanguageToJSON = Sum.apply (cata jsonAlg) -- or something like this, I don’t remember exactly
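To make the pattern concrete, here is a minimal, self-contained sketch of that algebra style, with a hand-rolled two-alternative sum standing in for fastsum's Sum and a local Fix/cata standing in for recursion-schemes (all names here are illustrative, not semantic's actual API):

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Local stand-ins for recursion-schemes' fixpoint and catamorphism.
newtype Fix f = Fix (f (Fix f))

cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg (Fix f) = alg (fmap (cata alg) f)

-- Two syntax functors, generalized across languages.
data Lit a = IntLit Int deriving Functor
data Expr a = Add a a deriving Functor

-- A closed two-way sum standing in for fastsum's `Sum '[Expr, Lit]`.
data OurSum f g a = InL (f a) | InR (g a) deriving Functor

type MiniLang = Fix (OurSum Expr Lit)

-- The per-functor algebra class, in the style of JSONAlg.
class ShowAlg f where
  showAlg :: f String -> String

instance ShowAlg Lit where
  showAlg (IntLit n) = show n

instance ShowAlg Expr where
  showAlg (Add l r) = "(" ++ l ++ " + " ++ r ++ ")"

-- Dispatching over the sum is one instance, so adding a syntax functor
-- only requires one new ShowAlg instance.
instance (ShowAlg f, ShowAlg g) => ShowAlg (OurSum f g) where
  showAlg (InL f) = showAlg f
  showAlg (InR g) = showAlg g

render :: MiniLang -> String
render = cata showAlg

example :: MiniLang
example = Fix (InL (Add (Fix (InR (IntLit 1))) (Fix (InR (IntLit 2)))))
```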

Similarly, diffing was based on passing a Diff type as the type parameter to these syntaxes.
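A rough, from-memory sketch of that idea (not semantic's exact types, which also carried annotations): the syntax functors are instantiated at a Diff type, so a diff is a tree that either merges unchanged structure or records a patched subtree.

```haskell
-- Illustrative sketch only; semantic's real Diff/Patch types differed.
data Patch a = Inserted a | Deleted a | Replaced a a

newtype Term syntax = Term (syntax (Term syntax))
newtype Diff syntax = Diff (Either (Patch (Term syntax)) (syntax (Diff syntax)))

-- A toy syntax functor to exercise it:
data Syn a = Leaf Int | Node a a

-- Count the patched subtrees in a diff.
patches :: Diff Syn -> Int
patches (Diff (Left _))           = 1
patches (Diff (Right (Leaf _)))   = 0
patches (Diff (Right (Node l r))) = patches l + patches r

-- A node whose left child was inserted and whose right child is unchanged:
exampleDiff :: Diff Syn
exampleDiff = Diff (Right (Node
  (Diff (Left (Inserted (Term (Leaf 1)))))
  (Diff (Right (Leaf 2)))))
```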

So why didn’t we continue this? A few reasons:

Firstly, writing data types by hand for all possible syntax types in a given language was really tedious. Furthermore, it was not clear when to use generalized syntax types and when to use language-specific ones. Consider that in Ruby, super has different semantics depending on whether or not arguments are passed to it. Should we have reused our standard Super syntax, or created a Ruby-specific one for the zero-argument case? We went with the latter, but that kind of defeated the purpose of having generalized syntax types at all. Finally, GHC did not like dealing with anonymous sum types that had hundreds of possible alternatives. We had to write our own fastsum library, which basically just unrolls loops in the type system, for compiles to even finish.
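The super dilemma can be pictured with two hypothetical syntax functors (illustrative names only, not semantic's actual modules): a generalized one whose children are the argument expressions, and a Ruby-specific one for the zero-argument form, which implicitly forwards the enclosing method's arguments and therefore grows Ruby's sum by one more alternative.

```haskell
-- Generalized, cross-language syntax functor for super calls;
-- the children are the argument expressions.
newtype Super a = Super [a]

arity :: Super a -> Int
arity (Super args) = length args

-- Ruby-specific zero-argument form with different semantics.
-- Every such special case adds an alternative to the language's sum,
-- e.g. (hypothetically) Sum '[Super, ZSuper, ...] ann.
data ZSuper a = ZSuper
```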

Secondly, maintenance was extremely painful. Types do not come over the wire from tree-sitter in this open-sum format; they come in via the C FFI, which means we had to maintain compatibility layers called “assignments” that munged the C types into an open sum. These assignments were extremely complicated, tremendously fragile in the face of changes to tree-sitter grammars (and grammars change often), and painful to debug. Thirdly, an approach to syntax of the form

someExprFunction :: Lit (Sum '[Expr, Lit, Stmt] Pos) -> Whatever

is fundamentally untyped. If this hypothetical language disallowed statements as a kind of literal (as many do), that invariant is not reflected in the code. This means that operations on these syntaxes had to engage in constant casting and type-checking, none of which was checkable at compile time (beyond safe projection out of Sum). This sort of casting is really frustrating to write, because you can never be sure whether an operation will succeed, even when you know for a fact that it will, based on the structure of the syntax trees.
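A tiny hand-rolled sketch of what that constant casting looks like (a two-way sum again stands in for fastsum's n-ary Sum; projectLit plays the role of safe projection):

```haskell
data Lit a  = IntLit Int
data Stmt a = Return a

-- Closed two-way sum standing in for the open Sum.
data OurSum f g a = InL (f a) | InR (g a)

-- The only way to recover a Lit is a partial projection:
projectLit :: OurSum Lit g a -> Maybe (Lit a)
projectLit (InL l) = Just l
projectLit _       = Nothing

-- Even when the grammar guarantees this node is a literal, the types
-- don't know that, so every consumer threads a Maybe through:
litValue :: OurSum Lit g a -> Maybe Int
litValue s = case projectLit s of
  Just (IntLit n) -> Just n
  Nothing         -> Nothing
```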

So we now have a new formulation of syntax trees, generated automatically from grammar descriptions. These trees are not open sums, they’re normal Haskell datatypes, and their children are strictly typed rather than untyped. This fixes both our issues with such types. Unfortunately, it also opens some more cans of worms:

  1. Generic, polymorphic traversal of heterogeneous data types in Haskell is difficult. You can’t use recursion-schemes, at least not without a lot of pain and suffering (check the compdata library for how it’s done, though the error messages and compile times are painful), nor can you use Uniplate or Scrap Your Boilerplate, because a function like jsonAlg is polymorphic, and Typeable doesn’t support that kind of polymorphism. What we ended up doing is this:
class Traversable1 c t where
  -- | Map annotations of kind @*@ and heterogeneously-typed subterms of kind @* -> *@ under some constraint @c@ into an 'Applicative' context. The constraint is necessary to operate on otherwise universally-quantified subterms, since otherwise there would be insufficient information to inspect them at all.
  --
  -- No proxy is provided for the constraint @c@; instead, @-XTypeApplications@ should be used. E.g. here we ignore the annotations and print all the @* -> *@ subterms using 'Show1':
  --
  -- @
  -- 'traverse1' \@'Data.Functor.Classes.Show1' 'pure' (\ t -> t '<$' 'putStrLn' ('Data.Functor.Classes.showsPrec1' 0 t ""))
  -- @
  --
  -- Note that this traversal is non-recursive: any recursion through subterms must be performed by the second function argument.
  traverse1
    :: Applicative f
    => (a -> f b)
    -> (forall t' . c t' => t' a -> f (t' b))
    -> t a
    -> f (t b)
  default traverse1
    :: (Applicative f, Generic1 t, GTraversable1 c (Rep1 t))
    => (a -> f b)
    -> (forall t' . c t' => t' a -> f (t' b))
    -> t a
    -> f (t b)
  traverse1 f g = fmap to1 . gtraverse1 @c f g . from1

With this, and some very complicated Generics code, you can automatically derive generic traversals for syntax types. Unfortunately, compile times are bad, and it gets worse when we consider what we really want out of these data types. They look like this right now:

data Block a = Block
  { ann :: a,
    extraChildren :: ([AST.Parse.Err (Statement a)])
  }
  deriving stock (GHC.Generics.Generic, GHC.Generics.Generic1)
  deriving anyclass
    ( forall a_176.
      AST.Traversable1.Class.Traversable1 a_176
    )

Note that the Err functor wraps the child statements, since tree-sitter can give us partial results, and we want to be able to handle invalid syntax trees. This all works for transformations like computing tags or building JSON, as those don’t require any structural changes to the syntax types in the process of transformation. But this functor is too restrictive if we want to do something like diffing, which would need, essentially, to turn that Err functor into Either. We actually want a formulation of kind (* -> *) -> * -> *:

data Block f a = Block
  { ann :: a,
    extraChildren :: ([f (Statement a)])
  }

However, this means that we have to rewrite the whole Traversable1 hierarchy into something like HTraversable1, which can handle the shape-functor parameter. This has proven really, really difficult; the Generics manipulation required is extremely nontrivial, and none of the libraries like barbies are capable of handling the types contained in most syntax trees (where we build anonymous sums with :+:). So, in order to implement graphing or diffing, someone has to go in and rewrite that entire hierarchy, and frankly, it’s just hard as all hell. I’ve given it like five shots and still failed; generic programming is tough enough for regular types, much less types of kind (* -> *) -> * -> *. (Perhaps an easier solution would be generating the HTraversable1 instances with the codegen system, rather than trying to do it via Generics.)
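For what it's worth, here is one plausible (and much simplified) shape such a hierarchy could take, with the constraint parameter and the Generics derivation omitted, and a hypothetical ErrS standing in for AST.Parse.Err. None of this is semantic's actual code; the sketch compiles, but deriving it generically at scale is exactly where the difficulty lies.

```haskell
{-# LANGUAGE RankNTypes #-}

import Data.Functor.Identity (Identity (..))

-- Shape-functor-parameterized syntax types, as described above.
data Statement f a = Statement { stmtAnn :: a }

data Block f a = Block
  { ann           :: a
  , extraChildren :: [f (Statement f a)]
  }

-- The shape handler is higher-order: given a way to transform contents,
-- it transforms an f into a g.
class HTraversable1 t where
  htraverse1
    :: Applicative m
    => (a -> m b)                                  -- annotations
    -> (forall x y. (x -> m y) -> f x -> m (g y))  -- shape functor
    -> t f a
    -> m (t g b)

instance HTraversable1 Statement where
  htraverse1 fa _ (Statement a) = Statement <$> fa a

instance HTraversable1 Block where
  htraverse1 fa nat (Block a cs) =
    Block <$> fa a <*> traverse (nat (htraverse1 fa nat)) cs

-- A stand-in for tree-sitter's partial-result wrapper.
data ErrS a = Bad String | Good a

-- Turning the Err-like shape into Either, as diffing would need.
errToEither :: Applicative m => (x -> m y) -> ErrS x -> m (Either String y)
errToEither _ (Bad msg) = pure (Left msg)
errToEither k (Good x)  = Right <$> k x

relax :: Block ErrS a -> Block (Either String) a
relax = runIdentity . htraverse1 pure errToEither

-- Convenience accessor for inspecting results.
childAnns :: Block (Either String) a -> [Either String a]
childAnns = map (fmap stmtAnn) . extraChildren
```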

I hope this was helpful. Contributions welcome!

blamario commented 2 years ago

I've encountered the same problem with trees of heterogeneous nodes, which led me to develop deep-transformations and a few language-specific libraries on top of it. I won't call it the solution, but it doesn't have some of the problems you listed.

https://discourse.haskell.org/t/deep-transformations-and-more/1477