Better parsing for extracting autocomplete/hint options from source code

cdsmith commented 5 years ago

There's some existing code for finding type declarations in the current module, and augmenting the autocomplete list with them. @nixorn is extending that to his new doc-on-hover feature, as well. However, the extraction code isn't all that great right now. Specifically:

It only looks at one line at a time. This generates incorrect results when the type declaration is wrapped across lines.
It mis-parses some syntax. For example, pic1,pic2 :: Picture adds a single item called pic1,pic2, when it should add a separate item for each of the two pictures.
It would be nice to extend the parsing to recognize definition lines (like f x = sqrt x + 1). Type inference isn't needed here, but I think it would be great if that gave f an incomplete type in autocomplete and docs, like f :: ? -> ?. Of course, an explicit type signature should replace the incomplete definition with a complete one.
When extraction does succeed, it still leaves unnecessary formatting in place. So if the original source code says:

short                    :: Number
reallyReallyVeryVeryLong :: Number

the extracted doc for short will have all those spaces between short and ::, even though those extra spaces aren't sensible in the new context.
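To make the comma and whitespace cases concrete, here is a minimal sketch in JavaScript. The name parseSignature and the { name, doc } entry shape are made up for illustration and are not the existing CodeWorld code. It assumes the whole declaration, possibly spanning several lines, has already been gathered into one string, and it only handles plain lowercase names, not operators in parentheses.

```javascript
// Hypothetical helper, not the existing extraction code: turn one type
// signature such as "pic1,pic2 :: Picture" into one hint entry per name.
function parseSignature(decl) {
  const sep = decl.indexOf("::");
  if (sep < 0) return [];
  // Collapse runs of whitespace (alignment padding, line breaks) in the type.
  const type = decl.slice(sep + 2).trim().replace(/\s+/g, " ");
  if (type === "") return [];
  return decl
    .slice(0, sep)
    .split(",")
    .map((name) => name.trim())
    // Keep plain variable names only; anything else is silently skipped.
    .filter((name) => /^[a-z_][A-Za-z0-9_']*$/.test(name))
    .map((name) => ({ name, doc: `${name} :: ${type}` }));
}
```

With this, "pic1,pic2 :: Picture" yields two entries, and the padded short line above yields the doc string "short :: Number".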

One must probably resist the temptation to completely reimplement the Haskell parser in JavaScript, but I think enough special cases could go a long way.
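As one example of such a special case, the incomplete type for a definition line could be produced roughly like this. This is a sketch under assumptions: placeholderSignature, countArgs, and the { name, doc } shape are hypothetical names, only top-level lines that start in column 0 are considered, and every argument pattern (including a tuple like (x, y)) counts as a single ?.

```javascript
// Words that can start a line but never name a definition.
const RESERVED = new Set([
  "data", "type", "newtype", "class", "instance", "import", "module",
  "where", "let", "in", "if", "then", "else", "case", "of", "do",
  "infix", "infixl", "infixr", "deriving", "default", "foreign",
]);

// Count top-level argument patterns. Returns null on anything unexpected,
// such as operator symbols outside brackets (e.g. "a*&*b").
function countArgs(src) {
  let depth = 0;
  let count = 0;
  let inToken = false;
  for (const ch of src) {
    if (ch === "(" || ch === "[") {
      if (depth === 0) count++; // a bracketed pattern is one argument
      depth++;
    } else if (ch === ")" || ch === "]") {
      depth--;
      if (depth < 0) return null;
      inToken = false;
    } else if (depth > 0) {
      continue; // contents of a bracketed pattern are not inspected
    } else if (/\s/.test(ch)) {
      inToken = false;
    } else if (/[A-Za-z0-9_']/.test(ch)) {
      if (!inToken) count++;
      inToken = true;
    } else {
      return null;
    }
  }
  return depth === 0 ? count : null;
}

// Hypothetical helper: "f x = sqrt x + 1" becomes "f :: ? -> ?".
function placeholderSignature(line) {
  // Name, then argument text up to the first "=" (not "==" or "=>") or guard.
  const m = /^([a-z_][A-Za-z0-9_']*)([^=|]*)(?:=(?![=>])|\|)/.exec(line);
  if (!m || RESERVED.has(m[1])) return null;
  const argCount = countArgs(m[2]);
  if (argCount === null) return null;
  const type = Array(argCount + 1).fill("?").join(" -> ");
  return { name: m[1], doc: `${m[1]} :: ${type}` };
}
```

Entries produced this way would be the fallback: if an explicit type signature for the same name is found anywhere in the module, its entry should overwrite the placeholder.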

cdsmith commented 5 years ago

Just to record this, I did think about the possibility of using the GHC API on the server to extract type information. This would support any syntax corner case with no effort, and would even give fully inferred types, which is great. However, there are three reasons this wouldn't work so well:

It's a lot of work. Step 1, for instance, would be getting codeworld-base to build with plain GHC instead of GHCJS in the first place.
It's potentially expensive for the server. That means rate limiting, which means docs and autocomplete can get very out of date.
Worst of all, it only works when the whole module is syntactically correct. A single syntax error on the other side of the module would break autocomplete everywhere. I wouldn't consider this acceptable.

As a result, this is an idea that would, at best, be an occasional side benefit on top of the client-side implementation that we'd need anyway. Then the only huge benefit is the inferred types, since the rest of this should be handled on the client anyway. And inferred types are nice, but I'm actually kind of okay with students seeing question-marks in their autocomplete, and telling them they should just write a type declaration to see the whole type.

So I think this should just be rejected, and an implementation on the client should be the focus.

cdsmith commented 5 years ago

More examples to fix:

{- ==================
Some block comment
================== -}

The first line will match and add a hint for {-, thinking that it's being defined. Oops.
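One way to avoid this is to blank out comments before looking for declarations. The sketch below is an assumption about how that could be done, not existing code: stripComments is a made-up name, and it replaces comment text with spaces (keeping line breaks) so that line and column positions of the remaining code do not shift. Haskell block comments nest, so it tracks depth. It deliberately ignores two corner cases: comment markers inside string literals, and operators that merely begin with two dashes.

```javascript
// Hypothetical pre-pass: replace comments with spaces, preserving newlines
// so that positions in the output match positions in the input.
function stripComments(src) {
  let out = "";
  let depth = 0; // nesting depth of {- ... -} block comments
  let i = 0;
  while (i < src.length) {
    const two = src.slice(i, i + 2);
    if (two === "{-") {
      depth++;
      out += "  ";
      i += 2;
    } else if (two === "-}" && depth > 0) {
      depth--;
      out += "  ";
      i += 2;
    } else if (depth > 0) {
      out += src[i] === "\n" ? "\n" : " ";
      i++;
    } else if (two === "--") {
      // Line comment: blank everything up to, but not including, the newline.
      while (i < src.length && src[i] !== "\n") {
        out += " ";
        i++;
      }
    } else {
      out += src[i];
      i++;
    }
  }
  return out;
}
```

After this pass, the three lines of the block comment above are whitespace only, so a line-based matcher has nothing to mistake for a definition.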

Also:

a*&*b = a + 2 * b

This is actually a definition of the *&* operator, but will be misparsed as a definition for the symbol a*&*b.
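A rough sketch of how the operator case could be recognized follows. operatorDeclared is a hypothetical name, and the symbol set is my reading of the ASCII symbol characters in the Haskell report; Unicode symbols and backtick forms such as a `plus` b are not handled. It accepts the infix form with simple operands and the parenthesized prefix form, which also covers a signature line for the operator.

```javascript
// ASCII operator characters (assumed from the Haskell 2010 lexical syntax).
const SYM = "[!#$%&*+./<=>?@\\\\^|~:-]";
// A simple operand: a variable name or one parenthesized pattern.
const ARG = "(?:[a-z_][\\w']*|\\([^()]*\\))";

// Infix definition: <arg> <op> <arg> followed by "=" or a guard "|".
const INFIX_DEF = new RegExp(
  `^${ARG}\\s*(${SYM}+)\\s*${ARG}\\s*[=|](?!${SYM})`
);
// Prefix form at the start of a line: "(<op>) ..." (definition or signature).
const PREFIX_FORM = new RegExp(`^\\((${SYM}+)\\)`);

// Symbol sequences reserved by the language; these never name an operator.
const RESERVED_OPS = new Set(["=", "::", "|", "->", "<-", "=>", "..", "\\", "@", "~"]);

// Hypothetical helper: return the operator a top-level line declares, or null.
function operatorDeclared(line) {
  const m = PREFIX_FORM.exec(line) || INFIX_DEF.exec(line);
  if (!m || RESERVED_OPS.has(m[1])) return null;
  return m[1];
}
```

For the line above, operatorDeclared("a*&*b = a + 2 * b") returns "*&*", while an ordinary definition like f x = sqrt x + 1 returns null and can fall through to the normal handling.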

Definitions that are split across line breaks, like

f(x, y)
  = x + y

or

data
    Foo

These are missed because parsing only looks at one line at a time.
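A simple way around the one-line limit is to group lines into top-level declarations before matching anything: a line that starts in column 0 begins a new declaration, and indented lines belong to the declaration above. The sketch below assumes that layout convention holds for the user's code; splitDeclarations and the { line, text } shape are invented names.

```javascript
// Hypothetical helper: group source lines into top-level declarations.
// Each result records the 1-based line where the declaration starts.
function splitDeclarations(src) {
  const decls = [];
  src.split("\n").forEach((text, index) => {
    if (text.trim() === "") return; // blank lines carry no content
    const indented = /^\s/.test(text);
    if (indented && decls.length > 0) {
      // Continuation of the previous declaration.
      decls[decls.length - 1].text += "\n" + text;
    } else {
      decls.push({ line: index + 1, text });
    }
  });
  return decls;
}
```

Both examples above then arrive at the matcher as single units: "f(x, y)\n  = x + y" starting on its first line, and "data\n    Foo" starting on its own first line.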

cdsmith commented 5 years ago

Once we have a good industrial-strength parser in place, we should use it not only for parsing the user-entered code, but also for parsing the builtins from hoogle-format source. There would need to be some kind of pluggable documentation handling, because hoogle-format has docs already in HTML. By contrast, user-entered code is typically documented in plain text with some markup, and should also be augmented by the line and column where the definition occurs.

cdsmith commented 5 years ago

Another improvement that's needed is to add locally declared data constructors to the auto-complete list.
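For ordinary (non-GADT) declarations, a sketch of constructor extraction might look like the following. The names dataConstructors and splitTopLevel are hypothetical, the input is assumed to be one complete data declaration as a single string, and record syntax, strictness annotations, and infix constructors are out of scope here.

```javascript
// Split a string on separator characters that occur outside () and [].
function splitTopLevel(src, isSeparator) {
  const parts = [];
  let depth = 0;
  let current = "";
  for (const ch of src) {
    if (ch === "(" || ch === "[") depth++;
    if (ch === ")" || ch === "]") depth--;
    if (depth === 0 && isSeparator(ch)) {
      if (current !== "") parts.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  if (current !== "") parts.push(current);
  return parts;
}

// Hypothetical helper: "data Maybe a = Nothing | Just a" gives entries for
// Nothing and Just, each with a constructor type built from its fields.
function dataConstructors(decl) {
  const m = /^data\s+([A-Z][\w']*(?:\s+[a-z_][\w']*)*)\s*=([\s\S]*)$/.exec(decl);
  if (!m) return [];
  const resultType = m[1].replace(/\s+/g, " ");
  const body = m[2].replace(/\bderiving\b[\s\S]*$/, ""); // drop deriving clause
  return splitTopLevel(body, (ch) => ch === "|")
    .map((alt) => splitTopLevel(alt, (ch) => /\s/.test(ch)))
    .filter((tokens) => tokens.length > 0 && /^[A-Z][\w']*$/.test(tokens[0]))
    .map(([name, ...fields]) => ({
      name,
      doc: `${name} :: ${[...fields, resultType].join(" -> ")}`,
    }));
}
```

For example, data Shape = Circle Number | Rectangle Number Number would contribute Circle :: Number -> Shape and Rectangle :: Number -> Number -> Shape to the auto-complete list.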

nixorn commented 5 years ago

I hope it's possible to do something like this with PEG.js. I'm going to try.

nixorn commented 5 years ago

Test program https://gist.github.com/nixorn/a7b18badb8375e5d506a1542e7dd1937

cdsmith commented 5 years ago

One more comment on the test file. I teach data types using GADT syntax. So you'll also need to handle declarations like:

data Foo where
    Constructor1 :: Foo
    Constructor2 :: T -> Foo

This should actually be easier than non-GADT syntax, though, since the constructors that need to be added are just the lines after where with their correct type annotations already there.
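As a sketch of that idea (gadtConstructors is a made-up name, and the input is assumed to be the whole declaration as one string), the lines after where can be read as ordinary signatures. Lines that do not look like a new constructor signature are treated as continuations of the previous type. Several constructors sharing one signature (C1, C2 :: Foo) are not handled.

```javascript
// Hypothetical helper: read constructors from a GADT-style declaration.
function gadtConstructors(decl) {
  const m = /^data\s+[A-Z][\s\S]*?\bwhere\b([\s\S]*)$/.exec(decl);
  if (!m) return [];
  const found = [];
  for (const line of m[1].split("\n")) {
    const text = line.trim();
    if (text.startsWith("deriving")) break; // end of the constructor list
    const sig = /^([A-Z][\w']*)\s*::(.*)$/.exec(text);
    if (sig) {
      found.push({ name: sig[1], type: sig[2] });
    } else if (text !== "" && found.length > 0) {
      // Wrapped type: append to the constructor started on an earlier line.
      found[found.length - 1].type += " " + text;
    }
  }
  return found.map(({ name, type }) => ({
    name,
    doc: `${name} :: ${type.trim().replace(/\s+/g, " ")}`,
  }));
}
```

For the Foo declaration above, this gives Constructor1 :: Foo and Constructor2 :: T -> Foo, copied from the source with only the whitespace normalized.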

google / codeworld

Better parsing for extracting autocomplete/hint options from source code #798