lampepfl / dotty-feature-requests

Historical feature requests. Please create new feature requests at https://github.com/lampepfl/dotty/discussions/new?category=feature-requests

Presentation Compiler #27

Open kiritsuku opened 7 years ago

kiritsuku commented 7 years ago

This contains discussion about the introduction of a PC in dotc.

Quote from @odersky (https://github.com/lampepfl/dotty/pull/1521#issuecomment-248040716)

The compiler could provide another API, just to make refactorings or static analysis possible but then the situation is mostly the same as in scalac.

So it's mostly refactorings that are missing from the VSC protocol? Which static analyses do you have in mind which would be hard to do?

IntelliJ code inspections come to my mind: https://www.jetbrains.com/help/idea/2016.2/code-inspection.html scalac linting is another use case.

The problem with handing the tree out is that (a) trees are compiler-specific, so hard to abstract over. (b) the typed trees that the compiler produces are not very suitable anyway because they are heavily desugared from source, and the desugarings cannot easily be reversed. And, I see no reasonable way to change this. The essence of the Typer is that it maps untyped trees to typed trees, but the typed trees have a different structure from the untyped trees.

About (a): I know that it is hard, but I'm doing it anyway because I think it is the right thing to do. Using a PC in the IDE turned out to be a mistake over the years.

About (b): I know that it is not a concrete syntax tree, but I haven't looked at dotc in detail yet. Would you say that the information in dotc typed trees is on par with scalac typed trees, i.e. they are abstract but still contain correct positions into the source code? That is the only thing we really need. Refactorings are made more difficult by desugarings, but that is OK; we also had to live with it in scalac. There is also scala.meta, which one day may be able to provide a typed concrete syntax tree that we could then adapt.

odersky commented 7 years ago

IntelliJ code inspections come to my mind: https://www.jetbrains.com/help/idea/2016.2/code-inspection.html. scalac linting is another use case.

Neither of them seems to need fast-response integration with an editor, so it may be OK to leave them out of the PC. Refactorings are useful and important, however.

odersky commented 7 years ago

Dotty treats positions a bit differently from scalac. The architecture is as follows:

  1. There are untyped and typed trees. Both kinds of trees are immutable. The job of the typer is to map one to the other.
  2. Untyped trees are very close to source and have precise range positions. For instance, for-expression generators and filters are expressed as particular untyped trees.
  3. Typed trees are a desugared subset of untyped trees. An untyped tree may map to several typed trees (example: an untyped case class def maps into a class def and a module def of the companion object).
  4. Typed trees also have range positions which are copied from the positions of the untyped trees from which they are generated.
  5. There is a navigation API which works with the positions in order to:
    • map a typed tree to the untyped tree from which it is derived,
    • map an untyped tree to the set of typed trees that derive from it.
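
The position-based navigation in the list above can be illustrated with a toy model in Scala. This is only a sketch under stated assumptions: `Span`, `UntypedTree`, `TypedTree`, `desugar`, `untypedFor` and `typedFor` are hypothetical names invented for the example and are not dotc's tree classes or its navigation API. The sketch assumes that a typed tree copies the exact span of the untyped tree it was generated from, so that equality of spans is enough to navigate in both directions.

```scala
object TreeNavigationSketch {
  // Hypothetical, simplified stand-ins; NOT dotc's actual tree classes.
  final case class Span(start: Int, end: Int)
  final case class UntypedTree(label: String, span: Span)
  final case class TypedTree(label: String, span: Span)

  // Toy desugaring: an untyped case class maps to a class def and a
  // companion module def; each typed tree copies the untyped span.
  def desugar(u: UntypedTree): List[TypedTree] =
    if (u.label.startsWith("case class ")) {
      val name = u.label.stripPrefix("case class ")
      List(TypedTree("class " + name, u.span), TypedTree("module " + name, u.span))
    } else List(TypedTree(u.label, u.span))

  // Typed tree -> the untyped tree it was derived from (matched by span).
  def untypedFor(t: TypedTree, untyped: List[UntypedTree]): Option[UntypedTree] =
    untyped.find(_.span == t.span)

  // Untyped tree -> every typed tree derived from it (matched by span).
  def typedFor(u: UntypedTree, typed: List[TypedTree]): List[TypedTree] =
    typed.filter(_.span == u.span)
}
```

In this model, desugaring `UntypedTree("case class Foo", Span(0, 20))` yields two typed trees that share the span `Span(0, 20)`, and `typedFor` recovers both of them from the untyped tree. A real implementation would have to cope with nested and synthetic trees, which this sketch ignores.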

kiritsuku commented 7 years ago

Ok, so I see two different cases:

  1. An interface that exposes basic functionality like code completion and error markers. This is what the PC can be for (and also what the language server protocol can provide).
  2. An interface that gives full-fledged access to the trees. This is what a compiler plugin could provide; there is no need to expose it through the PC.

What was said about how dotc trees work sounds reasonable. If that could be made available through an API, I would be happy. Separating access to the trees from the PC is a good thing imho. Most people would get what they want from the PC without the burden of learning compiler internals. For static analysis one would still have to swallow the pill and use a compiler plugin, but at least the trees can't accidentally be exposed to the consumer of the PC, as is the case right now in scalac.

There is still the open question of what to do with refactorings. The most useful refactorings, like rename, organize imports, inline expression and extract definition, can be made available without having to look at the trees. More complicated refactorings would benefit from accessing the trees, but they are less important and most people can live without them. The mentioned refactorings could also be exposed through the language server protocol, but it would be really hard to implement that in the compiler. There are two downsides of the protocol that I can see right now:

  1. The protocol still needs to evolve and will evolve. Except for the rename refactoring, it doesn't provide refactoring support. Other features like semantic highlighting and even fine-grained syntax highlighting are missing.
  2. Enormous implementation effort on the compiler side. As was already said, one would have to reimplement some functionality of Ensime, and there is a lot of functionality that could be added but is not essential. That is not necessarily bad, but it would lead to additional maintenance effort on the compiler side.

smarter commented 7 years ago

Just saw that there's now Java support for VSCode: http://developers.redhat.com/blog/2016/09/19/java-language-support-for-visual-studio-code-has-landed/ with a server written in Java based on Eclipse JDT: https://github.com/gorkem/java-language-server

fommil commented 7 years ago

@smarter Eclipse Public License. Yuck.

fommil commented 7 years ago

should we add ensime's requirements to this ticket or should we create a new one? I'll add things as I remember them, or think of anything.

smarter commented 7 years ago

It's fine to add it here I think.

fommil commented 7 years ago

ENSIME requirements

The ensime server uses the scalac API in a number of ways and for dotty support, we'd need replacements. It doesn't have to be a drop-in API as in many cases there is room for improvement.

scalap / TASTY

ENSIME indexes all the binaries on the classpath using ASM to get everything Java knows about the classes, and also with scalap.

scalap gives some rudimentary information that can be used to augment our index, which is soon going to be backed by a graph database thanks to @sugakandrey and his work in GSoC and continued sponsorship from ensime users.

TASTY sounds like it is the ideal replacement for scalap and could be used to provide much more information than we ever have in the past.

One thing I'd like to encourage you to consider with TASTY is to store as much information as possible about positional information, so that it could be used to greater effect by debuggers. @chipsenkbeil has done a great job with the scala-debugger library, which is way ahead of the ensime client support. I can only imagine the things he could do if the TASTY format was to give lookup information from bytecode to individual closures, and hints about variables and where they are defined, etc etc.

I would hope that TASTY has a nice pure data ADT representation using sealed traits and case classes that doesn't require instantiating a compiler instance. This is critical... I work on huge codebases of 2million+ lines of code and when indexing, it has to be "parse and throw away" or the heap would just blow up.
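
To make the "pure data ADT" request concrete, here is a minimal Scala sketch of what such a representation could look like. Everything in it is an assumption made for illustration: `SourcePos`, `Definition`, `ClassDef`, `MethodDef`, `ValDef` and `indexEntries` are hypothetical names and do not describe the real TASTY format or any existing API. The point is only that sealed traits and case classes can be traversed without a compiler instance, and that an indexer can keep a small summary and let the rest be garbage collected.

```scala
object TastyAdtSketch {
  // Hypothetical shapes for illustration only; NOT the real TASTY format.
  final case class SourcePos(path: String, line: Int, column: Int)

  sealed trait Definition {
    def name: String
    def pos: SourcePos
  }
  final case class ClassDef(name: String, pos: SourcePos, members: List[Definition])
      extends Definition
  final case class MethodDef(name: String, pos: SourcePos, signature: String)
      extends Definition
  final case class ValDef(name: String, pos: SourcePos, tpe: String)
      extends Definition

  // "Parse and throw away": keep only (name, position) pairs for the index,
  // so the full definition tree does not need to stay on the heap.
  def indexEntries(d: Definition): List[(String, SourcePos)] = d match {
    case ClassDef(n, p, members) => (n, p) :: members.flatMap(indexEntries)
    case MethodDef(n, p, _)      => List((n, p))
    case ValDef(n, p, _)         => List((n, p))
  }
}
```

Because the trait is sealed, the compiler checks that the pattern match in `indexEntries` is exhaustive, which is one reason an ADT is convenient for tooling that has to follow format changes.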

The possibility for semantic search implementations here is huge. e.g. being able to search across the entire indexed project for type queries such as "Seq[T] => Option[T]" and so on. Also, tooling like dead code analysis would be able to use positional lookup to great effect.
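
As a rough illustration of a type query such as "Seq[T] => Option[T]", the following Scala sketch matches signatures up to renaming of type variables. It is a toy under stated assumptions: the type representation (`TVar`, `TApp`, `Fn`), the `normalise` and `search` functions, and the in-memory `Map` used as an index are all hypothetical and unrelated to ENSIME's index or to TASTY. It also ignores subtyping, variance and implicit conversions, which a real semantic search would have to consider.

```scala
object TypeSearchSketch {
  // Hypothetical, minimal type representation for illustration only.
  sealed trait Tpe
  final case class TVar(name: String) extends Tpe
  final case class TApp(ctor: String, args: List[Tpe]) extends Tpe
  final case class Fn(from: Tpe, to: Tpe) extends Tpe

  // Rename type variables in order of first occurrence, so that
  // Seq[T] => Option[T] and Seq[A] => Option[A] get the same shape.
  def normalise(t: Tpe): Tpe = {
    val seen = scala.collection.mutable.LinkedHashMap.empty[String, String]
    def go(t: Tpe): Tpe = t match {
      case TVar(n)       => TVar(seen.getOrElseUpdate(n, "T" + seen.size))
      case TApp(c, args) => TApp(c, args.map(go))
      case Fn(a, b) =>
        val a1 = go(a) // visit the parameter type first
        Fn(a1, go(b))
    }
    go(t)
  }

  // Return the names of all indexed signatures equal to the query
  // after normalisation, in alphabetical order.
  def search(index: Map[String, Tpe], query: Tpe): List[String] = {
    val q = normalise(query)
    index.collect { case (name, tpe) if normalise(tpe) == q => name }.toList.sorted
  }
}
```

With an index containing a `headOption`-like entry of shape `Seq[A] => Option[A]`, a query written with `T` instead of `A` finds it, because both normalise to the same tree.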

We also do source code reconciliation in our indexing phase, based on name and some heuristics. If the bytecode / TASTY were to contain the relative path foo/bar/Foo.scala and line/column (or position) instead of Foo.scala:113 like it does nowadays, this would greatly simplify our source resolver.

Typechecker

The most common queries in ensime rely on obtaining the type or symbol at point and then doing something with it, e.g. looking up in the index to get the source and then the editor can jump to that source.

Again, keep huge projects in mind: it is really important that the PC can operate in a mode where it takes as much information as possible from compiled binaries (imagine a workflow where the user compiles regularly), augmented by "source mode" only for the files that the user has open. This helps to deal with a lot of problems. With scalac, we have to restart the PC when enough binaries change.

If you could provide "jump to definition" support for source code files only (like the current PC does) that would be good, but it would be duplicated effort to go beyond this because we can already do it via our persisted index for the compiled binaries.

And of course, there is autocompletion and other features that require tree walking, such as being able to work out where the current block of code starts/ends (e.g. for expanding/contracting the region in emacs, or structured navigation).

One of the big areas right now that we have to deal with (and this is relevant for scalap/TASTY) is the translation between bytecode name, java name, scala name, scalac internal name, scalap name, and so on. We have converters for everything into/out of FQNs, and that allows us to convert from any to any. If dotty provided a mechanism for providing the FQN for everything, that would save us typing a lot of code! e.g. we use the FQN (which includes byte method signatures to disambiguate overloading) in the graph database to draw the links between what references what, and then we attach extra metadata such as the "scala name" onto those nodes.
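
The idea of an FQN that carries a method signature to disambiguate overloads can be sketched in Scala as follows. The types `Fqn`, `ClassFqn` and `MethodFqn` are hypothetical and are not ENSIME's actual data model; the only thing taken as given is the standard JVM convention of slash-separated internal class names and method descriptors such as `(I)V`.

```scala
object FqnSketch {
  // Hypothetical FQN model for illustration; NOT ENSIME's actual types.
  sealed trait Fqn { def fqnString: String }

  final case class ClassFqn(pkg: List[String], name: String) extends Fqn {
    // Dotted name as written in Java or Scala source.
    def fqnString: String = (pkg :+ name).mkString(".")
    // JVM internal name as it appears in bytecode.
    def internalName: String = (pkg :+ name).mkString("/")
  }

  // The JVM method descriptor, e.g. "(I)V", distinguishes overloads
  // that share an owner and a name.
  final case class MethodFqn(owner: ClassFqn, name: String, descriptor: String)
      extends Fqn {
    def fqnString: String = owner.fqnString + "." + name + descriptor
  }
}
```

Two overloads of `baz` on `foo.bar.Foo`, one taking an `Int` and one taking a `String`, then get the distinct keys `foo.bar.Foo.baz(I)V` and `foo.bar.Foo.baz(Ljava/lang/String;)V`, which is what makes them usable as node identifiers in an index.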

Refactoring

We currently share the scala-refactoring library with scala-ide, which has seen recent polish from @mlangc.

The biggest problem here is that the scalac tree api is not source preserving, so if we perform manipulation at the tree level, there is no way to get back to the code the user typed in. Consider if you wanted to rename a variable in a for-comprehension, where the user has entered comments. The amount of hackery to deal with that right now is quite high and it doesn't really work that well for a bunch of examples.

We had a GSoC with @xeno-by on this area, but I think the problem ended up being a lot harder than expected, so we are without a solution.

My dream would be having a shared library of refactorings and autocorrect hints with scala-ide / tooling / language server implementations that allow many common cases to be addressed. e.g. the classic "you forgot to implement these methods"

Cancellation

A huge problem right now is that if the compiler goes into a loop, we can't stop it, even if the user is no longer interested in that file.

It would be good to be able to cancel the PC, but also to be able to mark certain parts of code as "not analysed". An example of this is in typeclass derivation with shapeless. I'm considering a hack whereby we have a different implementation of the cachedImplicit macro, which does nothing in the PC except report back "this work has not been performed". That's a manual hack, but it would be good if we could detect these parts of code automatically and disable them.
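
One possible shape for this, sketched in Scala, is a cancellation token that is checked between units of work, combined with an explicit "not analysed" result. This is an assumption-laden illustration: `CancelToken`, `Outcome`, `Analysed`, `NotAnalysed` and `analyse` are hypothetical names, the `skip` predicate stands in for whatever would detect expensive code, and no real typechecking happens. Note that a cooperative check like this cannot interrupt a loop inside a single unit; that would need checks inside the compiler itself.

```scala
import java.util.concurrent.atomic.AtomicBoolean

object CancellationSketch {
  // Hypothetical token that a client can flip from another thread.
  final class CancelToken {
    private val flag = new AtomicBoolean(false)
    def cancel(): Unit = flag.set(true)
    def isCancelled: Boolean = flag.get()
  }

  sealed trait Outcome
  final case class Analysed(unit: String) extends Outcome
  final case class NotAnalysed(unit: String, reason: String) extends Outcome

  // Cooperative cancellation: the token is checked before each unit, and
  // units matched by `skip` are reported back as not analysed.
  def analyse(units: List[String], token: CancelToken, skip: String => Boolean): List[Outcome] =
    units.map { u =>
      if (token.isCancelled) NotAnalysed(u, "cancelled")
      else if (skip(u)) NotAnalysed(u, "skipped")
      else Analysed(u) // placeholder for the real analysis of `u`
    }
}
```

Returning `NotAnalysed` instead of silently dropping the unit lets the editor show the user that "this work has not been performed", as described above.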

Macros and library testing in the PC

We have http://github.com/ensime/pcplod now to help people write better macros and compiler plugins, but the reasons for failure are still a bit of a mystery. We're learning as we go.

For blackbox macros, the PC can deal with them easily... knowing the name and return type of a blackbox macro is all you need. But for whitebox macros, it can often mean that the syntax of the code doesn't even look like scala anymore, so being able to handle all the positional shifts is important. sbt's macros are an extreme example of this and something I'm investigating.

More...

I've probably forgotten half of what we use the PC for. Hopefully @dragos and @rorygraves will expand or remind me. I'll try to keep this post updated as more ideas come to me.

fommil commented 7 years ago

@smarter ! Ping

Don't forget to read this :-)

smarter commented 7 years ago

I have ! :) I am also in the process of finishing a PR with a language-server implementation and a proof-of-concept high-level API to interact with the compiler. It is very far from addressing all of your concerns but it's a starting point!

DarkDimius commented 7 years ago

Rescheduling to next release.

fommil commented 6 years ago

readers of this thread may also be interested in the advice I recently gave to the LSP working group (and, btw, ensime has a fully working LSP implementation as of 6 months ago): http://ensime.github.io/lsp-wg/

fommil commented 6 years ago

ENSIME is hopefully moving to use the semanticdb for indexing information, to avoid duplicating efforts between our indexer and the scalameta project.

If dotty provided semanticdb, or an API on top of TASTY that provided the same information as semanticdb, it would solve half the problem, since the largest remaining refactoring would be moving to the new presentation compiler API. That assumes all the other language issues can be solved (such as the scalaz-deriving compiler plugin and blackbox macros).

smarter commented 6 years ago

If dotty provided semanticdb

Yes, we need a tasty-to-semanticdb generator, this should be relatively easy to implement in theory.

dwijnand commented 6 years ago

oh interesting. theoretically would a semanticdb-to-tasty generator be equally relatively easy?

gsps commented 6 years ago

As I understand it, semanticdb only aims to provide a subset of the information found in tasty files.

smarter commented 6 years ago

Right, there's just not enough information in semanticdb to reconstruct trees that could be used for compilation