Open kiritsuku opened 7 years ago
IntelliJ code inspections come to my mind: https://www.jetbrains.com/help/idea/2016.2/code-inspection.html. scalac linting is another use case.
Neither of them seems to need fast-response integration with an editor, so it may be OK to leave them out of the PC. Refactorings are useful and important, however.
Dotty treats positions a bit differently from scalac. The architecture is as follows:
Ok, so I see two different cases:
What was said about how dotc trees work sounds reasonable; if that could be made available through an API, I would be happy. Separating access to the trees from the PC is a good thing imho. Most people would get what they want from the PC without the burden of learning compiler internals. For static analysis one would still have to swallow the pill and use a compiler plugin, but at least the trees can't accidentally be exposed to the consumer of the PC, as is the case right now in scalac.
There is still the open question of what to do with refactorings. The most useful refactorings, like rename, organize imports, inline expression and extract definition, can be made available without having to look at the trees. More complicated refactorings would benefit from accessing the trees, but they are less important and most people can live without them. The mentioned refactorings could also be exposed through the language server protocol, but it would be really hard to implement that in the compiler. There are two downsides of the protocol that I can see right now:
Just saw that there's now Java support for VSCode: http://developers.redhat.com/blog/2016/09/19/java-language-support-for-visual-studio-code-has-landed/ with a server written in Java based on Eclipse JDT: https://github.com/gorkem/java-language-server
@smarter Eclipse Public License. Yuck.
Should we add ensime's requirements to this ticket, or should we create a new one? I'll add things as I remember them or think of them.
It's fine to add it here I think.
The ensime server uses the scalac API in a number of ways and for dotty support, we'd need replacements. It doesn't have to be a drop-in API as in many cases there is room for improvement.
ENSIME indexes all the binaries on the classpath using ASM to get everything Java knows about the classes, and also with scalap.
scalap gives some rudimentary information that can be used to augment our index, which is soon going to be backed by a graph database thanks to @sugakandrey and his work in GSoC and continued sponsorship from ensime users.
TASTY sounds like it is the ideal replacement for scalap and could be used to provide much more information than we ever have in the past.
One thing I'd like to encourage you to consider with TASTY is to store as much positional information as possible, so that it could be used to greater effect by debuggers. @chipsenkbeil has done a great job with the scala-debugger library, which is way ahead of the ensime client support. I can only imagine the things he could do if the TASTY format were to give lookup information from bytecode to individual closures, hints about variables and where they are defined, and so on.
I would hope that TASTY has a nice pure data ADT representation using sealed traits and case classes that doesn't require instantiating a compiler instance. This is critical... I work on huge codebases of 2 million+ lines of code, and when indexing it has to be "parse and throw away" or the heap would just blow up.
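To make the "pure data, parse and throw away" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `Entry`, `Position`, `IndexRow` and `index` are names invented for illustration and do not correspond to any actual dotty or TASTY API. It only shows the shape an indexer would like to consume: plain case classes that can be flattened into small index rows and then dropped.

```scala
object TastyAdtSketch {
  // Hypothetical data model; none of these names come from dotty or TASTY.
  final case class Position(path: String, start: Int, end: Int)

  sealed trait Entry
  final case class ClassEntry(fqn: String, pos: Position, members: List[Entry]) extends Entry
  final case class MethodEntry(fqn: String, pos: Position, signature: String) extends Entry
  final case class FieldEntry(fqn: String, pos: Position, tpe: String) extends Entry

  // What the index keeps per symbol: deliberately tiny.
  final case class IndexRow(fqn: String, path: String, offset: Int)

  // Flatten one entry (and its nested members) into index rows.
  def rows(e: Entry): List[IndexRow] = e match {
    case ClassEntry(fqn, pos, members) =>
      IndexRow(fqn, pos.path, pos.start) :: members.flatMap(rows)
    case MethodEntry(fqn, pos, _) => List(IndexRow(fqn, pos.path, pos.start))
    case FieldEntry(fqn, pos, _)  => List(IndexRow(fqn, pos.path, pos.start))
  }

  // "Parse and throw away": consume entries one at a time, push rows to a sink,
  // and keep no reference to the entries afterwards. Returns the number of rows emitted.
  def index(entries: Iterator[Entry])(sink: IndexRow => Unit): Int = {
    var count = 0
    entries.foreach { e =>
      rows(e).foreach { r => sink(r); count += 1 }
    }
    count
  }
}
```

Because `index` takes an `Iterator` and a sink, at most one top-level entry needs to be live on the heap at a time, assuming the (hypothetical) reader produces entries lazily.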
The possibility for semantic search implementations here is huge. e.g. being able to search across the entire indexed project for type queries such as "Seq[T] => Option[T]" and so on. Also, tooling like dead code analysis would be able to use positional lookup to great effect.
We also do source code reconciliation in our indexing phase, based on name and some heuristics. If the bytecode / TASTY were to contain the relative path foo/bar/Foo.scala and line/column (or position) instead of Foo.scala:113 like it does nowadays, this would greatly simplify our source resolver.
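A small sketch of why the relative path matters. The helper names and the example paths are made up for illustration, and this is not how the ensime source resolver is actually written. With only a file name, several sources can match and heuristics are needed; with a relative path, the lookup is exact.

```scala
object SourceResolverSketch {
  // Hypothetical: `sources` are the project's source files as relative paths.

  // All we can do with a bare file name such as "Foo.scala": collect candidates.
  def byFileName(sources: Seq[String], fileName: String): Seq[String] =
    sources.filter(p => p == fileName || p.endsWith("/" + fileName))

  // With a relative path such as "foo/bar/Foo.scala" the match is unambiguous.
  def byRelativePath(sources: Seq[String], relPath: String): Option[String] =
    sources.find(_ == relPath)
}
```

For a project containing both `foo/bar/Foo.scala` and `baz/Foo.scala`, `byFileName` returns two candidates that then need to be disambiguated, while `byRelativePath` returns exactly one.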
The most common queries in ensime rely on obtaining the type or symbol at point and then doing something with it, e.g. looking up in the index to get the source and then the editor can jump to that source.
Again, keep huge projects in mind: it is really important that the PC can operate in a mode where it takes as much information as possible from compiled binaries (imagine a workflow where the user compiles regularly), augmented by "source mode" only for the files that the user has open. This helps to deal with a lot of problems. With scalac, we have to restart the PC when enough binaries change.
If you could provide "jump to definition" support for source code files only (like the current PC does) that would be good, but it would be duplicated effort to go beyond this because we can already do it via our persisted index for the compiled binaries.
And of course, there is autocompletion and other features that require tree walking, such as being able to work out where the current block of code starts/ends (e.g. for expanding/contracting the region in emacs, or structured navigation).
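As an illustration of the kind of tree walking meant here, the sketch below finds the chain of enclosing nodes for a cursor offset. `Node` is a hypothetical stand-in for a positioned tree, not a scalac or dotc type, and ranges are assumed to be half-open `[start, end)`.

```scala
object EnclosingRangeSketch {
  // Hypothetical positioned node: a half-open source range [start, end) with children.
  final case class Node(label: String, start: Int, end: Int, children: List[Node] = Nil) {
    def contains(offset: Int): Boolean = start <= offset && offset < end
  }

  // Path from the root down to the innermost node containing `offset`.
  // Returns Nil when the offset lies outside the root.
  def pathTo(root: Node, offset: Int): List[Node] =
    if (!root.contains(offset)) Nil
    else root :: root.children.find(_.contains(offset)).map(pathTo(_, offset)).getOrElse(Nil)
}
```

The last element of the path is the node at point; walking the path backwards yields successively larger regions, which is what "expand region" needs, and each node's `start` and `end` answer where the current block begins and ends.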
One of the big areas right now that we have to deal with (and this is relevant for scalap/TASTY) is the translation between bytecode name, java name, scala name, scalac internal name, scalap name, and so on. We have converters for everything into/out of FQNs, and that allows us to convert from any to any. If dotty provided a mechanism for providing the FQN for everything, that would save us typing a lot of code! e.g. we use the FQN (which includes byte method signatures to disambiguate overloading) in the graph database to draw the links between what references what, and then we attach extra metadata such as the "scala name" onto those nodes.
We currently share the scala-refactoring library with scala-ide, which has seen recent polish from @mlangc.
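A rough sketch of what an FQN that disambiguates overloads could look like. The types below are invented for illustration and are not ensime's actual FQN classes; the only real-world convention assumed is the JVM method descriptor format (e.g. `(I)V` for a method taking an `Int` and returning `void`).

```scala
object FqnSketch {
  // Hypothetical FQN model; `render` produces the key used in an index.
  sealed trait Fqn { def render: String }

  final case class ClassFqn(pkg: List[String], name: String) extends Fqn {
    def render: String = (pkg :+ name).mkString(".")
  }

  // `descriptor` is a JVM method descriptor such as "(Ljava/lang/String;)V".
  // Including it makes overloaded methods map to distinct keys.
  final case class MethodFqn(owner: ClassFqn, name: String, descriptor: String) extends Fqn {
    def render: String = owner.render + "." + name + descriptor
  }
}
```

Two overloads of `foo.Bar.baz` then render as `foo.Bar.baz(I)V` and `foo.Bar.baz(Ljava/lang/String;)V`, so they can be separate nodes in the graph, with extra metadata such as the "scala name" attached to each.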
The biggest problem here is that the scalac tree api is not source preserving, so if we perform manipulation at the tree level, there is no way to get back to the code the user typed in. Consider if you wanted to rename a variable in a for-comprehension, where the user has entered comments. The amount of hackery to deal with that right now is quite high and it doesn't really work that well for a bunch of examples.
We had a GSoC with @xeno-by on this area, but I think the problem ended up being a lot harder than expected, so we are without a solution.
My dream would be having a shared library of refactorings and autocorrect hints with scala-ide / tooling / language server implementations that allow many common cases to be addressed. e.g. the classic "you forgot to implement these methods"
A huge problem right now is that if the compiler goes into a loop, we can't stop it, even if the user is no longer interested in that file.
It would be good to be able to cancel the PC, but also to be able to mark certain parts of code as "not analysed". An example of this is in typeclass derivation with shapeless. I'm considering a hack whereby we have a different implementation of the cachedImplicit macro, which does nothing in the PC except report back "this work has not been performed". That's a manual hack, but it would be good if we could detect these parts of code automatically and disable them.
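One possible shape for cancellation, sketched under the assumption that the PC's work can be split into steps that check a token between units. `CancelToken`, `Outcome` and `analyse` are hypothetical names, not part of any compiler API. Note the limitation: this is cooperative, so it cannot interrupt a single step that itself loops forever, which is exactly the hard case described above.

```scala
object CancellationSketch {
  import java.util.concurrent.atomic.AtomicBoolean

  // Hypothetical token that a client flips when the user loses interest.
  final class CancelToken {
    private val flag = new AtomicBoolean(false)
    def cancel(): Unit = flag.set(true)
    def isCancelled: Boolean = flag.get()
  }

  sealed trait Outcome
  final case class Completed(results: List[String]) extends Outcome
  final case class Cancelled(partial: List[String]) extends Outcome

  // Run `step` over each unit, checking the token before starting the next one.
  def analyse(units: List[String], token: CancelToken)(step: String => String): Outcome = {
    val done = List.newBuilder[String]
    val it = units.iterator
    while (it.hasNext) {
      if (token.isCancelled) return Cancelled(done.result())
      done += step(it.next())
    }
    Completed(done.result())
  }
}
```

Returning the partial results in `Cancelled` is a design choice for the sketch; a "this work has not been performed" marker, as in the cachedImplicit idea, could be reported the same way.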
We have http://github.com/ensime/pcplod now to help people write better macros and compiler plugins, but the reasons for failure are still a bit of a mystery. We're learning as we go.
For blackbox macros, the PC can deal with them easily... knowing the name and return type of a blackbox macro is all you need. But for whitebox, it can often mean that the syntax of the code doesn't even look like scala anymore, so being able to handle all the positional shifts is important. sbt's macros are an extreme example of this and something I'm investigating.
I've probably forgotten half of what we use the PC for. Hopefully @dragos and @rorygraves will expand or remind me. I'll try to keep this post updated as more ideas come to me.
@smarter ! Ping
Don't forget to read this :-)
I have ! :) I am also in the process of finishing a PR with a language-server implementation and a proof-of-concept high-level API to interact with the compiler. It is very far from addressing all of your concerns but it's a starting point!
Rescheduling to next release.
Readers of this thread may also be interested in the advice I recently gave to the LSP working group (and, btw, ensime has a fully working LSP implementation as of 6 months ago): http://ensime.github.io/lsp-wg/
ENSIME is hopefully moving to use the semanticdb for indexing information, to avoid duplicating efforts between our indexer and the scalameta project.
If dotty provided semanticdb, or an API on top of TASTY that provided the same information as semanticdb, it would solve half the problem, as it would mean the largest refactoring would be moving to use the new presentation compiler API, assuming all the other language issues can be solved (such as the scalaz-deriving compiler plugin and blackbox macros).
> If dotty provided semanticdb
Yes, we need a tasty-to-semanticdb generator, this should be relatively easy to implement in theory.
oh interesting. theoretically would a semanticdb-to-tasty generator be equally relatively easy?
As I understand it semanticdb only aims to provide a subset of the information found in tasty files.
Right, there's just not enough information in semanticdb to reconstruct trees that could be used for compilation.
This contains discussion about the introduction of a PC in dotc.
Quote from @odersky (https://github.com/lampepfl/dotty/pull/1521#issuecomment-248040716)
About (a): I know that it is hard, but I'm doing it anyway because I think it is the right thing to do. Using a PC in the IDE turned out to be a mistake over the years.
About (b): I know that it is not a concrete syntax tree, but I haven't looked at dotc in detail yet. Would you say that the information in dotc typed trees is on par with scalac typed trees, i.e. they are abstract but still contain correct positions to the source code? Because that is the only thing we really need. Refactorings are made more difficult by desugarings, but that is ok; we also had to live with it in scalac. Potentially there is scala.meta, which one day may be able to provide us with a typed concrete syntax tree that we could adopt in the future.