Interactive codegen and refactoring

idris-lang / Idris-dev

A Dependently Typed Functional Programming Language

http://idris-lang.org

Other

3.41k stars 642 forks source link

Interactive codegen and refactoring #3542

Closed mmn80 closed 7 years ago

mmn80 commented 7 years ago

I am working on a new C RTS that is capable of code hot loading and I need to be able to compile individual Idris modules quickly. Unfortunately keeping an idris process in Ide mode only affords interpretation, but if I try to compile code it will fork a separate codegen process which has to load and grind all IBC files, including the Prelude (min. 2 sec on my machine), defeating the purpose of Ide mode for this scenario.

The current C backend does not even spawn the external process, but rather does it inline, as it should, I'd argue. In fact there is an obvious bug in the Main.hs of idris-codegen-c from which I draw that no one ever used it before.

[EDIT: see new proposal based on cabal packages in comments]

I understand the reason for keeping the generators separate (easy maintainance), but I propose a different solution, one that upholds the interactive property of Ide mode: git submodules (also used by GHC). I never tried them before, but from what I read they were created specifically to address this problem, and look like a sort of sandboxed, symlinked, foreign repo-folders. They address the security issue by demanding end users to manually pull their contents independently from the host repo's.

We could add cabal flags to completely disable compiling these subfolders and could even automate all git management commands, to make it seamless for users. After setting up the build system (basically adding hooks into our Setup.hs toward some Makefiles in codegen's root and deferring all further responsibility to the codegen, plus the git and flags stuff), there will be no more work required from Idris maintainers, other then adding a few lines of code every time a new codegen wants in.

An advantage for this proposal is that only minor modifications are warranted to Haskell source (other then build scripts), since the current C backend already works this way, as mentioned above. We just need to make a list where generators register functions like codegenC which are called by generate. Another advantage is that we could immediately move the existing codegens to external repos and speed up compilation this way for people who don't need them.

I know that Edwin does not care about modules, neither ML nor Erlang style, and I'm only writing this sentence in order to raise the number of different meanings of the word "module" in this post to 5.

mmn80 commented 7 years ago

I forgot to say that I am ready to do this work myself, but I need to know if the idea is accepted. Maybe there is an obvious issue with it. Otherwise I'll have to do my thing on a fork instead.

mmn80 commented 7 years ago

I found the older refactoring thread: #1539.

New day, new plan: scrap the git submodules and do it with packages, but in a maximally general and minimally disruptive way:

an idris-core package that will contain everything except codegens and the main executable. All build complexity, including docs and tests, will move to idris, leaving idris-core a simple, pure cabal package, easy to refactor or depend on for other projects. The environment variables defined in IRTS.System will be abstracted (eg: uninitialized IORefs) and instantiated by idris. All the rest of Haskell files remain unchanged.
various codegen packages that depend on idris-core. Note that codegens should sit close to front rather then back, since someone might want to generate Agda from surface syntax, for example, or use other high level hints to drive generation.
idris depends on all of the above and its main purpose is to integrate and build the executable, libs, tests and docs. Dependencies on codegen packages are under cabal flag and a module is added to idris to conditionally import the codegen functions. Other languages built on top of idris-core will have to do the same as idris, but they will have access to all the generators for free (as long as the AST stage they build upon is compatible with the generator's, which is just a matter of type checking).

I believe that this would be the right way for codegens, and it's an incremental step that enables people to further split idris-core later, in line with the concerns expressed in #1539. All current codegens will remain in this repo, while new ones can be registered with a few lines of code. Moreover, this architecture supports the interactive compilation I was blabbering about in the OP.

mmn80 commented 7 years ago

I think I'll leave this lively discussion for now and go ahead and implement a PR. This way, I'll discover precisely what other hidden cyclical dependencies lurk inside, and we'll be able to discuss more concretely.

ahmadsalim commented 7 years ago

@mmn80 Sorry for not getting back to you sooner. Am I understanding correctly that the overall goal is to separate the code generators from the rest of idris to have a more modular design?

PS: It would be nice if you summarized your design using a bit shorter text. I am having a bit hard time following the proposal.

mmn80 commented 7 years ago

Sorry for the long text. Only the last comment contains the proposal in its entirety. I changed my mind after reading #1539.

Design:

the idris-core package contains all Haskell code, has a Simple build, and the paths and all other environment vars are abstracted (to be instantiated by frontends)
codegens completely split out (depend only on idris-core)
frontend (idris) depends on idris-core and on all codegens enabled by users with cabal flags; it contains no code other then Main.hs, just builds executable, plus tests, docs, and libs

Goals:

all desired codegens are built into the main executable and can be used from REPL/IDE/RTS with no lag
idris-core and codegens are modularised and can be reused for new frontends and new languages
much easier in the future to refactor idris-core further if desired (as mentioned in #1539)

This isn't much shorter, but I hope is clearer.

jfdm commented 7 years ago

Hi @mmn80 thanks for posting the proposals.

Certainly interesting. IIRC #1539 was abandoned as the proposer found an alternate route for their research, and refactoring the internals for better organisation was not an immediate priority.

Looking forward to seeing how this pans out, and how easy it is to maintain and affect for future iterations of Idris.

Ultimately, I think this will be up to @edwinb to decide, as primary maintainer and releaser.

edwinb commented 7 years ago

The way I did it was the simplest way that allows people to write new codegens without the Idris core needing to know about them - we used to have four different ones and it became something of a maintenance burden. Anything that keeps this property is fine especially if it results in better organisation.

By the way, I know it's not relevant to the proposal, but I don't know where you get the idea that I don't care about modules. I have said in the past that it's not a research problem I'm interested in, but my interests change, and not being interested is not the same as not caring... Idris interfaces are remarkably close to ML modules, in any case.

mmn80 commented 7 years ago

Hi @edwinb. I did not want to go into details about my live coding RTS project in order to keep the post short, and also because it might fail. With that poor attempt at a joke I tried to communicate a bit (it's about Erlang, rather then ML style modules), but I should have known not to use loaded language when talking to strangers over the Internet. In fact, this is even mentioned in the CONTRIBUTING page. Sorry about that, I will refrain from such twisted speech in the future.

The work is progressing well, and I'll write a detailed PR when I'm done. I just hope that merging new commits from upstream would not be too painful, since the location of the files has changed. I made the commits such that git describes file migrations with renaming, rather then deleted and new file.

mmn80 commented 7 years ago

@jfdm suggested on IRC that I should describe here in more detail what the problems I'm trying to deal with are, in order to clear some confusions, and focus the PR more on the solution.

The topic is extension, compile and run time. idris has excellent support for compile time extensions, with powerful types, syntax rules, elaborator reflection, etc. But idris also has some runtimes: those in the C and JavaScript codegens, which are not the subject of this post, and the REPL/IDE server processes, which is what I mean by "runtime" below.

Issue: insufficient mechanisms for runtime extensions

IPC is not very good:

converting messages to Json and back is inefficient
in batch mode it requires expensive initialization: loading all Prelude IBCs, plus all that depend on the module I try to compile, even while they are already loaded in my idris server process; this is what motivated me in the first place
in daemon mode does not scale: requires either platform-dependent solutions like DBus+systemd, or turning idris into an OS-like master process, both hard to get right; idris alone is nice as it just has a standard port the editors know about
the versions of these tools can get silently out of sync and cause havoc on global data/cache

Current codegen system is good for what it is, a simple solution for an immediate problem, but, for the reasons above, is not suited as a general extension mechanism to be used effectively by 3rd party IDE/REPL tool writers, and the lack of process centralization is especially terrible for end users.

Solution: dynamic bindings

Simplest sustainable solution would be to have a way of registering tools to the runtime from within, since Haskell has built-in concurrency support and no marshaling would be needed. This requires global state (MVar at top level in a core library module).

Once put in place, the MVar can also be used to link extension threads up, so that they can freely interact with each other and the REPL. So a bit more like a system with dynamic bindings like the Elisp inside Emacs or the similar system they use at yi, except that the extension is written in Haskell, and, since we are actually inside an Idris, can even be Idris (passing Terms as messages, holding IState in the global dynamic state, using the evaluator). There can be a special Eff language for this editor stuff.

But this design creates a dependency problem that needs to be addressed (next issue).

Issue: conflation of runtime and library

This is a problem because runtime dependencies go backwards as compared to the compile time ones. My codegen needs to depend on idris the library, but idris the front end needs to depend on my codegen, since otherwise it would no longer be the front end. This pushes the frontend upfront, and forces users to create a host of batteries-included bundles, which is inconvenient.

Separation of the two allows creation of packages in between, and the default frontend can easily update a selection of bundled, but disabled by default, extensions.

Additionally, it simplifies maintenance of forks of either of the two, creation and maintenance of new front ends (eg: alternate Prelude), and refactoring of the library, which will have a simplified build system.

Issue: bundled but unpackaged extensions

Since the parts (codegens, etc) of the current package are conceptually distinct, situations arise during building or testing when they do need to be treated differently. So the build system already has this complexity within, except it's somewhat implicit, harder to debug and can more easily cause maintainers to make mistakes and break the build. For new, potential contributors, looking at the unified cabal or Setup.hs file and understanding what belongs to what is an extra cognitive burden.

But packaging the codegens while also keeping them bundled forces us to deal with the dependency issue above and to do the split regardless of other concerns.

Conclusion

So I think the confusion around this project, both in terms of communication with you guys, and some mistakes I've made, comes from the fact that I try to do a number of different things at the same time. One is introducing dynamic extension hooks in the REPL, which leads to design decisions that have impact on the package structure, and others are general refactoring issues that I picked up from the old refactoring thread and also from what I've seen throughout the codebase. And this while my main motivation is just learning some new tricks.

I hope that the repo will end up improved for maintenance, extension, and just picking up by new users.

ahmadsalim commented 7 years ago

@mmn80 Could you please provide a couple of examples of what kind of runtime extensions you are interested in?

mmn80 commented 7 years ago

Well, upon further reflection I no longer believe this doubling down on the Haskell side is that much worthwhile.

So I was thinking of integrating Haskell packages such as yi or diagrams as a set of dynamically registered commands in the REPL/IDE of a type like SExpr -> Idris SExpr, rather then just SExpr -> IO SExpr or the current external codegen's Args -> IO (). An Idris mode for yi, for example would have to depend on both idris-core and yi, so the refactored architecture would allow for 3rd parties to implement such a package, to be bundled as a disabled by default dependency in the idris package, leading end users to install idris like this:

stack install idris --flag --idris:codegenC --flag --idris:yi --flag --idris:diagrams

and resulting in a single executable where the editor, Idris repl/ide and other things would benefit from GHC's shared memory concurrent runtime, and combined type checking. The :e command would switch modes instantaneously with no state loss on either side, etc. Other examples: IHaskell or Haskell for Mac style environments, interactive charting or code visualization.

But considering that the Idris REPL is already runtime extensible on the Idris side (the table of symbols :x can take as arguments can be dynamically updated with :l) this direction would eventually lead to an Idris to Haskell code generator and FFI. We would write these libraries in Idris, then benefit from the shared memory runtime in one of a few ways:

Create and bundle into idris 3rd party packages as before, except that they contain mostly Idris code, and on a build hook we would use the Haskell codegen to generate the Haskell containing the Idris REPL symbol registration calls.
Change the :l command to generate Haskell code beside the Main.hs of idris, modify Main.hs to import it, and use a bundled dyre to recompile and reload.

The alternative on the Idris side looks more promising. We would essentially implement the -fobject-code flag of ghci, that would change the :l command to compile Idris to C, create a shared library and link it into the executable. We would need to include 2 SExpr marshaling functions in the C and other 2 in Haskell, and ensure that all functions are declared for export. Even codegens could be eventually implemented in Idris, at least as Haskell wrappers (the specialized FFI needed here would be much easier to do), and just called like :x codegenC context, where context would be a built in binding.

So let's compare:

the Haskell codegen certainly requires more work
I cannot tell which among the gcc+ld and ghc+dyre will win the slowness award, but I suspect the latter
the Haskell solution is more platform dependent, since C FFI targets a larger pool of libraries
especially in the future of Pi types in GHC, where the codegen could translate the surface AST more directly, the generated code would benefit from most of GHC's optimizations, but by that time Idris too could have more of these implemented
for the shared heap situation, the codegen functions were really special in that they got as argument pretty much the entire Idris state as a [SDecl], but for must other things compile vs. interpreted would matter more, for speed, then shared vs message passing
on the Haskell side, you need to implement dynamic linking as well, to keep the graphics alive

In conclusion I'll restate the problems this refactoring addresses, after acknowledging that the entire "insufficient mechanisms" section in the post above is pretty dumb:

it allows for as efficient as possible Haskell written codegens, that can be easily bundled into idris, reducing the maintenance burden of a special IPC protocol and Main.hs
moving the existing codegens into separate repos if desired
better code organisation, less potential for build breakage
further refactoring of idris-core to achieve other goals (like those mentioned in the old refactoring thread), more customisable front end builds
alternate frontends (eg: alternate Prelude, the dyre system mentioned above), in general bundling things the main Idris does not accept for various reasons

mmn80 commented 7 years ago

Status update

The refactoring work is now a minimalistic conservative extension. Everything that used to work, including the external process codegens written in Haskell, will continue to work the same, but now you will also be able to publish interactive codegens and Idris libs that will be integrated by the frontend.

In order to minimize the number of files that need to be changed (for easier reviewing of the PR and decreased chance of mistakes), I'm still using a very basic MVar based inversion of control scheme implemented entirely in a centralized location, which is IRTS.System. I've been very strict about this and deferred any further refactoring work for future PRs. But I also thought more about the general problem.

Ideas for the future

After re-reading Dan Piponi's Quick and dirty reinversion of control, I realized that a general solution for such extensions (which I won't be implementing as part of this PR) would be adding ContT as a new layer to the Idris monad. Even better, pipes (or machines, which are CPSed by default for performance reasons), has a more sophisticated vocabulary that lets you to not just yield and await, but also request and respond, as it turns out all of these are useful. But let me explain in more detail.

The Idris monad based code in idris-core will be, in the language of pipes, a Client, while the fronend's Main.hs will connect it to a Server. The client will make requests to the front end using a simple ADT:

data IdrisRequest = ExecCodegen CodegenInfo
                  | GetCRTSInstallDir
                  | GetIdrisLibInstallDir LibName
                  | ExecREPLCommand CommandName CommandArgs
                  | ExecIDECommand CommandName CommandArgs
                  ...

For instance, the generate function will say request (ExecCodegen cg), and other parts of core will say dir <- request GetCRTSInstallDir insted of just dir <- getDataDir with its baked in import Idris_paths.

The front end's Server will then dispatch these requests to the appropriate implementation package in its build-depends and return results into a corresponding IdrisResponse ADT. The front end package contains nothing in it but this Main.hs and the bundle of features in its build-depends.

It will be very easy for other people to maintain batteries-included forks of the standard idris front end, since idris will almost never change, so no merging work will be needed. In fact, IDE tools could even automatically generate from a template such a package based on user selections in a UI, and by scanning dependencies in the ipkg, and create a custom idris executable for each project. Hackage could be used as an Idris package manager with very little effort. I do not want to argue with anyone whether we want this or not, I'm just mentioning it as a possibility.

From the constructors in the type above you can tell what extensions I thought of. 3rd party packages will be able to export codegens, Idris libs and REPL/IDE commands (the current built in ones can be lifted into a separate package as well), and all of them will be integrated by the front end.

Comments

But even with just the very quick and very dirty solution I'm working on, the idris-core package will become lifted / platform-independent to a much larger degree then before. It will also maintain its integrity, since all current Haskell code will be in it, and not hamper language research by creating inconvenient package boundaries that other, more intrusive, refactoring projects might involve.

This refactoring focuses exclusively on disentangling the build system across the conceptual boundaries that already exist, but lays the ground work for more interesting language development related projects.

If anyone somehow managed to get through this whole wall of text, I would be really interested in your feedback.

ahmadsalim commented 7 years ago

@mmn80 Thanks for the update, we appreciate the effort! If you have a version that is conservatively working with Idris please open a PR as soon as possible, so that we can review the code and perhaps merge it before there becomes too much bitrot.

mmn80 commented 7 years ago

I still have things to do. I will send a PR only when I get it to actually work, which will be sometime next week, hopefully. I didn't have the curiosity to try and merge new commits, but I'm sure git will fail me once again and I'll have to do everything manually.

mmn80 commented 7 years ago

PR opened.