racket / rhombus

Rhombus programming language
Other
341 stars 61 forks source link

An approach to cyclic module dependencies #166

Open rocketnia opened 3 years ago

rocketnia commented 3 years ago

In Racket, two modules can't require each other. Any attempt to do so causes a cyclic dependency error. However, a #lang built on top of Racket doesn't necessarily have the same limitation.

Conceptually, let's say we have a module U that depends on modules B and C, while B and C are part of a set of modules {A, B, C, D, E} that's all tied up in cyclic dependencies. Visually, the dependencies look something like this:

A directed graph. Node U has dependencies pointing to nodes B and C, while nodes A, B, C, D, and E have some messy cyclic dependencies amongst themselves.

[These graphs are made with Geogebra.]

If we try to represent these dependencies directly with #lang racket requires, Racket won't have any of it. All of these modules will fail to compile.

In a new #lang, we can address this by putting a declaration in each module:

cyclic { A; B; C; D; E }

This declaration designates the first of these modules, A, as the host. In modules B, C, D, and E, the declaration tells them they're not the host, so they do basically nothing during compilation except wrap their syntax up in a bow with a "for A" gift tag on it. In module A, the same declaration informs A that it is the host, and it also tells A where all the other modules in the cycle are. So module A collects up all the syntax from modules B, C, D, and E, and it compiles all this syntax in one go.

If the client module U is still using #lang racket, there is a slight change needed to its code. It can't just require things from B and C directly, since those things are actually somewhere in module A. So module U needs to use a new variant of require. With this new variant, if the file it's importing from is a non-host member of a cycle, it looks at the gift tag to figure out where the host module is. Then it requires that module, in this case module A, and sifts through a bundle of metadata to find the actual names of the exports it's looking for.

With this approach, as far as the Racket compiler is concerned, there are no cyclic dependencies anywhere. Module U depends on B and C as before, but now it also implicitly depends on the host A, which B and C guided it to. Module A, the host, depends on modules B, C, D, and E, Finally, B, C, D, and E don't depend on anything, since they compile into simple redirection-and-source-code bundles.

An updated directed graph with the same nodes as before. Node U still depends on B and C, but now it also depends on A. Node A depends on nodes B, C, D, and E.

Slightly more convenient with a manifest

If we use a project manifest module (as described in https://github.com/racket/rhombus-brainstorming/issues/165), we can potentially put the cyclic declaration in there rather than copying it into all the individual modules.

Still, in small projects, I don't think copying and pasting it around will be much of a hassle. It does mean that a new module can't be added to the cycle without editing all of them.

Splitting this into cyclic_host_of { B; C; D; E } and cyclic_guest_of A declarations could make it more more convenient to add and remove guests, but I chose not to go with that. Picking the host of the cycle is a rather arbitrary decision, and visually verifying the consistency of these declarations across every module in a cycle is probably a lot easier when they all look exactly the same.

Why not host everything in the manifest module itself?

Partly because that creates a cycle. If module B determines whether it's part of a cycle by consulting the manifest, then the manifest can't turn around and determine B's source code by asking B for it.

More importantly, if we centralized all the compilation results into one place, then editing just one file in the codebase would force a recompilation of the whole codebase. A cyclic set of modules can't help but force each other's recompilation, but we can at least keep this contained by letting each cycle do its own hosting.

rocketnia commented 3 years ago

I've spotted an additional wrinkle to consider: Various tooling will need to know to follow guest-to-host redirects so that it can run configure-runtime submodules for REPL initialization, run test submodules, and discover binding arrows and other syntax properties that appear only in the compilation result (which is in the wrong module).

tgbugs commented 3 years ago

I spent an unreasonable amount of time disentangling circular dependencies when modularizing an old plt-scheme codebase that was not originally written with modularization in mind. The biggest pain point was the inability to get good debug info on what was causing the cycles. I eventually wrote a tool to see the require tree, but of course it won't work when you already have a cycle. This is of course a rare case, because usually you don't usually have hundreds of files with implicit dependencies that you suddenly want to modularize these days.

There are other cases where having a way to break cycles or allow cycles would be extremely helpful. The fact that modularization forces file boundaries to follow the technical requirements of the module system prevents users from organizing code in ways that might be easier to understand conceptually. This proposal could go a long way toward giving users more freedom in how they organize code into files.

slaymaker1907 commented 2 years ago

One problem with circular dependencies in a language like Racket is that they can and often do execute code when imported (i.e. at the top level of the module). Here is an example illustrating the problem:

(module foo
  (require bar)
  (provide fooy)
  (define fooy (bary 2)))

(module bar
  (require foo)
  (provide bary)
  (define baray (add1 fooay))
  (define (bary x)
    (+ fooy 1)))

If cyclic dependencies are allowed, I think said modules should probably be required to opt-in and that there should be restrictions on them. For instance, maybe they would only be allowed to define functions structure types at the top level.

rocketnia commented 2 years ago

@slaymaker1907 In the approach I'm proposing here, I actually don't expect that kind of example to work.

Incidentally, I think you meant (add1 fooy) where you wrote (add1 fooay), since fooay doesn't have a definition anywhere. I think you meant (+ x 1) instead of (+ fooy 1), too, since otherwise the definition of fooy accesses fooy before it's defined, assuming strict evaluation. But even if those were patched up, there's a more important reason I don't expect it to work.

In a Racket module, each outermost-level form can perform computations in each phase. They're run in a specific order, from first to last.

If Rhombus were a blank slate, I would recommend having all the declarations in a module expand concurrently, rather than in a particular order. To avoid race conditions, I'd recommend giving them access to only some selected side effects. (This is the direction I've been pursuing with Cene.) Then, the bodies of cyclic modules could also be expanded concurrently with each other in the same way, so they could make arbitrary references to each other that automatically blocked on each other in an intuitive way.

However, Rhombus isn't quite a blank slate. Rhombus's module system should interoperate with Racket's, in which modules are capable of arbitrary side effects. Instead of dodging race conditions by limiting side effects, Racket traditionally seems to dodge them by setting a norm that the compile-time world is single-threaded. (I've considered using concurrency at compile time in Racket in spite of this, since that's the kind of language I want to build, but certain parts of Racket's module system, such as the gensym name counter, use mutation in ways that I'm concerned wouldn't be thread-safe.)

So what I'd like to propose for Rhombus is that the cyclic { A; B; C; D; E } declaration is also a declaration of the order the files should be processed in (A, B, C, D, then E).

This means, to express an example like yours, I'd first try to decide between cyclic { foo; bar } or cyclic { bar; foo }. In some cases, neither order is ideal, and I believe your example is like that. Module bar's instantiation logic needs fooy from module foo to be initialized already, and module foo's instantiation logic needs bary from module bar.

Faced with that catch-22, I would use a workaround. I wouldn't shift any actual code around between files, but I'd put some of it in a function body so I could explicitly control when it evaluated:

#lang hypothetical-lang
; foo.rkt

(cyclic "foo.rkt" "bar.rkt")
(require (only-in "bar.rkt" bary private/fooy))
(provide private/make-fooy (rename-out [private/fooy fooy]))

; Instead of defining `fooy` here, we define a function
;
; We also provide `fooy` from this module, but the actual construction
; of `fooy`'s value happens in bar.rkt.
;
(define (private/make-fooy)
  (bary 2))
#lang hypothetical-lang
; bar.rkt

(cyclic "foo.rkt" "bar.rkt")
(require (only-in "foo.rkt" private/make-fooy))
(provide bary private/fooy)

(define baray (add1 fooy))
(define (bary x)
  (+ fooy 1))

; Now that `bary` has been defined, we construct the value of `fooy`.
(provide private/fooy (private/make-fooy))

This example shows an aspect of the design that I didn't originally dive into describing.

Using require to get things from cycle peers

A certain approach to declaring cyclic modules would just append all the files' syntax together and allow them to see each other's definitions.

But I think it's still good to keep dependencies explicit between files and allow each file to have its own unexported definitions that the others don't see. This would minimize the practical difference between files that are in a cycle and files that aren't, making it easier to refactor between them. So, in the above, I've used explicit requires and provides to communicate the private/make-fooy and private/fooy bindings between the foo and bar modules.

This means require would sometimes get things from another file in the same cycle, even though all such files' contents are compiled together in what's arguably a single module body. My approach to this would be to have each file's contents have a different scope in its set of scopes, and intra-cycle requires would somehow define variables with the current file's scope that were aliases for variables with other files' scopes. I'm not sure there's a really satisfying way to prevent files from requiring things that the other files don't actually provide, but we could take a #%top-like approach where, at the end of a module, if any intra-cycle require ended up not corresponding to an explicitly provided variable, there's a compile-time error.

(Perhaps another extension to the module system would allow bar.rkt to provide private/make-fooy in a way that only foo.rkt was allowed to require. But for this example, I've kept it relatively simple and used private/<...> as a naming convention to deter misuse.)

Cycles between submodules

Another slight quirk showing through in this example is that I changed this example to use files instead of (module ...) forms. I just hadn't considered cycles between things other than files yet, and they're a little awkward: The host depends on all the guests, and since Racket compiles submodules from first to last, I think the host needs to be the one that's declared last.

I suppose that suggests I made the wrong choice to have the host of cyclic { A; B; C; D; E } be the first module, A. Perhaps it should instead be the last module, E, so that submodules can be written in the same order they're listed in their cycle declarations. (I'd also be happy with an approach where the host is explicitly labeled as such in the declaration.)