[cleanup][nexus] Break-up some of the megafiles

Related to https://github.com/oxidecomputer/omicron/pull/1013 , which split up the model.rs file into a directory.

Line count of some of the major files:

   3038 ./external_api/http_entrypoints.rs
   3685 ./db/datastore.rs
   4028 ./nexus.rs

This is subjective, but personally I find smaller files much easier to navigate - crossing the 1000+ LoC threshold makes it easy to introduce murky coupling, and makes it easy to create merge conflicts when working on fundamentally different portions of the stack.

Before we directly go and break up these files, @davepacheco brought up the question: do we want to restructure these things, rather than merely splitting into smaller files by related function?

Source:

Regarding nexus.rs: is it worth rethinking this one? The original idea was to have a sort of business logic layer between the datastore and the HTTP endpoints that could be shared with interfaces besides than the HTTP endpoints. The best case for something like this is the logic around instance_set_runtime(), which is non-trivial, unrelated to the HTTP API itself, and might be used elsewhere like sagas or Nexus background tasks.

At the same time it feels like there's a ton of boilerplate here where this layer isn't adding anything. For the various CRUD functions that are all just (1) use LookupPath to look up a resource and then (2) call a datastore function to do something with that resource, I wonder if we're better off doing that stuff directly from the HTTP entrypoints.

The datastore layer exposes a pretty low-level interface atop the database that attempts to ensure database-level integrity, but doesn't know about higher-level application-level concerns. Examples: the idea of an Actor or ConsoleSessionWithSiloId is specific to the authn subsystem. Maybe the way to break up Nexus is to have more of these subsystems, maybe for different groups of resources, like an "instance" submodule that would contain the equivalent of instance_set_runtime(), etc. This might be clearer than a monolithic layer like nexus.rs currently provides. This is also different from just splitting the existing Nexus functions into different files. There's still a question of what to do for things that really have no specialized logic around them that are just basic CRUD.

I agree on 1000 lines as a rough boundary where files start to feel unwieldy for the murky coupling reason.

Also, apologies to folks if I was too hasty on https://github.com/oxidecomputer/omicron/pull/1013 . Last night, I was adding a model to that class to represent a "Service" model, and I was frustrated because I struggled to identify "does this already exist or not" (and this is me! I contribute a lot to Omicron, I imagine it would only be more difficult for people who don't know the codebase as well)

By having everything in a single file, I had to either (1) guess the name of what it could be, or (2) read through the entire file. Again, this is subjective, but after refactoring I found "reading through the list of files" to be a much easier task.

Responding to some of @davepacheco 's thoughts on nexus.rs vs the HTTP entrypoints:

I personally like the distinction between "HTTP layer", "Database layer" and "Nexus's application (integration)" layer. I think each layer provides something distinct, and by narrowing responsibility, the complexity is reduced.

Maybe the way to break up Nexus is to have more of these subsystems, maybe for different groups of resources, like an "instance" submodule that would contain the equivalent of instance_set_runtime(), etc. This might be clearer than a monolithic layer like nexus.rs currently provides. This is also different from just splitting the existing Nexus functions into different files.

Maybe I'm not understanding, but how is this different from splitting the existing Nexus functions into different files?

Categorization

Looking at nexus.rs's usage of instances, I see the following categories of functions:

"External operation on container of instances (like project)"

https://github.com/oxidecomputer/omicron/blob/c8c7f34f2de7f88a543c74e1e075262374becdee/nexus/src/nexus.rs#L1418-L1433

"External operation directly on instance"

https://github.com/oxidecomputer/omicron/blob/c8c7f34f2de7f88a543c74e1e075262374becdee/nexus/src/nexus.rs#L1582-L1596

"Internal operation directly on instance"

https://github.com/oxidecomputer/omicron/blob/c8c7f34f2de7f88a543c74e1e075262374becdee/nexus/src/nexus.rs#L3289-L3346

"External operation using instance as a container (like attaching disks)"

https://github.com/oxidecomputer/omicron/blob/c8c7f34f2de7f88a543c74e1e075262374becdee/nexus/src/nexus.rs#L1926-L2073

Proposal

It would be intuitive to me to do the following:

Group all "container" operations in the parent. project_list_instances would exist in a projects.rs file. instance_attach_disk would exist in an instance.rs file.
Group all "direct" operations in a file named after the resource. instance_fetch, notify_instance_updated would also exist in instance.rs.

This would approximately result in the following:

nexus/disk.rs
nexus/iam_role.rs
nexus/iam_user_builtin.rs
nexus/image.rs
nexus/instance.rs
nexus/organization.rs
nexus/oximeter.rs
nexus/project.rs
nexus/rack.rs
nexus/session.rs
nexus/silo.rs
nexus/sled.rs
nexus/updates.rs
nexus/vpc.rs
nexus/vpc_router.rs
nexus/vpc_subnet.rs
nexus/zpool.rs

Responding to some of @davepacheco 's thoughts on nexus.rs vs the HTTP entrypoints:

I personally like the distinction between "HTTP layer", "Database layer" and "Nexus's application (integration)" layer. I think each layer provides something distinct, and by narrowing responsibility, the complexity is reduced.

I'd push back on that a little: I think for a lot of the CRUD operations, the "application layer" isn't adding anything at all. If you take a typical "get" endpoint, you've got:

an HTTP GET endpoint that calls a Nexus function
the Nexus function has no other callers
all the Nexus function does is call the lookup API, which accepts the same arguments (as a builder pattern) and returns the same thing as the Nexus function (a db model type)

There's no abstraction or encapsulation happening there. The responsibility of mapping names to resources could just as well belong to the HTTP layer, particularly given how well abstracted the lookup API is today. The application layer could just as well deal in concrete resources (resolved to ids already). And the HTTP layer already knows about the db models. This isn't a big deal but I think it does lead to confusion over what the Nexus layer really is responsible for and it also feels like 60% of the functions in Nexus are this kind of boilerplate.

Some of the "create" and "delete" functions in Nexus do do quite a lot more and they make the perfect case for having an application layer there. Project create and delete is a good example -- that one creates and deletes a lot of things related to a Project. I don't have a specific proposal here but I'm curious of others have also found all the boilerplate tedious and wondered if there's some way we can have a useful application layer where it's wanted without having to put all the simple cases behind a layer that doesn't add anything.

Maybe the way to break up Nexus is to have more of these subsystems, maybe for different groups of resources, like an "instance" submodule that would contain the equivalent of instance_set_runtime(), etc. This might be clearer than a monolithic layer like nexus.rs currently provides. This is also different from just splitting the existing Nexus functions into different files.

Maybe I'm not understanding, but how is this different from splitting the existing Nexus functions into different files?

To be clear, that's definitely an option.

The question I'm asking is whether any of those functions (or groups of functions) are better represented by richer application-level types. The case I have in mind where we've already done this is the authn submodule. You could imagine that if you wanted to know the silo for the current user, we could have a helper function like Nexus::current_silo or something. But I think the authn submodule provides a much more useful abstraction in the form of authn::Context -- it makes sure you've authenticated exactly once and gives you access to the pieces you might need (like the actor id) in a way that's not so easy to misuse as if it were all exposed directly (e.g., if the actor were an Option, it'd be easy to return the wrong status code when you expected it to be there and it wasn't). Similar Nexus helper functions have cropped up a few times in PRs related to silos and that's caused me to think more critically about where that logic belongs. I've generally pushed to move that stuff into the authn crate rather than have impl Nexus itself turn into a bag of random application-level logic.

It might be that most of the functions in impl Nexus today aren't that sort of helper, maybe because we have put such things into first-class subsystems as we did with authn. Most of these functions look like CRUD functions invoked directly by the HTTP layer, and I don't see much use in abstracting those further. (The obvious way to do this would be to have an application-level Instance type with methods to start it, stop it, reboot it, etc. I think this could be much worse than we have today and I don't see much upside -- you can't use it to enforce only valid state transitions because things can always be changed without you knowing it. I think such an abstraction would be really misleading.)

sled_allocate() is an example of something that's not a simple CRUD function -- it's pure application-layer logic, and it may be complicated enough to have its own submodule with types designed to help use it properly.

Last night, I was adding a model to that class to represent a "Service" model, and I was frustrated because I struggled to identify "does this already exist or not" (and this is me! I contribute a lot to Omicron, I imagine it would only be more difficult for people who don't know the codebase as well)

By having everything in a single file, I had to either (1) guess the name of what it could be, or (2) read through the entire file. Again, this is subjective, but after refactoring I found "reading through the list of files" to be a much easier task.

Sorry to hear about the frustration. I'm curious about the workflow question: how does having this in 45 files make it easier to find whether the type exists already? And is this something where fixing rust.docs.corp.oxide.computer (the autogenerated omicron Rustdocs) would help?

There's no abstraction or encapsulation happening there. The responsibility of mapping names to resources could just as well belong to the HTTP layer, particularly given how well abstracted the lookup API is today. The application layer could just as well deal in concrete resources (resolved to ids already). And the HTTP layer already knows about the db models. This isn't a big deal but I think it does lead to confusion over what the Nexus layer really is responsible for and it also feels like 60% of the functions in Nexus are this kind of boilerplate.

I'm split - on the one hand, I do agree that the current implementation is a little boilerplate-y, but even for a case with minimal "application-level" logic, the boundaries of responsibility help me understand "which layer is responsible for what".

For example, if we look at the "GET instances" call (a function with a minimal "application" layer):

HTTP Layer

https://github.com/oxidecomputer/omicron/blob/fe3cf7407eaed9a166b085cacd9cc9270b30c848/nexus/src/external_api/http_entrypoints.rs#L869-L903

Handles parsing of path parameters
Handles instrumentation
Acts as a converter from "db" type to "externally visible" types
Preps access to the right "page" of results by constructing a DataPageParams object

Application Layer

https://github.com/oxidecomputer/omicron/blob/fe3cf7407eaed9a166b085cacd9cc9270b30c848/nexus/src/nexus.rs#L1418-L1433

Looks up the project in the datastore, acquiring the authz::Project
Performs the project_list_instances operation on the datastore

Datastore Layer

https://github.com/oxidecomputer/omicron/blob/fe3cf7407eaed9a166b085cacd9cc9270b30c848/nexus/src/db/datastore.rs#L903-L919

Executes the authz checks
Constructs and issues the right DB query

Takeaways

I think there are some consistencies from the separations here:

The HTTP layer is always the layer that...
- ... performs parsing of paths and bodies
- ... converts from "db" to "external" types
- ... handles instrumentation
- ... constructs the DataPageParams object, if necessary
The application layer is always the layer that...
- ... creates LookupPaths
- ... communicates to external services (e.g., sends requests to sleds)
- ... creates sagas, if necessary
- ... calls into the datastore

IMO, if we started communicating directly to the datastore from the HTTP entrypoints - which we could do - "which area handles what" would become less clear to me. If the HTTP layer is talking directly to the datastore, could it also create sagas? Could it communicate to external services?

Idk. Maybe it wouldn't be so bad to allow the entrypoints to create LookupPaths and make simple calls to the datastore, but I guess I'd want to have a better idea of where we draw the complexity threshold line.

On your authz points, I totally agree - that's better served as a separate module. I think that's a great example of refactoring logic, but even in the CRUD cases in Nexus, I think it would be easier to grok the code if things were more organized by resource.

Here are some examples of the downside of a giant nexus.rs file:

If we try reading it as a logical narrative, we need to get to line 480 before we get to the first CRUD function. Worse, the "operations on each resource" are not next to each other. Between the internal / external API boundaries, container / direct object access, it hasn't been totally clear where to put what!

Here are the functions within the impl Nexus, as it exists today:

new_with_id. Okay, constructor. Makes sense.
wait_for_populate. Sure, seems related to the constructor.
opctx_external_authn. Auth helper, not a resource.
unimplemented_todo. Acts as a sentinel endpoint.
upsert_sled. Our first resource, from an internal API.
upsert_zpool. A different resource, here because it's part of the internal API.
upsert_dataset. Another different resource.
upsert_oximeter_collector. Another different resource.
register_as_producer. This is actually an internal helper function - not related to a resource.
build_oximeter_client. Helper function, used exclusively in oximeter-related functions in this file. Note: there are other functions to return oximeter clients ~3000 lines away, in next_collector.
oximeter_list: This looks like it could be related to an HTTP entrypoint, but I don't see it being used.
datastore: This is a getter function.
execute_saga: A helper function for Nexus
silo_create: A bunch of "Silo" functions...
silo_fetch: ...
silos_list_by_name: Note, this is on line 725...
(NOTE: idk which function this is, but it's much later) silo_user_fetch: This is on line 3803 - it's over 3000 lines away from the other silo functions!!! I understand that it fetches a user, but this distance makes it incredibly difficult to see "what does a silo contain".

This is a big reason why I'd advocate for "split up nexus.rs by resource" - to make helper functions more locally scoped, to make it clear what resources actually exist, and to make "Nexus utility functions" more distinct from everything else.

Saga execution is another good example - like authz - that isn't specific to a single resource, and can benefit from separate submodules. But by interleaving the saga helpers in nexus.rs with operations that do act on resources, it's actually fairly difficult to see where the saga-handling logic lives. For example, Q: is there saga handling logic in Nexus after execute_saga, on line 644? A: Yes! There are methods to get sagas on line ~3200.

I think your summary of what the layers currently do is accurate. I think it's also somewhat arbitrary. I'm not sure why moving one responsibility would call into question which layers are responsible for other responsibilities. Anyway, if nobody else is bothered by the boilerplate, I'm fine with dropping this because I don't have a great idea. I was just hoping to brainstorm a bit because the boilerplate seems bad to me.

On the more specific question about creating LookupPaths: I'm not convinced that belongs in the application layer. Consider when we go implement #485. If we continue with the current pattern, we'd create a second version of all of the Nexus functions that accepts an id instead of a sequence of names. The next natural step is to instead accept an enum of Id or PathOfNames. But that's exactly what lookup::Foo is. So I think it'd be much better to move the LookupPath creation into the HTTP layer and change the Nexus functions to accept lookup::Foo (as in, lookup::Organization). This example really highlights for me that the HTTP layer's job includes figuring out which application resource is meant by a URL path, not just syntactically parsing the name components out of the URL path.

Here are some examples of the downside of a giant nexus.rs file:

Yes, agreed. I think there's some confusion. I wasn't arguing that it's well-organized right now or that it shouldn't be split up. I was asking whether there are useful subsystems to be created out of the functions that are there, rather than partitioning the existing functions mechanically into files according to the resource they appear to operate on.

I'm curious about the workflow question: how does having this in 45 files make it easier to find whether the type exists already? And is this something where fixing rust.docs.corp.oxide.computer (the autogenerated omicron Rustdocs) would help?

I think using ls, seeing 45 files, and knowing that each corresponds to a resource I may want makes it much easier to find a resource when compared to parsing a ~3000 LoC file. It also helps me quickly skim over the "only-relevant-in-the-context-of-other-resource" objects, like the ProjectUpdate struct in the context of a Project.

Although I think fixing the rustdocs could help, I don't think it needs to be a mutually exclusive solution.

I think your summary of what the layers currently do is accurate. I think it's also somewhat arbitrary. I'm not sure why moving one responsibility would call into question which layers are responsible for other responsibilities.

I basically think it's just a matter of uniformity. I don't feel super strongly about this, but I think it's basically a "pros of less code" vs "pros of predictability" question. I definitely think either could work.

Your example with LookupPaths may be a good justification for modifying that layering. I know you've been removing some of the footguns related to datastore access without valid authz; maybe now it makes sense to be more permissible with access.

Yes, agreed. I think there's some confusion. I wasn't arguing that it's well-organized right now or that it shouldn't be split up. I was asking whether there are useful subsystems to be created out of the functions that are there, rather than partitioning the existing functions mechanically into files according to the resource they appear to operate on.

Gotcha - I think my opinion is that "if we pull the existing functions out by resource, it'll be easier to see what's left".

That could be doable like I did with nexus/src/db/model - if we had a nexus/src/app, I could move individual resources to a spot like nexus/src/app/instance.rs, and leave the remainder of nexus.rs in nexus/src/app/mod.rs, as a start? We could continue to pull out subsystems from that, if clear lines can be drawn?

oxidecomputer / omicron