Open smklein opened 2 years ago
I agree on 1000 lines as a rough boundary where files start to feel unwieldy for the murky coupling reason.
Also, apologies to folks if I was too hasty on https://github.com/oxidecomputer/omicron/pull/1013 . Last night, I was adding a model to that class to represent a "Service" model, and I was frustrated because I struggled to identify "does this already exist or not" (and this is me! I contribute a lot to Omicron, I imagine it would only be more difficult for people who don't know the codebase as well)
By having everything in a single file, I had to either (1) guess the name of what it could be, or (2) read through the entire file. Again, this is subjective, but after refactoring I found "reading through the list of files" to be a much easier task.
Responding to some of @davepacheco 's thoughts on nexus.rs
vs the HTTP entrypoints:
I personally like the distinction between "HTTP layer", "Database layer" and "Nexus's application (integration)" layer. I think each layer provides something distinct, and by narrowing responsibility, the complexity is reduced.
Maybe the way to break up Nexus is to have more of these subsystems, maybe for different groups of resources, like an "instance" submodule that would contain the equivalent of instance_set_runtime(), etc. This might be clearer than a monolithic layer like nexus.rs currently provides. This is also different from just splitting the existing Nexus functions into different files.
Maybe I'm not understanding, but how is this different from splitting the existing Nexus functions into different files?
Looking at nexus.rs
's usage of instances, I see the following categories of functions:
It would be intuitive to me to do the following:
project_list_instances
would exist in a projects.rs
file. instance_attach_disk
would exist in an instance.rs
file.instance_fetch
, notify_instance_updated
would also exist in instance.rs
.This would approximately result in the following:
nexus/disk.rs
nexus/iam_role.rs
nexus/iam_user_builtin.rs
nexus/image.rs
nexus/instance.rs
nexus/organization.rs
nexus/oximeter.rs
nexus/project.rs
nexus/rack.rs
nexus/session.rs
nexus/silo.rs
nexus/sled.rs
nexus/updates.rs
nexus/vpc.rs
nexus/vpc_router.rs
nexus/vpc_subnet.rs
nexus/zpool.rs
Responding to some of @davepacheco 's thoughts on
nexus.rs
vs the HTTP entrypoints:I personally like the distinction between "HTTP layer", "Database layer" and "Nexus's application (integration)" layer. I think each layer provides something distinct, and by narrowing responsibility, the complexity is reduced.
I'd push back on that a little: I think for a lot of the CRUD operations, the "application layer" isn't adding anything at all. If you take a typical "get" endpoint, you've got:
There's no abstraction or encapsulation happening there. The responsibility of mapping names to resources could just as well belong to the HTTP layer, particularly given how well abstracted the lookup API is today. The application layer could just as well deal in concrete resources (resolved to ids already). And the HTTP layer already knows about the db models. This isn't a big deal but I think it does lead to confusion over what the Nexus layer really is responsible for and it also feels like 60% of the functions in Nexus are this kind of boilerplate.
Some of the "create" and "delete" functions in Nexus do do quite a lot more and they make the perfect case for having an application layer there. Project create and delete is a good example -- that one creates and deletes a lot of things related to a Project. I don't have a specific proposal here but I'm curious of others have also found all the boilerplate tedious and wondered if there's some way we can have a useful application layer where it's wanted without having to put all the simple cases behind a layer that doesn't add anything.
Maybe the way to break up Nexus is to have more of these subsystems, maybe for different groups of resources, like an "instance" submodule that would contain the equivalent of instance_set_runtime(), etc. This might be clearer than a monolithic layer like nexus.rs currently provides. This is also different from just splitting the existing Nexus functions into different files.
Maybe I'm not understanding, but how is this different from splitting the existing Nexus functions into different files?
To be clear, that's definitely an option.
The question I'm asking is whether any of those functions (or groups of functions) are better represented by richer application-level types. The case I have in mind where we've already done this is the authn
submodule. You could imagine that if you wanted to know the silo for the current user, we could have a helper function like Nexus::current_silo
or something. But I think the authn
submodule provides a much more useful abstraction in the form of authn::Context
-- it makes sure you've authenticated exactly once and gives you access to the pieces you might need (like the actor id) in a way that's not so easy to misuse as if it were all exposed directly (e.g., if the actor were an Option
, it'd be easy to return the wrong status code when you expected it to be there and it wasn't). Similar Nexus helper functions have cropped up a few times in PRs related to silos and that's caused me to think more critically about where that logic belongs. I've generally pushed to move that stuff into the authn
crate rather than have impl Nexus
itself turn into a bag of random application-level logic.
It might be that most of the functions in impl Nexus
today aren't that sort of helper, maybe because we have put such things into first-class subsystems as we did with authn
. Most of these functions look like CRUD functions invoked directly by the HTTP layer, and I don't see much use in abstracting those further. (The obvious way to do this would be to have an application-level Instance
type with methods to start it, stop it, reboot it, etc. I think this could be much worse than we have today and I don't see much upside -- you can't use it to enforce only valid state transitions because things can always be changed without you knowing it. I think such an abstraction would be really misleading.)
sled_allocate()
is an example of something that's not a simple CRUD function -- it's pure application-layer logic, and it may be complicated enough to have its own submodule with types designed to help use it properly.
Last night, I was adding a model to that class to represent a "Service" model, and I was frustrated because I struggled to identify "does this already exist or not" (and this is me! I contribute a lot to Omicron, I imagine it would only be more difficult for people who don't know the codebase as well)
By having everything in a single file, I had to either (1) guess the name of what it could be, or (2) read through the entire file. Again, this is subjective, but after refactoring I found "reading through the list of files" to be a much easier task.
Sorry to hear about the frustration. I'm curious about the workflow question: how does having this in 45 files make it easier to find whether the type exists already? And is this something where fixing rust.docs.corp.oxide.computer (the autogenerated omicron Rustdocs) would help?
There's no abstraction or encapsulation happening there. The responsibility of mapping names to resources could just as well belong to the HTTP layer, particularly given how well abstracted the lookup API is today. The application layer could just as well deal in concrete resources (resolved to ids already). And the HTTP layer already knows about the db models. This isn't a big deal but I think it does lead to confusion over what the Nexus layer really is responsible for and it also feels like 60% of the functions in Nexus are this kind of boilerplate.
I'm split - on the one hand, I do agree that the current implementation is a little boilerplate-y, but even for a case with minimal "application-level" logic, the boundaries of responsibility help me understand "which layer is responsible for what".
For example, if we look at the "GET instances" call (a function with a minimal "application" layer):
DataPageParams
objectauthz::Project
project_list_instances
operation on the datastoreI think there are some consistencies from the separations here:
The HTTP layer is always the layer that...
DataPageParams
object, if necessaryThe application layer is always the layer that...
LookupPaths
IMO, if we started communicating directly to the datastore from the HTTP entrypoints - which we could do - "which area handles what" would become less clear to me. If the HTTP layer is talking directly to the datastore, could it also create sagas? Could it communicate to external services?
Idk. Maybe it wouldn't be so bad to allow the entrypoints to create LookupPaths
and make simple calls to the datastore, but I guess I'd want to have a better idea of where we draw the complexity threshold line.
On your authz points, I totally agree - that's better served as a separate module. I think that's a great example of refactoring logic, but even in the CRUD cases in Nexus, I think it would be easier to grok the code if things were more organized by resource.
Here are some examples of the downside of a giant nexus.rs file:
If we try reading it as a logical narrative, we need to get to line 480 before we get to the first CRUD function. Worse, the "operations on each resource" are not next to each other. Between the internal / external API boundaries, container / direct object access, it hasn't been totally clear where to put what!
Here are the functions within the impl Nexus
, as it exists today:
new_with_id
. Okay, constructor. Makes sense.wait_for_populate
. Sure, seems related to the constructor.opctx_external_authn
. Auth helper, not a resource.unimplemented_todo
. Acts as a sentinel endpoint.upsert_sled
. Our first resource, from an internal API.upsert_zpool
. A different resource, here because it's part of the internal API.upsert_dataset
. Another different resource.upsert_oximeter_collector
. Another different resource.register_as_producer
. This is actually an internal helper function - not related to a resource.build_oximeter_client
. Helper function, used exclusively in oximeter-related functions in this file. Note: there are other functions to return oximeter clients ~3000 lines away, in next_collector
.oximeter_list
: This looks like it could be related to an HTTP entrypoint, but I don't see it being used.datastore
: This is a getter function.execute_saga
: A helper function for Nexussilo_create
: A bunch of "Silo" functions...silo_fetch
: ...silos_list_by_name
: Note, this is on line 725...silo_user_fetch
: This is on line 3803 - it's over 3000 lines away from the other silo functions!!! I understand that it fetches a user, but this distance makes it incredibly difficult to see "what does a silo contain".This is a big reason why I'd advocate for "split up nexus.rs
by resource" - to make helper functions more locally scoped, to make it clear what resources actually exist, and to make "Nexus utility functions" more distinct from everything else.
Saga execution is another good example - like authz
- that isn't specific to a single resource, and can benefit from separate submodules. But by interleaving the saga helpers in nexus.rs
with operations that do act on resources, it's actually fairly difficult to see where the saga-handling logic lives. For example, Q: is there saga handling logic in Nexus after execute_saga
, on line 644? A: Yes! There are methods to get sagas on line ~3200.
I think your summary of what the layers currently do is accurate. I think it's also somewhat arbitrary. I'm not sure why moving one responsibility would call into question which layers are responsible for other responsibilities. Anyway, if nobody else is bothered by the boilerplate, I'm fine with dropping this because I don't have a great idea. I was just hoping to brainstorm a bit because the boilerplate seems bad to me.
On the more specific question about creating LookupPaths: I'm not convinced that belongs in the application layer. Consider when we go implement #485. If we continue with the current pattern, we'd create a second version of all of the Nexus functions that accepts an id instead of a sequence of names. The next natural step is to instead accept an enum of Id
or PathOfNames
. But that's exactly what lookup::Foo
is. So I think it'd be much better to move the LookupPath
creation into the HTTP layer and change the Nexus functions to accept lookup::Foo
(as in, lookup::Organization
). This example really highlights for me that the HTTP layer's job includes figuring out which application resource is meant by a URL path, not just syntactically parsing the name components out of the URL path.
Here are some examples of the downside of a giant nexus.rs file:
Yes, agreed. I think there's some confusion. I wasn't arguing that it's well-organized right now or that it shouldn't be split up. I was asking whether there are useful subsystems to be created out of the functions that are there, rather than partitioning the existing functions mechanically into files according to the resource they appear to operate on.
I'm curious about the workflow question: how does having this in 45 files make it easier to find whether the type exists already? And is this something where fixing rust.docs.corp.oxide.computer (the autogenerated omicron Rustdocs) would help?
I think using ls
, seeing 45 files, and knowing that each corresponds to a resource I may want makes it much easier to find a resource when compared to parsing a ~3000 LoC file. It also helps me quickly skim over the "only-relevant-in-the-context-of-other-resource" objects, like the ProjectUpdate
struct in the context of a Project
.
Although I think fixing the rustdocs could help, I don't think it needs to be a mutually exclusive solution.
I think your summary of what the layers currently do is accurate. I think it's also somewhat arbitrary. I'm not sure why moving one responsibility would call into question which layers are responsible for other responsibilities.
I basically think it's just a matter of uniformity. I don't feel super strongly about this, but I think it's basically a "pros of less code" vs "pros of predictability" question. I definitely think either could work.
Your example with LookupPaths may be a good justification for modifying that layering. I know you've been removing some of the footguns related to datastore access without valid authz; maybe now it makes sense to be more permissible with access.
Yes, agreed. I think there's some confusion. I wasn't arguing that it's well-organized right now or that it shouldn't be split up. I was asking whether there are useful subsystems to be created out of the functions that are there, rather than partitioning the existing functions mechanically into files according to the resource they appear to operate on.
Gotcha - I think my opinion is that "if we pull the existing functions out by resource, it'll be easier to see what's left".
That could be doable like I did with nexus/src/db/model
- if we had a nexus/src/app
, I could move individual resources to a spot like nexus/src/app/instance.rs
, and leave the remainder of nexus.rs
in nexus/src/app/mod.rs
, as a start? We could continue to pull out subsystems from that, if clear lines can be drawn?
Related to https://github.com/oxidecomputer/omicron/pull/1013 , which split up the
model.rs
file into a directory.Line count of some of the major files:
This is subjective, but personally I find smaller files much easier to navigate - crossing the 1000+ LoC threshold makes it easy to introduce murky coupling, and makes it easy to create merge conflicts when working on fundamentally different portions of the stack.
Before we directly go and break up these files, @davepacheco brought up the question: do we want to restructure these things, rather than merely splitting into smaller files by related function?
Source: