Modern rpc prototype - Githubissues

mathemancer commented 3 months ago

Fixes #3519 Fixes #3518

Summary

This adds a new RPC endpoint, available at /rpc/. At the moment, the endpoint is available only when logged in, and the functions are available only for superusers. The functions are:

connections.create_from_known_connection, and
connections.create_from_scratch.

Technical details

The functions are documented via auto-generated docs (created from their docstrings) in the new documentation page at /user-guide/rpc-functions/.
I elected to go against my previous plan and use django-modern-rpc.
- This seems to be more actively maintained, and
- This integrates with Django auth.
Unfortunately, this meant the auto-generated docs were not good enough. So, I decided to use the mkdocs plugin to include RPC function docs in the docs site instead.
This PR disables the check to prevent type hints!!!
- This was necessary to get the documentation generated automatically.
- Another option is to use the Sphinx (rather than Google) docstring style for the RPC functions, but I find the Google docstring style much more readable. However, the Sphinx style is better for documenting types without type hinting.

Additional context

RPC transition plan

Checklist

- [X] My pull request has a descriptive title (not a vague title like `Update index.md`). - [X] My pull request targets the `develop` branch of the repository - [X] My commit messages follow [best practices][best_practices]. - [X] My code follows the established code style of the repository. - [ ] I added tests for the changes I made (if applicable). - [X] I added or updated documentation (if applicable). - [X] I tried running the project locally and verified that there are no visible errors. [best_practices]:https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53 ## Developer Certificate of Origin

Developer Certificate of Origin

``` Developer Certificate of Origin Version 1.1 Copyright (C) 2004, 2006 The Linux Foundation and its contributors. 1 Letterman Drive Suite D4700 San Francisco, CA, 94129 Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Developer's Certificate of Origin 1.1 By making a contribution to this project, I certify that: (a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it. (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved. ```

mathemancer commented 3 months ago

@Anish9901 if you're far enough along on figuring out testing for these, go ahead and add a couple tests. Otherwise, I think we should merge this untested (I did check manually to make sure things were working).

@seancolsen @pavish Take a look and see what you think of:

The style of the docs
The error responses when parameters are missing
The error responses when you call the functions in some illegal manner (e.g., violating a uniqueness constraint on DB keys)

@kgodey It might be worth a glance from you, since this is the initial PR that will define how the RPC endpoint looks and works.

mathemancer commented 3 months ago

This is ready to review, but I'm leaving it as a draft until @Anish9901 replies with an opinion on whether to add tests into this PR.

Anish9901 commented 3 months ago

I'll add some tests.

seancolsen commented 3 months ago

Also, I made a couple edits to the wiki to bring the RPC transition plan into conformance with this PR (e.g. django-modern-rpc and dots in method names).

mathemancer commented 3 months ago

Thank you for this thoughtful feedback, @seancolsen !

You're correct that dependencies between function parameters is a smell. As I read your thoughts, I realized that the connection_type parameter adds precisely zero information, and can only ever produce an error. We should remove it, and just accept an optional connection_id.

As for your broader point: I typically prefer primitive types for "public" function arguments, and prefer them even more strongly in this case where there's a translation between two languages. I'm using "primitive" a bit loosely here to mean any non-composite type. My reasoning is that

Primitives translate more cleanly in general (for example, the datetime types are different for different languages.
Primitives have the advantage of being well-understood and self-documenting when used in a function signature. For example, there's pretty much no question about what an int parameter will accept as valid input, but there are many potential questions about what a Connection parameter will accept as valid input.
Primitives have the advantage that exceptions produced will be more likely to be automatically useful (e.g., forgetting the connection_id when submitting a connection_type should produce a KeyError: "connection_id" error without too much (maybe any) custom handling, and could even be upgraded to
```
TypeError: connection_id keyword arg is required when connection_type="user_database".
```
if we wanted custom handling. If it's nested within an object, we have to then provide some nesting in our error response:
```
TypeError: the connection keyword arg requires a connection_id attribute whenever its connection_type is "user_database".
```
The more nested the object in question is, the worse that problem gets.

With that said, I agree that we'll likely have to handle some kind of nesting if polymorphism is required. In a different language, I'd probably use multiple dispatch in the API-facing functions to sort that whenever possible, but oh well. Another option (even in python) is to simply have differently named methods for different combinations of arguments. Of the concerns above, the self-documenting part is what I'm most loathe to give up.

The most self-documenting solution I can think of is to add data classes representing any nested object to the same module as the function(s) that accept(s) it, and use type hinting to specify that a given argument should be of the type of that data class. The class woudld be picked up by the mkdocs plugin, and even cross-referenced with the type hint. However, that type hint would be a lie! I.e., the actual type received via API call would always be a dict! So, we'd have to then add some branching logic to handle the possible inputs, and document how to interpret the documentation of those data classes. We could try to create a custom __init__, but I expect that would confuse the auto-generated documentation.

A less self-documenting, but in some ways cleaner approach would be to specify that the argument is a dict via type hinting (and so in the docs), and then specify details about the form of the dict in the docstring of the API-facing function.

I suppose I'm just trying to avoid the nesting problem whenever possible. Whenever absolutely necessary, I prefer the latter solution.

Regarding the return type: I tend to prefer returning primitives when possible for most of the same reasons (translation and self-documenting). With that said, I don't have a fundamental problem with returning a flatish object from functions in this context. However, that still leaves the issue of making the function self-documenting. I think the same options mentioned above apply, but I want to tinker with that a bit. I do see your point about non-breaking API changes...it's a good one. With that said, I do think that if a public function is modified to return a different type, it should just be a new function instead.

mathemancer commented 3 months ago

Also, I made a couple edits to the wiki to bring the RPC transition plan into conformance with this PR (e.g. django-modern-rpc and dots in method names).

Thank you for that, and for improving the documentation structure in this PR!

mathemancer commented 3 months ago

@seancolsen As an example (don't merge yet), I threw together a class to handle returning an object and documenting the returned attributes automatically. Not sure I like this, though.

seancolsen commented 3 months ago

General

I'm just trying to avoid the nesting problem whenever possible

Yeah, this was my hunch. Understanding your perspective is helpful here. I think that as the dust settles around the RPC API this is the sort of thing we ought to document in some kind of standard.
We should remove the connection_type parameter, and just accept an optional connection_id.

That sounds like the best path forward for this PR.
we'll likely have to handle some kind of nesting if polymorphism is required

Right. For now though, I say let's just kick this can down the road and deal with it when/if we have a more pressing use case.

I thought this PR would be a good opportunity to take a stab at it and that maybe it would be straightforward to add some nesting.

When/if we do have a more compelling use case for nesting, I have a lot more thoughts in response to your points. But it's probably not worth getting into the weeds now though if we don't really need to.
Another option is to have differently named methods for different combinations of arguments

This occurred to me as well, and I'm open to it. I can see how this approach could very well end up being our solution in some other cases.

Extensibility via new functions

if a public function is modified to return a different type, it should just be a new function instead.

This approach worries me.

To be clear: We're talking about a hypothetical future where we want to change a return value that's already implemented. We don't have that situation yet, but I still think we should be on the same page about our mindset towards extensibility because it affects the design choices we're making now.

If we someday end up with two API functions that take exactly the same arguments and return ever so slightly different return values, that will be a recipe for confusion on the client side. Sure, we could mark one as deprecated. But even still, in many cases it would difficult to create distinct names for the two functions without introducing mess into the API design. I can see how substantial differences in behavior or function signature could warrant the addition of a new function. But I think my bar for introducing a new function would be quite higher than yours.

I'm trying to be very intentional about approaching this RPC API design from a holistic perspective. That's why I've been working to organize and standardize all the function names ahead of time as much as we reasonably can. This way they all make sense together. To me — and I'm speaking from the perspective of an API consumer — there is a very high value to having a cohesive set of API functions which are easy to understand and reason about when taken in conjunction with one another. I don't want to have to think too hard about which function to use in a given circumstance. And for any new devs or third party API consumers (who won't understand the evolution of our API), this goal becomes an even higher priority.

Return types

You said:

I tend to prefer returning primitives when possible for most of the same reasons (translation and self-documenting).

I'd like to better understand your opinion on this point.

We still need to figure this out for this PR because it will continue to block my front end implementation.

Let's take create_from_known_connection as an example. I see three options for our return type:

(A)
```
42
```
(B)
```
{
  "id": 42
}
```

(C)

{
  "id": 42,
  "nickname": "apple",
  "database": "apple",
  "supported_types_url": "…",
  "username": "mathesar",
  "host": "mathesar_dev_db",
  "port": 5432
}

You seem to want (A). My problems with (A) are:

It's not extensible. That's why I prefer (B) in the general sense.
In this particular case, it's incompatible with the REST API response that we're replacing and thus will increase implementation time. That's why I prefer (C) in this particular case.

So I want (C). Can you tell me what, specifically, is problematic about (C)? Can you give an example?

seancolsen commented 3 months ago

I threw together a class to handle returning an object and documenting the returned attributes automatically

I looked at the docs that this generates. Cool! This works for me!

Not sure I like this, though.

I'd like to hear why.

mathemancer commented 3 months ago

I threw together a class to handle returning an object and documenting the returned attributes automatically

I looked at the docs that this generates. Cool! This works for me!

Not sure I like this, though.

I'd like to hear why.

The thing I don't like is that to get the documentation to work, I had to add a fallacious type hint for the return value. I.e., it doesn't actually return a DBModelReturn, it returns a dict with keys matching the attributes of that class. This means the docs will be inaccurate for any user who is calling the function from Python, rather than via the API. It will also cause linters and so on to throw warnings or errors everywhere if we ever use the return value of the function in a call from Python.

seancolsen commented 3 months ago

@mathemancer

What about something like this?

from typing import TypedDict

# ...

def filter_dict(d, cls):
    return cls(**{key: d[key] for key in cls.__annotations__ if key in d})

class DBModelReturn(TypedDict):
    """
    Information about a database model.

    Attributes:
        id (int): The Django id of the Database object created.
        name (str): Used to identify the created connection.
        db_name (str): The name of the database on the server.
        username (str): The username of the role for the connection.
        host (str): The hostname or IP address of the Postgres server.
        port (int): The port of the Postgres server.
    """

    id: int
    name: str
    db_name: str
    username: str
    host: str
    port: int

Then as a return value:

filter_dict(db_model.__dict__, DBModelReturn)

mathemancer commented 3 months ago

@mathemancer

What about something like this?

...

Neat! I wasn't aware of TypedDict. I pushed a change that used a similar concept, and I think we'll probably both be happy with the result. I also made the "create" -> "add" change. I'm fine with that verb, as long as we're consistent.

mathemancer commented 3 months ago

It occurs to me that returning the id field won't make much sense once we're constructing connections out of parts of different models under the hood. @seancolsen Are you actually using that? What would you actually need returned for:

Displaying to the user, and
Expected next requests?

I think the current return object is okay as a proof-of-concept, but I think we should avoid just returning what the REST API returns, since these functions won't actually be managing a single resource in the near future.

mathemancer commented 3 months ago

Since we're returning the port as an int, it might be nice to accept it that way too, i.e. by changing the port request param within add_from_scratch from str to int.

Good catch.

In DBModelReturn the property names within the docstring do not match the property names defined on the class. ~I don't have a preference between these different names~ (edit 👇), but they should be consistent. I would be nice to make sure they're all consistent with the request too.

Ah. Let this be a lesson about being clever. We can't (and probably shouldn't) automatically pick up the API-expected names for fields from the underlying model attributes.

(edit) Actually I do have a slight preference — I'd like to use nickname (instead of name) and database (instead of db_name) because that's what the REST API uses and it would be nice to minimize changes over the REST API unless we have a good reason to do so.

I'm fine with that, but see my other comment. We should double-check that the returned object's attributes all make sense with the new models and API.

mathemancer commented 3 months ago

@mathemancer

I love this! The auto-generated docs look great!

I have two comments:

I'd like to improve how code within error responses work. For a request to connections.add_from_known_connection, we get the same code for all errors thrown by the db & for connection errors to the db. ...

These are two different db errors and I'd like a way for the frontend to be able to differentiate between them. Currently, both return the same value for code. I'm fine with having a separate property to differentiate them and code being the same.

Agreed. I haven't gotten around to adding custom error codes yet, but plan to do so. I wanted to just get it so we're at least returning JSON even when an error happens, then get more precise about the codes. I'll add codes for the two errors you mentioned as an initial example. I figure we should consider custom codes for various errors to be feature requests (beyond the basics, anyway). Does that make sense to you?

I'd like the error message structure to be documented in the docs (not individually for all requests, a common block on the page would be cool).

Great idea; will do.

seancolsen commented 2 months ago

@mathemancer yes, the front end needs the connection id field when the user edits or deletes a connection. Your comment sort of opens up a pandora's box of thoughts... It's is a slippery slope into a territory where we're no longer talking strictly about the API-layer transition, but instead talking about the larger architectural transition, including the new models. I can appreciate that API and architecture are inextricably intertwined, but so far I've been approaching this PR as if it's somehow isolated — operating only within the API layer as much as possible. If we're aiming to make the new connections RPC APIs in this PR work well with the new models, then I'll potentially have much more critique. For example, when all the dust settles, I don't expect we'll have a concept of a "connection" anywhere in the stack. In my opinion the very premise of this API is flawed if we're to view it in the context of the new models.

This is making me think we might need to go into greater detail about the smaller steps we intend to take over the course of this whole transition.

Alternatively, we could consider kicking off the RPC transition by picking a sort of "leaf" API instead of a "trunk" API. For example /api/ui/v0/users/{userId}/password_reset/ seems to be far less intertwined with the architectural transition. We could take that endpoint (and ones like it) in isolation a bit more easily.

kgodey commented 2 months ago

I'm considering my review complete.

mathemancer commented 2 months ago

@mathemancer yes, the front end needs the connection id field when the user edits or deletes a connection. Your comment sort of opens up a pandora's box of thoughts... It's is a slippery slope into a territory where we're no longer talking strictly about the API-layer transition, but instead talking about the larger architectural transition, including the new models. I can appreciate that API and architecture are inextricably intertwined, but so far I've been approaching this PR as if it's somehow isolated — operating only within the API layer as much as possible. If we're aiming to make the new connections RPC APIs in this PR work well with the new models, then I'll potentially have much more critique. For example, when all the dust settles, I don't expect we'll have a concept of a "connection" anywhere in the stack. In my opinion the very premise of this API is flawed if we're to view it in the context of the new models.

This is making me think we might need to go into greater detail about the smaller steps we intend to take over the course of this whole transition.

Alternatively, we could consider kicking off the RPC transition by picking a sort of "leaf" API instead of a "trunk" API. For example /api/ui/v0/users/{userId}/password_reset/ seems to be far less intertwined with the architectural transition. We could take that endpoint (and ones like it) in isolation a bit more easily.

We've already gotten most of the juice I was hoping for out of this PR. I want to:

make sure we're aligned on the overall concept now that we have an example,
iron out details of the call and return style in the abstract (I think we've done that),
iron out details of the error response style (still working on that), and
implement the main scaffolding needed to make these functions work under the hood.

I suppose my question about the id was inspired by the hope that there might be low hanging fruit (i.e., just don't return the id) that we could knock out now. I think for now, we should just return the object as-is (with the attribute name changes you requested).

A broader question we should talk about is our philosophy w.r.t. return objects composed of pieces of various models. One advantage of the RPC style is that we can return such composite objects easily. I've pondered my thoughts about this more, and I think that I'm friendlier to nesting and complex objects in the output from than the input to the RPC functions.

mathemancer commented 2 months ago

I think this is ready for re-review, all current feedback is handled ( i hope )

mathemancer commented 2 months ago

@mathemancer

In both methods, I see db_name in the request but database in response. I'd like these to be consistent. Can we change the request to use database?

Sure, that's fine with me. I just left it that way since I figured it would make the front end changes more minimal. As I type this, though, it seems like it's unlikely to produce a bunch of work for y'all to make it more consistent.

I'm not sure what the purpose of the get_error_code method is. I tried playing with it to see if I could understand it better, but I wasn't able to get it to work. I tried sending: ...

Argh. I see now that my attempts to document the error handling have produced more confusion than clarity. The get_error_code function isn't intended to be called from the RPC endpoint. I added it to the documentation simply to provide the error code schema (I.e., which numbers go with which libraries, etc.). I'll try something different.

mathesar-foundation / mathesar

Modern rpc prototype #3524

Summary

Technical details

Additional context

Checklist

General

Extensibility via new functions

Return types