radiasoft / pykern


pykern.quest simple form #205

Closed: robnagler closed this issue 1 year ago

robnagler commented 2 years ago

A Quest is "a long or arduous search for something." It is also a Pythonesque word, so it is appropriate. The reason it is called Quest is that it is longer than a simple function call: it must validate its inputs, e.g. an HTTP request, a command line execution (pkcli), a WebSocket message, etc.

There are several classes in pykern.quest:

Sirepo will subclass these, e.g. sirepo.api.Base is a subclass of pykern.quest.Call, which is currently used as follows:

    @api_perm.require_user
    def api_deleteSimulation(self):
        req = self.parse_post(id=True)
        simulation_db.delete_simulation(req.type, req.id)
        return self.reply_ok()

This would turn into:

    @sirepo.api.Spec('require_user', id='SimId')
    def api_deleteSimulation(self):
        simulation_db.delete_simulation(self.args.type, self.args.id)
        self.done(status='ok')

self.result(status='ok') would be shortened to self.result_ok().

In the above example, sirepo.api.Spec always adds type='SimType' and applies the @api_perm.require_user. The value auth_uid is also in args. All values in args are validated, and collisions are managed safely, e.g. if a caller tries to supply auth_uid, it will be rejected, since that type of arg is special (the structures for this have to be defined clearly).

args will be parsed from whatever inputs make sense for the way the Call is made. If it is an API request and there is a URI with parameters, those values will be extracted, validated, and added to args. If there is a POST or PUT, those values will be validated and added to args. This is already being done to a degree with parse_post, but this formalizes all inputs.

The types of the inputs are found in pykern.pktype. For Sirepo, sirepo.pktype will define additional types. The type interface has yet to be defined, but that is part of this project.
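To make the shape concrete, here is a minimal sketch of what a Spec-style decorator could do. It is not the pykern.quest interface; check_permission, the validator table, and the raw_args parameter are all illustrative assumptions.

    import functools

    _VALIDATORS = {
        # placeholder validators; the real types would live in pktype
        'SimId': str,
        'SimType': str,
    }

    def Spec(perm, **arg_types):
        """Declare the permission and the typed args an API method accepts."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(self, raw_args):
                # hypothetical permission check; rejects callers without perm
                self.check_permission(perm)
                # validate and convert every declared arg; undeclared args never reach self.args
                self.args = {
                    name: _VALIDATORS[typ](raw_args[name])
                    for name, typ in arg_types.items()
                }
                return func(self)
            return wrapper
        return decorator

Usage would mirror the api_deleteSimulation example above, e.g. @Spec('require_user', id='SimId').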

e-carlin commented 2 years ago

Sounds good. A few thoughts:

Cargill says "why" is an important question. I'll try to answer that on my own, but I'd like your feedback to make sure we're on the same page. The api_perm part is more or less the same as what we are doing now. The new part is moving parse_post (or the lack thereof) into a defined and always included location. That's a good why. With all of the jupyter/oauth stuff, it became clear to me that leaving the call to parse_post up to the developer creating the API is dangerous. Are there other whys you think are important?

I'm somewhat concerned with using id=xxx type=yyy. It doesn't seem flexible enough. For example, with api_runMulti we want to validate all of those things, but the payload is a list of objects that we also want to validate. What about using some structure like JSON Schema that we can use to validate the payloads? It is less specific to our domain but more flexible. Not sure if that trade-off is good. Another thought with using schemas (although it could probably be done with kwargs too) is that we could make them deny by default. If a schema doesn't contain a field in a payload, that field is stripped and we don't get access to it in self.args. But maybe strict type checking like that is a bad fit for Python and our blobs of data.
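A minimal sketch of the deny-by-default idea, assuming a plain dict payload and a declared-field set; the names here are illustrative, not a proposed interface.

    def strip_undeclared(payload, declared):
        """Deny by default: return only the fields named in declared."""
        return {k: v for k, v in payload.items() if k in declared}

    args = strip_undeclared(
        {'simulationId': 'abc123', 'auth_uid': 'not-allowed'},
        declared={'simulationId', 'simulationType'},
    )
    # args == {'simulationId': 'abc123'}; auth_uid never reaches self.args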

Another thought the schemas bring up: Do we want to validate the return payloads too?

robnagler commented 2 years ago

Good questions.

I agree with the first paragraph. Other whys:

What about using some structure like JSON Schema that we can use to validate the payloads?

There are a lot of options here (jsonrpc.org, graphql.org, etc.) for the definition language, but as a wise programmer once said, it's better to program in Python. We are buying into Python for the declarative structure.

Another thought with using schemas (although it could probably be done with kwargs too) is that we could make them deny by default.

Yes, yes, yes. And we have a path here that's got to be backwards compatible until we've fixed all the old code to properly declare all inputs. We need a type system, and pktype is a placeholder that allows us to be forwards compatible with those declarations. We will introduce a new model for declarations that also encompasses the schema, which currently describes too much in one way.

Another thought the schemas bring up: Do we want to validate the return payloads too?

The GUI has to trust the (authenticated) server. The server (call_api) has to trust itself. Validating payloads only makes sense for untrusted (unauthenticated) sources.

The point of this issue is to build a structure that allows us to move forward without having to change any of the semantics. The structure itself can be put into place with just a bit of code and a global replace or two.

e-carlin commented 2 years ago

Thanks for the details. Sounds good.

robnagler commented 2 years ago

I'll add some context since it just crossed my mind:

        state.isStateCanceled = function() {
            return simulationStatus().state == 'canceled';
        };

        state.isStateCompleted = function() {
            return simulationStatus().state == 'completed';
        };

        state.isStateError = function() {
            return simulationStatus().state == 'error';
        };

The type model should answer these questions so all that needs to happen is getting access to the underlying object, e.g.

state.simulationStatus().state.isError()

The type module would dynamically create isError (or something similar) from a specification. This is done with Python's enum in a limited way; pktype would provide a module with a lot of policy about names and such, which would allow us to create enumerated types on the fly from the schema.
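A minimal sketch of generating such predicates on the fly; make_choices and the is_<value> naming are placeholders, since the pktype interface is still to be defined.

    def make_choices(name, values):
        """Create a small enumerated type with an is_<value>() predicate per choice."""
        def _init(self, value):
            if value not in values:
                raise ValueError(f'{name}: invalid value {value!r}')
            self.value = value

        attrs = {'__init__': _init}
        for v in values:
            # bind v per iteration so each predicate tests its own choice
            attrs[f'is_{v}'] = (lambda c: lambda self: self.value == c)(v)
        return type(name, (object,), attrs)

    SimulationState = make_choices('SimulationState', ['error', 'completed', 'canceled'])
    s = SimulationState('error')
    assert s.is_error() and not s.is_completed()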

e-carlin commented 2 years ago

The type module would dynamically create isError (or something similar) from a specification.

Can you be more specific? I'm having a hard time visualizing what that means. Also, since that code is in JS, how are we going to share knowledge between pktype and JS?

I thought about this some more and I'd like to get more specific with how this interface will deal with api_runMulti. Specifically, can it eliminate this check (checking which API is being called and that it has proper auth)? If so, can you provide a little pseudocode for what that would look like?

robnagler commented 2 years ago

Let's assume it is in "the" schema:

SimulationStatus: {
    pktype: 'Struct',
    values: {
        state: 'SimulationState',
    },
},
SimulationState: {
    pktype: 'Choices',
    values: ['error', 'completed', 'canceled'],
},

There would be both JS and Py interpreters that would read this and turn this into classes. I don't know how we are going to handle namespaces on the JS side, but on the Py side, we could do something like pktype.SimulationState. The namespace would be flat for all dynamically created types, and conflicts would cause an error. The reason to use JSON is obvious, but it's rather annoying. We could use JavaScript (real classes) and parse it in Python with the help of node (dynamically or statically).
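A minimal Python-side sketch of interpreting such a schema entry into a flat, conflict-checked type namespace; the registry and the generated class are stand-ins, not a design.

    _registry = {}

    def register_type(name, cls):
        """Flat namespace for dynamically created types; a name collision is an error."""
        if name in _registry:
            raise AssertionError(f'duplicate pktype name: {name}')
        _registry[name] = cls
        return cls

    schema = {
        'SimulationState': {'pktype': 'Choices', 'values': ['error', 'completed', 'canceled']},
    }
    for name, decl in schema.items():
        if decl['pktype'] == 'Choices':
            # a trivial stand-in for the class a real interpreter would generate
            register_type(name, type(name, (object,), {'values': tuple(decl['values'])}))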

I thought about this some more and I'd like to get more specific with how this interface will deal with api_runMulti.

runMulti just needs to have the right entry points available. With more metadata, it'll be easier to do this.

Part of the change will be to obviate the need for runMulti, because websockets will allow us to bundle messages. The "server" (which will be the same process as the supervisor) will just pass on the messages as it receives them, and if we code it right, there won't be any responses to be processed.

e-carlin commented 2 years ago

There would be both JS and Py interpreters that would read this and turn this into classes.

Makes sense; that would be a really nice feature. It has similarities to some of the code generation from proto files in gRPC, although with considerably more intelligence for our domain.

runMulti just needs to have the right entry points available. With more metadata, it'll be easier to do this.

To have any useful comments, I'm going to need more specifics. What exactly are "right entry points"? Same with "more metadata"? Can you give an example entry point and an example payload, with some words about how the entry point would process the payload?

robnagler commented 2 years ago

gRPC although with considerably more intelligence for our domain.

It would be good to look at all of these before deciding on a schema. What is important is that the design be living and have a structure for plugins.

What exactly are "right entry points"? Same with "more metadata"?

The concept of a "call" does not exist. We're adding that. This line should be abstracted into "iterate calls", which by definition would check the permissions and names. awaitReply should not be known by the GUI. The GUI should be fully async. The supervisor should simply reply back when replies come back in a CATOCS'ish way. (Best effort, no guarantees.)

job_api will go away (mostly) once the supervisor and server are merged.

mkeilman commented 2 years ago

Also posed here

Is it appropriate to use constants (so @sirepo.api.Spec(api.REQUIRE_COOKIE_SENTINEL))? Further, should Spec get subclassed (so @sirepo.api.SpecRequireCookieSentinel)? Or is that defeating the purpose of "Quests"?

(To clarify, the subclass idea seems to gain nothing over the current system except the underlying class structure, although you would then have @sirepo.api.SpecRequireUser(id='SimId'). Not sure if that's helpful.)

robnagler commented 2 years ago

Thanks for the cross link.

Spec should have a permission, not be a permission. Permissions are currently very narrow because of exactly this problem: we only allow one permission per API. This is a bug, which would be difficult to correct with the current interfaces. With the above change, we could then say:

@sirepo.api.Spec(('require_user', 'require_admin'))

This is just a simple example of what can be done with "has-a" vs "is-a" in this case.
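A minimal sketch of the "has-a" idea, where a Spec holds a tuple of permissions rather than being one; the class shape and permission handling are illustrative.

    class Spec:
        """Holds one or more permissions instead of being a single permission."""

        def __init__(self, perms, **arg_types):
            self.perms = (perms,) if isinstance(perms, str) else tuple(perms)
            self.arg_types = arg_types

        def allows(self, caller_perms):
            # the caller needs at least one of the declared permissions
            return bool(set(self.perms) & set(caller_perms))

    s = Spec(('require_user', 'require_admin'), id='SimId')
    assert s.allows({'require_admin'})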

robnagler commented 2 years ago

@garsuga @mkeilman This may be more about Sirepo, but we need a type system that expresses more about the instance than just its value. I think it can be immutable, because that's how React thinks about things anyway.

We may want to solve this with multiple inheritance, or perhaps with a decorated class that creates proxies on the class. There may only need to be one proxy, e.g. pkget to get the property.

There are two properties that will be useful:

Perhaps you can think of others.

robnagler commented 2 years ago

@jeinstei points us to: https://lwn.net/SubscriberLink/900739/3d41677790e1536c/

I haven't read it but it does talk about units.

mkeilman commented 2 years ago

@jeinstei points us to: https://lwn.net/SubscriberLink/900739/3d41677790e1536c/

I haven't read it but it does talk about units.

I went through it; it's a summary of arguments for adding types to Python and different ways to accomplish that. It's all about literals and calculations, whereas we use types (exclusively?) for display purposes. Unit handling is built into field labels right now, which is improper, and which I've amended in my explorations by adding units in the kwargs of a field definition. I think trying to deal with units any more deeply than that is best avoided.

One problem we do have with our unit labels on plots is that we can't use things like millimeters (mm), because they get transformed into nonsensical units like "mmm" if the values get small enough. We often have to do extra conversions just for plots. A minor annoyance but it abuts the question at hand.

robnagler commented 2 years ago

Thanks for reading this and summarizing.

Why do we want to avoid dealing with units in computations? I sort of understand, but examples or reasoning would be helpful. Also, how then would we apply the units? Only when a value is converted to send to the client?

Are the extra conversions for plots on the server?

What is doing the nonsensical transformation?

mkeilman commented 2 years ago
  • schema location (the absolute name of the element)

Do you mean a namespacey thing like "dog.breed"?

robnagler commented 2 years ago

Yes, the absolute location, so it would be something like models.dog.breed. There may be shorthands, but this is necessary so you can locate any element in the schema.

mkeilman commented 2 years ago

Why do we want to avoid dealing with units in computations? I sort of understand, but examples or reasoning would be helpful.

My thinking:

Also, how then would we apply the units? Only when a value is converted to send to the client?

Yes, if such a conversion is even necessary. Aside from avoiding the weird labels, I contend, generally, it is not.

Are the extra conversions for plots on the server?

I think I'm the only one who has done it, so I should not have said "we" :) I do it in JS for 3D axes in radia and cloudmc, since those aren't plots in the usual sense. Another candidate is the field plots in radia, where we should convert on the server (but currently don't).

What is doing the nonsensical transformation?

We use d3 (d3.formatPrefix). It doesn't know what a unit "means", but uses the value to add the appropriate prefix.

robnagler commented 2 years ago

Most of the calculations we do are in libraries that assume the proper units. When the input is user-supplied, we say in the UI "This number is to be in millimeters". When we get data back, we say in the UI "This result is in seconds". Why guide the user to use units that differ from those the library expects or delivers?

I was not thinking this would be guiding the user, but that it would be used to assert values without having to refer back to the schema.

It's just as easy and clear imho to use month_val * _MILLISECONDS_PER_MONTH as (say) month_val.to("milliseconds"), to take a recent example.

_MILLISECONDS_PER_MONTH would already be in milliseconds. This is how timedelta works in Python. It knows what it is, so you can't do nonsensical things like multiply a timedelta by a timedelta:

>>> datetime.timedelta(1)
datetime.timedelta(days=1)
>>> datetime.timedelta(1) * datetime.timedelta(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for *: 'datetime.timedelta' and 'datetime.timedelta'

Pursuant to the first point, we do conversions of units, not calculations with them, if that makes sense. We'd have to go out of our way to add constructions like month_val + 2ms.

The "2ms" would likely never exist in our code. All the values come from the schema so when they are parsed on input from the UI or from disk, they would be converted to types, and then operations performed would transmute the types.

We don't even do that much conversion. Is this a solution in search of a problem?

This is about failing fast and avoiding state decoupling. If the object is converted on input, then it always behaves properly.

We use d3 (d3.formatPrefix). It doesn't know what a unit "means", but uses the value to add the appropriate prefix.

This should be wrapped by us. This is just a bug.

mkeilman commented 2 years ago

Most of the calculations we do are in libraries that assume the proper units. When the input is user-supplied, we say in the UI "This number is to be in millimeters". When we get data back, we say in the UI "This result is in seconds". Why guide the user to use units that differ from those the library expects or delivers?

I was not thinking this would be guiding the user, but that it would be used to assert values without having to refer back to the schema.

e.g. we currently do this:

<input data-string-to-number="" data-ng-model="model[field]" data-min="info[4]" data-max="info[5]" class="form-control" style="text-align: right" data-lpignore="true" required />

where info is actually SIREPO.APP_SCHEMA.model[name]. Instead:

class NumberInput {
    constructor(schema) {
        this.value = schema.default;
        this.min = schema.min;
        ...
        this.units = schema.units;
    }
    validate() {
        return this.value >= this.min &&...
    }
    toReact() { 
        return <Input data-min="{this.min}"...> {this.units}
    }
}

(or whatever, cobbled from the current reactapp main branch). Once the instance is created, it no longer needs to know about the schema. The same sort of thing would also live in Python. Is that right? I get the sense there is an additional level of abstraction that I am missing.

Anyway I agree that the units (and the other metadata) belong in the schema. I don't think we need to manage units on the server to the extent mused on in the linked discussion.

robnagler commented 2 years ago

Yes, that's the basic idea.

The same sort of thing would also live in python. Is that right?

That is a good question, and what this discussion is trying to determine. Certainly knowledge needs to be shared.

When we receive data now, we don't validate it. We just use it. This is dangerous and not fail fast. Rather, we should validate all inputs from the UI on the server. Moreover, the UI should only send updates so that this validation isn't expensive. This is where things get tricky, but less of an issue for the fundamental question.

There are more issues than validating an individual input. There needs to be intra-data and extra-data validation, e.g. field1 constrains field2 to a specific range of values. Think of SQL constraints.
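A minimal sketch of such an inter-field constraint, in the spirit of a SQL CHECK; the fields and ranges are hypothetical.

    def validate_model(model):
        """Raise if an inter-field constraint is violated (think SQL CHECK)."""
        if model['mode'] == 'fine' and not (0 < model['step'] <= 0.01):
            raise ValueError("step must be in (0, 0.01] when mode is 'fine'")
        return model

    validate_model({'mode': 'fine', 'step': 0.005})   # ok
    # validate_model({'mode': 'fine', 'step': 0.5})   # raises ValueError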

robnagler commented 2 years ago

I think types are different than I first thought. As we continue to refine the API spec for Sirepo and the Sirepo React prototype, it seems that types are contracts at key points in the interaction. In the GUI, the contract is about the individual data types, e.g. that a user can input value X for schema value Y. On the server, it's different: we want to know that the data coming from either the GUI or the job agents are valid before they are stored in the database, e.g. when saveNewSimulation is called, all of the data have to be checked.

What we don't really need to do is manipulate types while we are doing computations on the server or job_agent (and there should be precious few of these on the client or the server). We only care when the values are put back into the database. Since we are using JSON as the backing store, we have to invent our own constraint mechanism. The schema defines the constraints, but doesn't enforce them.

For the GUI, I think we can do what we are doing: bind types (validators) to the data so that it is convenient in React. There may be a need for inter-field/model validators, and that is a design issue.

For the backend, the problem is different. We want to validate APIs for sure, and we want to validate results coming back from the job agents. Essentially, the server acts as a firewall to the database. For the APIs, I think we need a SimDataDict and a SimDataList that check their values against the schema on assignment. When we get a JSON value, it will go through a SimData object, which is a subclass of SimDataDict (perhaps). SimData will validate all the input (likely a POST, but it could be a file just as easily). It could do this by reading the schema in real time, or maybe the schema will be converted to a validator structure on boot.

The important point is that we don't have a general need for types once this validation occurs. It may be that a FileName is just a string once it passes input validation. This will eliminate problems when passing values to, say, open or even pkio.py_path, which doesn't expect a FileName.

Any computation in the job agents will occur with native types as well. Only when the result comes back will it be validated. This means we need a pkquest.Spec for the messages that come back from the job agents. This is not happening now (nor is there any validation of the sim data coming from the GUI).
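A minimal sketch of a dict that validates on assignment, in the spirit of the SimDataDict described above; the validator shape and field names are assumptions, not the eventual implementation.

    class SimDataDict(dict):
        """dict that runs a per-field validator on every assignment."""

        def __init__(self, validators):
            super().__init__()
            self._validators = validators

        def __setitem__(self, key, value):
            v = self._validators.get(key)
            if v is None:
                # deny by default: undeclared fields are rejected
                raise KeyError(f'unknown field: {key}')
            super().__setitem__(key, v(value))

    def _sim_state(value):
        # a validator that could be derived from the SimulationState choices above
        if value not in ('error', 'completed', 'canceled'):
            raise ValueError(f'invalid state: {value}')
        return value

    d = SimDataDict({'state': _sim_state})
    d['state'] = 'completed'    # ok
    # d['stat'] = 'completed'   # KeyError: unknown field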