scipp / sciline

Build scientific pipelines for your data
https://scipp.github.io/sciline/
BSD 3-Clause "New" or "Revised" License

Parameter type #139

Closed jl-wynen closed 8 months ago

jl-wynen commented 8 months ago

We have come across the need for richer parameter handling multiple times:

See also the guidelines https://github.com/scipp/essreduce/blob/main/docs/user-guide/reduction-workflow-guidelines.md

Proposed solution

I propose adding a decorator that adds the above to a user type. Here is a basic prototype:

from __future__ import annotations

from numbers import Integral
from pathlib import Path
from typing import Any, NewType, TypeVar

import sciline as sl

A = NewType('A', str)
B = NewType('B', str)
T = TypeVar('T')

def underlying_new(ty: type):
    # Scope requires the underlying type to be the last base class.
    # The code here uses the same assumption and is just as restrictive.
    if ty.__bases__[0] == sl.Scope:
        return ty.__bases__[1].__new__
    return ty.__new__

def parameter():
    def deco(ty: type) -> type:
        # Add a tag so Pipeline can recognise parameters:
        ty.__sciline_parameter__ = ty.__name__
        old_new = underlying_new(ty)

        # Override __new__ instead of __init__ because the latter does not work
        # for immutable types like int.
        def new(cls: ty, *args: Any, **kwargs: Any) -> ty:
            if hasattr(ty, '__validate__'):
                ty.__validate__(*args, **kwargs)
            # In addition to __validate__, we can also have __convert__
            # if __new__(*args, **kwargs) is not sufficient to support all
            # argument types.
            # Or take the same approach as Pydantic and allow __validate__ to do conversions.
            return old_new(cls, *args, **kwargs)

        if (ann := getattr(old_new, '__annotations__', None)) is not None:
            new.__annotations__ = ann
        ty.__new__ = new

        return ty

    return deco

# A generic positive int that can be constructed from a numbers.Integral type.
@parameter()
class PositiveInt(sl.Scope[T, int], int):
    """An int that is greater equal 0."""

    @staticmethod
    def __validate__(value: Integral):
        if int(value) != value:
            raise ValueError(f"PositiveInt must be an integer, got {value}")
        if value < 0:
            raise ValueError(f"PositiveInt must be non-negative, got {value}")

# A simpler type without custom behaviour except that parameter values
# are converted to Path automatically.
#
# This could also be defined using
#   FilePath = parameter()(NewType('FilePath', Path))
# but that would require tweaks to parameter, and the type couldn't have a docstring in that case.
@parameter()
class FilePath(Path):
    """Path to a file."""

def f(x: PositiveInt[A]) -> float:
    return 0.5 * x

def g(p: FilePath) -> str:
    return f'path:{p.with_suffix(".xye")!r}'

pl = sl.Pipeline(
    [f, g], params={PositiveInt[A]: 3.0, PositiveInt[B]: 5, FilePath: 'a.csv'}
)
print(repr(pl.compute((float, str))))  # prints {<class 'float'>: 1.5, <class 'str'>: "path:FilePath('a.xye')"}

The only change this requires is inserting this

base = origin if origin is not None else key
if hasattr(base, '__sciline_parameter__'):
    param = base(param)

here: https://github.com/scipp/sciline/blob/8467e25b9857fc40fda16de00f8eacc25fe193cf/src/sciline/pipeline.py#L430
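To illustrate, the conversion logic can be sketched in isolation like this (a minimal sketch with a hypothetical `convert_param` helper and `Count` type, not the actual `pipeline.py` code):

```python
from typing import Any, get_origin


def convert_param(key: type, param: Any) -> Any:
    """Construct the parameter type if the key is tagged as a parameter."""
    origin = get_origin(key)  # e.g. PositiveInt for a key like PositiveInt[A]
    base = origin if origin is not None else key
    if hasattr(base, '__sciline_parameter__'):
        # Calling the class runs __new__ and hence __validate__.
        return base(param)
    return param


# Minimal tagged parameter type for demonstration:
class Count(int):
    __sciline_parameter__ = 'Count'


converted = convert_param(Count, 3.0)
assert isinstance(converted, Count) and converted == 3
assert convert_param(int, 3.0) == 3.0  # untagged keys pass through unchanged
```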

Alternative:

Supporting typing.Annotated (#36) would allow us to add metadata to types, which could be used for the above purposes. But as noted in that issue, this may be brittle. And encoding all relevant forms of metadata can be difficult.

SimonHeybrock commented 8 months ago

Have you thought about our front-page claim https://scipp.github.io/sciline/#but-i-do-not-want-to-change-all-my-code? Can the same be done without having Sciline-specific attributes, etc.?

jl-wynen commented 8 months ago

Users need to define new types anyway (using NewType or Scope). What I propose does not require changes to the user's types.

But it is intrusive in that it adds attributes to the class. If we want to avoid that, we need a different way of identifying parameters and attaching validators to them. I can think of three possibilities:

  1. Keep doing it via the type but use a global registry of parameters. The parameter decorator would register types there.
    • Unintrusive for the type
    • Validation happens during pipeline construction, not param construction.
    • Global registry allows inspection of available params. But this is not tied to any particular pipeline.
  2. Do it via Pipeline. Add a method like Pipeline.add_parameter(ty, value, validate, convert).
    • Unintrusive for the type
    • Pipeline user needs to pass correct arguments to function.
    • Validation happens during pipeline construction, not param construction.
    • Inspection needs pipeline which in turn needs parameter values
  3. Separate Pipeline.register_parameter(ty, validate, convert) and Pipeline.set_parameter(ty, value).
    • Unintrusive for the type
    • Pipeline builder needs to pass correct arguments to function.
    • Validation happens during pipeline construction, not param construction.
    • Inspection needs pipeline but no param values yet.
    • Central place to discover params

For completeness:

  1. Current proposal
    • Intrusive for the type
    • Param implementer defines validation
    • Validation happens during param construction. This includes cases where params are computed by providers.
    • Param classes can be inspected.
    • No central place to discover all params.
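To make option 3 more concrete, here is a rough sketch of what a registry-based API could look like (the `ParamRegistry` class and its method signatures are hypothetical, not an existing Sciline API):

```python
from typing import Any, Callable


class ParamRegistry:
    """Hypothetical per-pipeline parameter registry for option 3."""

    def __init__(self) -> None:
        self._specs: dict[type, tuple[Callable, Callable]] = {}
        self._values: dict[type, Any] = {}

    def register_parameter(self, ty: type, validate=None, convert=None) -> None:
        # Registration needs no value, so params can be inspected early.
        self._specs[ty] = (validate or (lambda v: None), convert or (lambda v: v))

    def set_parameter(self, ty: type, value: Any) -> None:
        validate, convert = self._specs[ty]
        validate(value)  # validation happens at pipeline-construction time
        self._values[ty] = convert(value)

    def parameters(self) -> list[type]:
        # Central place to discover all registered params, with or without values.
        return list(self._specs)


def check_non_negative(v) -> None:
    if v < 0:
        raise ValueError(f"must be non-negative, got {v}")


reg = ParamRegistry()
reg.register_parameter(int, validate=check_non_negative, convert=int)
assert reg.parameters() == [int]  # discoverable before any value is set
reg.set_parameter(int, 3.0)
assert reg._values[int] == 3  # converted on assignment
```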
SimonHeybrock commented 8 months ago

Users need to define new types anyway (using NewType or Scope).

They can use any type they want. If they already have strong types instead of generic containers in their code no change is needed. And NewType is in the standard library, i.e., no dependency on Sciline is added.

What I propose does not require changes to the users types.

Where would the decorator run / be used?

jl-wynen commented 8 months ago

I was operating under the assumption that users have to define new types anyway, because it is extremely rare for existing code to use a dedicated type for every single entity. And it would then be OK to depend on Sciline and use the decorator in that code, as it will require a lot of uses of sciline.Scope anyway.

SimonHeybrock commented 8 months ago

And it would then be ok to depend on Sciline and use the decorator in that code as it will anyway require a lot of uses of sciline.Scope.

Not sure I agree. I have always seen sciline.Scope as a hack for generic aliases. It caters to some laziness about defining a proper generic type. And it will go away with Python 3.12's generic aliases, I think?

jl-wynen commented 8 months ago

In that case, can you comment on the alternatives?

SimonHeybrock commented 8 months ago

Options 1-3 seem very invasive. Looking at the original requirements you posted on top:

  • Validate parameter values

  • Document parameters, e.g., for building a GUI (Docstrings are not supported for type aliases. And we may need more structured data, e.g., to encode valid ranges.)

  • Support different types. While the linked issue is closed, the current solution requires using Union types and providers have to handle the different types themselves.

  • Default values.

How about using Pydantic, or something like that? It seems to fulfill 4/4 of the above. Downside is the need to access the param property in providers, but is that actually a problem, compared to the advantages of using an existing and well-known solution?

# SPDX-License-Identifier: BSD-3-Clause
# Copyright (c) 2023 Scipp contributors (https://github.com/scipp)
from pydantic import BaseModel, Field
import sciline as sl
from pathlib import Path
from typing import Union

def test_simple_param() -> None:
    class Parameter(BaseModel):
        value: int = Field(..., description="An int", ge=0, le=100)

    def use_param(param: Parameter) -> float:
        return 1.5 * param.value

    pipeline = sl.Pipeline((use_param,))
    pipeline[Parameter] = Parameter(value=10)
    assert pipeline.compute(float) == 15.0

def test_param_with_two_different_value_types() -> None:
    class Param(BaseModel):
        value: Path = Field(..., description="A string or a path")

    def use_path(param: Param) -> bool:
        return isinstance(param.value, Path)

    pipeline = sl.Pipeline((use_path,))
    pipeline[Param] = Param(value="test")
    assert pipeline.compute(bool) == True
    pipeline[Param] = Param(value=Path("test"))
    assert pipeline.compute(bool) == True
jl-wynen commented 8 months ago

I considered this but didn't go that way because it requires providers to be aware of it. It also requires defining new classes. And it gives special meaning to pydantic models. We can no longer use them for data that shouldn't be treated as params.

SimonHeybrock commented 8 months ago

It also requires defining new classes.

Don't you need to do that anyway if you want validators and docs?

And it gives special meaning to pydantic models. We can no longer use them for data that shouldn't be treated as params.

You mean because you need to access a property?

jl-wynen commented 8 months ago

It also requires defining new classes.

Don't you need to do that anyway if you want validators and docs?

You complained about this earlier. This is why I proposed 1-3

And it gives special meaning to pydantic models. We can no longer use them for data that shouldn't be treated as params.

You mean because you need to access a property?

No. I mean when listing all parameters. It would be nice to put all providers into a pipeline and then loop through all keys (present and missing) to list all parameters in order to, e.g., build a GUI or other interface. But this is based solely on the types. So if we rely on pydantic models, all models used anywhere in the graph would be interpreted as parameters. (You may say that we can exclude keys that are produced by providers. But that doesn't work because we may have providers that are alternatives for determining params. For example, there may be a Filename that can be converted into a FilePath. But the user can just as well specify a FilePath directly.)
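The Filename / FilePath situation can be sketched like this (Filename, FilePath, and to_path are hypothetical names used only for illustration):

```python
from pathlib import Path
from typing import NewType

Filename = NewType('Filename', str)
FilePath = NewType('FilePath', Path)


# A provider that derives FilePath from Filename:
def to_path(name: Filename) -> FilePath:
    return FilePath(Path(name))


# FilePath can be produced by to_path *or* set directly by the user, so
# "exclude keys that have a provider" would wrongly hide FilePath from
# any parameter listing whenever to_path is part of the graph.
derived = to_path(Filename('run123.h5'))
direct = FilePath(Path('run123.h5'))
assert derived == direct
```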

SimonHeybrock commented 8 months ago

It also requires defining new classes.

Don't you need to do that anyway if you want validators and docs?

You complained about this earlier. This is why I proposed 1-3

I did? Where?

And it gives special meaning to pydantic models. We can no longer use them for data that shouldn't be treated as params.

You mean because you need to access a property?

No. I mean when listing all parameters. It would be nice to put all providers into a pipeline and then loop through all keys (present and missing) to list all parameters to, e.g., build a GUI or other interface. But this is based solely on the types. So if we rely on pydantic models, all models used anywhere in the graph would be interpreted as parameters.

This seems like conflated requirements, not what the original premise of this issue was.

  1. I am not saying that "we" use Pydantic models, I am basically just suggesting that no change to Sciline may be required. The user may use Pydantic models, or anything else.
  2. Finding/listing all parameters can simply be done by (a) defining it in a module/package, which is probably what one might need anyway, since automatic discovery seldom does the job, or (b) finding all "unsatisfied" requirements in a pipeline (we do this already for task-graph visualization).
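Approach (b) can be sketched as follows (a minimal, hypothetical `unsatisfied_keys` helper based only on type hints; Sciline's actual graph inspection is more involved):

```python
import typing


def unsatisfied_keys(providers):
    """Keys that some provider requires but no provider produces."""
    produced = {typing.get_type_hints(p)['return'] for p in providers}
    required = {
        ty
        for p in providers
        for name, ty in typing.get_type_hints(p).items()
        if name != 'return'
    }
    return required - produced


def f(x: int) -> float:
    return 0.5 * x


def g(y: float) -> str:
    return str(y)


# int has no provider, so it would be listed as a parameter;
# float is produced by f and therefore excluded.
assert unsatisfied_keys([f, g]) == {int}
```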
jl-wynen commented 8 months ago

I did? Where?

Here:

Have you thought about our front-page claim https://scipp.github.io/sciline/#but-i-do-not-want-to-change-all-my-code.

The page says

Sciline is very non-invasive. If you are willing to define a few types (or type aliases) and add type annotations to your functions

But you now suggest wrapping data in dedicated models. Yes, those are independent from Sciline. But they are still tacked on top of the normal user code.


This seems like conflated requirements, not what the original premise of this issue was.

I'd rather not refactor this if we want UIs.

I am not saying that "we" use Pydantic models, I am basically just suggesting that no change to Sciline may be required. The user may use Pydantic models, or anything else.

Sure. Users can do that. I wanted to see how Sciline can help.

Finding/listing all parameters can simply be done by (a) defining it in a module/package, which is probably what one might need anyway, since automatic discovery seldom does the job, or (b) find all "unsatisfied" requirements in a pipeline (we do this already for task-graph visualization).

(a) Then this can be done directly in the UI code, so the whole discussion is moot. We can just do all documentation, validation, conversion, etc. in the UI code (or a dedicated module). But this is hard to extend when we add to a pipeline.

(b) This fails with the Filename / FilePath example I gave.

SimonHeybrock commented 8 months ago

I did? Where?

Here:

Have you thought about our front-page claim https://scipp.github.io/sciline/#but-i-do-not-want-to-change-all-my-code.

The page says

Sciline is very non-invasive. If you are willing to define a few types (or type aliases) and add type annotations to your functions

But you now suggest wrapping data in dedicated models. Yes, those are independent from Sciline. But they are still tacked on top of the normal user code.

Dedicated by the user, yes. Validators and docstrings must be defined with a parameter; how did you intend to do this outside the user code? I don't see how that can be done without writing a class. My point is: I do not want to require writing that code (parameters, validators, ...) in a Sciline-specific manner.

This seems like conflated requirements, not what the original premise of this issue was.

I'd rather not refactor this if we want UIs.

Refactor what?

I am not saying that "we" use Pydantic models, I am basically just suggesting that no change to Sciline may be required. The user may use Pydantic models, or anything else.

Sure. Users can do that. I wanted to see how Sciline can help.

What are you referring to here, avoiding having to write param.value instead of param?

Finding/listing all parameters can simply be done by (a) defining it in a module/package, which is probably what one might need anyway, since automatic discovery seldom does the job, or (b) find all "unsatisfied" requirements in a pipeline (we do this already for task-graph visualization).

(a) Then this can be done directly in the UI code, so the whole discussion is moot. We can just do all documentation, validation, conversion, etc. in the UI code (or a dedicated module). But this is hard to extend when we add to a pipeline.

Not sure what you are stating here.

(b) This fails with the Filename / FilePath example I gave.

Every solution fails somewhere.

jl-wynen commented 8 months ago

I'd rather not refactor this if we want UIs.

Refactor what?

The definitions of all parameters in all ESS packages, because we find out that they don't work for building a UI. If we go back to the stance that we will never provide GUIs and never implement helpers for them, then this doesn't matter.

What are you referring to here, avoiding having to write param.value instead of param?

Yes and having to wrap values in a class and having to come up with an attr name (value). I wanted to have a solution that is transparent. I.e., apart from the parameter decorator, it looks and feels like using pipelines without special param handling.

(a) Then this can be done directly in the UI code, so the whole discussion is moot. We can just do all documentation, validation, conversion, etc. in the UI code (or a dedicated module). But this is hard to extend when we add to a pipeline.

Not sure what you are stating here.

I'm saying that, if we push all responsibility for validation onto the user and treat GUIs as a separate matter, why are we talking about this in the first place?

(b) This fails with the Filename / FilePath example I gave.

Every solution fails somewhere.

Sure. But we need to pick the features we want to support. So does this mean that you don't want to support the example I gave?

SimonHeybrock commented 8 months ago

I'd rather not refactor this if we want UIs.

Refactor what?

The definitions of all parameters in all ESS packages, because we find out that they don't work for building a UI. If we go back to the stance that we will never provide GUIs and never implement helpers for them, then this doesn't matter.

Maybe you can explain in person, but I am not sure I understand why using an @parameter decorator avoids any need for refactoring, while some other solution cannot.

What are you referring to here, avoiding having to write param.value instead of param?

Yes and having to wrap values in a class and having to come up with an attr name (value). I wanted to have a solution that is transparent. I.e., apart from the parameter decorator, it looks and feels like using pipelines without special param handling.

Maybe I just don't understand your proposal. How will you implement the bullet points from your original message, specifically (1) doc/description, (2) validators, and (3) default values. How can that be done without a class or something similar?

(a) Then this can be done directly in the UI code, so the whole discussion is moot. We can just do all documentation, validation, conversion, etc. in the UI code (or a dedicated module). But this is hard to extend when we add to a pipeline.

Not sure what you are stating here.

I'm saying that, if we push all responsibility for validation onto the user and treat GUIs as a separate matter, why are we talking about this in the first place?

Sciline does not know how parameters should be validated. Only the Sciline user can do that. Again, I do not understand what you have in mind otherwise.

(b) This fails with the Filename / FilePath example I gave.

Every solution fails somewhere.

Sure. But we need to pick the features we want to support. So does this mean that you don't want to support the example I gave?

Let us focus on the main discussion for now, as there are apparently a few misunderstandings, so we should avoid getting side-tracked.

jl-wynen commented 8 months ago

A comment about using custom classes with pydantic: it is non-trivial to use types that pydantic does not support natively as model fields. And it requires a modification of the class. E.g., to make it work with scipp, we need something like this:


from typing import Any, Never, TypeVar

import scipp as sc
from pydantic import BaseModel, GetCoreSchemaHandler
from pydantic_core import core_schema

import sciline as sl

T = TypeVar('T')

def identity(x: T) -> T:
    return x

def serialize_var(x: sc.Variable) -> Never:
    raise NotImplementedError()

def schema(
    cls, _source_type: Any, _handler: GetCoreSchemaHandler
) -> core_schema.CoreSchema:
    return core_schema.no_info_after_validator_function(
        identity,
        core_schema.is_instance_schema(sc.Variable),
        serialization=core_schema.plain_serializer_function_ser_schema(
            serialize_var, info_arg=False, return_schema=core_schema.str_schema()
        ),
    )

# Need to modify the class!
sc.Variable.__get_pydantic_core_schema__ = classmethod(schema)

class Range(BaseModel):
    value: sc.Variable

def foo(r: Range) -> int:
    return len(r.value)

params = {Range: Range(value=sc.arange('x', 4))}
pl = sl.Pipeline((foo,), params=params)
print(pl.compute(int))  # -> 4

Ideally, the user should not modify a scipp (or numpy, pandas, etc.) class. So this needs buy-in from the corresponding library. (I'm not opposed to supporting pydantic schemas for our types. This could also help with serializing parameters.)

jl-wynen commented 8 months ago

Conclusion of offline discussion:

Points 1, 2, and 4 can be done outside of Sciline with an existing package or a bespoke solution. But this does not fit the core purpose of Sciline, which is building task graphs.

Concerning param inspection and building GUIs, we can have common classes, functions, protocols in another package, e.g., ESSreduce. This can be used by our own special-purpose GUI-building library.

SimonHeybrock commented 8 months ago

👍

Concerning param inspection and building GUIs, we can have common classes, functions, protocols in another package

Note also #83, which would enable this.

With that conclusion, I am going to close this, please correct me if I am wrong.