reflectivity / orsopy

http://www.reflectometry.org/orsopy/
MIT License
0 stars 8 forks source link

MAINT: get orsopy working in 3.12. Closes #115 #116

Closed andyfaff closed 9 months ago

andyfaff commented 1 year ago

Adds a whole bunch of boilerplate in init functions, but would get orsopy working on 3.12.

aglavic commented 1 year ago

HI @andyfaff , coult you explain your idea behind this changes a bit. For me this looks like making the Header classes much harder to extend and adding a lot of code to all classes doing init routines that should actually be handled by the dataclass meta functionality.

What do we gain from using dataclass at all, if we need a custom init function for bascially every class?

Anyway, I'll have a look at it and will also check if it influences functionality anywhere.

andyfaff commented 1 year ago

Unfortunately the stdlib dataclasses (from lib/dataclasses.py) aren't designed to accept extra keyword arguments to create attributes that aren't listed on the class. e.g. one can't have HeaderDerivedClass(attr1, extra_attr=5), where attr1 is specified in the class definition, but extra_attr isn't. Because one of our use cases was to allow these unknown attributes we had modify how the dataclasses worked. This was done by creating the orsodataclass decorator.

orsodataclass required a knowledge of how dataclasses worked internally, together with the creation of _custom_init_fn and _create_fn, as well as several private attributes/functions from the dataclasses module. Unfortunately cp312 altered how some of these functions worked internally. Before cp312 various variables were created in our version of _init_fn using:

locals = {f"_type_{f.name}": f.type for f in fieldsarg}

However, in cp312 the stdlib _init_fn creates them like:

locals = {f'__dataclass_type_{f.name}__': f.type for f in fields}

This means that our _custom_init_fn works with cp<312, but not with cp312 because the rest of lib/dataclasses.py is incompatible with the variable naming scheme.

It may be possible to make a shim for this particular use case, but it's a fragile solution. I don't think we should be depending on our knowledge of the internals of dataclasses, or on private functions/attributes, because it's inherently subject to change outside our control. Every time a new Python version comes out we have to hope that nothing has changed (e.g. look at orsopy.dataclasses to see how one has to do various meddling for different Python versions. Speaking of which I could probably delete that file).

So, how is a solution reached? The first solution could be to vendor the lib/dataclasses.py file from stdlib. In that way we could keep our existing customised dataclass operation. However, I actually find the existing code quite hard to work with, because the monkey patching means you have to be familiar with lib/dataclasses.py as well as our overlay.

This PR takes a different route. Instead of using an orsodataclass decorator we just use the dataclass decorator directly. In order to provide the ability to provide the extra_attr mentioned above we have to write an __init__ method for each class.

e.g. for orsopy.fileio.base.Value we add:

    def __init__(self, magnitude, unit=None, error=None, *, comment=None, **kwds):
        super(Value, self).__init__()
        self.magnitude = magnitude
        self.unit = unit
        self.error = error
        for k, v in kwds.items():
            setattr(self, k, v)
        self.comment = comment
        self.__post_init__()

In this way we can add extra kwds without any issues. The stdlib does allow for dataclasses to provide their own __init__ methods, so operation is quite simple. One just has to make sure that one calls __post_init__ at the end of initialisation. IMO this approach is much simpler than the existing code. It's straightforward to extend/add extra attributes. We just add an extra parameter to the __init__ method, and specify the attribute in the class definition. The drawback is that it does add an amount of boilerplate initialisation.

What do we gain from using dataclass at all, if we need a custom init function for bascially every class?

It's basically the same __init__ method, but with different attribute names. The main reason we still benefit from using dataclasses is the validation carried out in Header.__post_init__ and Header._resolve_type. This is important when creating instances (no point in creating Value("qwerty") when it should be Value(1.0)) of all the Header classes. However, it's even more important when we de-serialise all the yaml. dataclasses can store all their types in their field information, and we use those types to figure out how to marshal the data into the different Header classes.

bmaranville commented 1 year ago

What about this as the init function for Header, inherited by all the subclasses? It seems to work. Then you don't have to define an __init__ function for every class.

    def __init__(self, *args, **kwargs):
        self._orso_optionals = []
        own_fields = fields(self)
        own_field_names = [f.name for f in own_fields]
        field_index = 0
        for a in args:
            # field order is guaranteed
            setattr(self, own_field_names[field_index], a)
            field_index += 1
        for (k, v) in kwargs.items():
            setattr(self, k, v)
        if not hasattr(self, 'comment'):
            self.comment = kwargs.get('comment', None)

        self.__post_init__()

(I'm not sure if the special handling for comment is required)

bmaranville commented 1 year ago

Update: this __init__ function makes all the tests pass (after removing all the other init functions except class Orso and class Header:

class Header:
    """
    The super class for all of the items in the orso module.
    """
    _orso_optionals: List[str] = []

    def __init__(self, *args, **kwargs):
        self._orso_optionals = []
        own_fields = list(fields(self))
        for a in args:
            # field order is guaranteed
            field = own_fields.pop(0)
            setattr(self, field.name, a)
        # process the rest of own_fields, in order:
        for field in own_fields:
            if field.name in kwargs:
                field_value = kwargs.pop(field.name)
            elif field.default_factory is not MISSING:
                field_value = field.default_factory()
            elif field.default is not MISSING:
                field_value = field.default
            else:
                raise ValueError(f"missing positional argument \"{field.name}\" for class {self.__class__.__name__}")
            setattr(self, field.name, field_value)
        # process any remaining kwargs:
        for (k, v) in kwargs.items():
            setattr(self, k, v)

        self.__post_init__()

I am curious why extra kwargs are added as class attributes for every class except Orso? Why do we use the specially-created attribute user_data for that one class?

Attached is a diff, that if applied to this PR (in its current state) makes it pass all the tests. inherit_init_patch.zip

andyfaff commented 1 year ago

Update: this init function makes all the tests pass

I tried the patch. There are pros and cons. On the one hand it makes all the classes derived from Header cleaner by removing all the boiler plate. However, it has an unfortunate side effect of making the help/docstring more opaque - the call signature gets replaced by *args, **kwargs instead of the actual signature that would be accepted.

>>> Person? 
Init signature: Person(*args, **kwargs)
Docstring:     
Information about a person, including name, affilation(s), and contact
information.
File:           ~/Documents/Andy/programming/orsopy/orsopy/fileio/base.py
Type:           type
Subclasses:     

There would seem to be a tradeoff here. Personally I would prefer to see the correct signature than remove the boilerplate. What are your thoughts? (I wonder if it'd be possible to write yet another decorator that would restore the function signature that would be displayed by help)

I am curious why extra kwargs are added as class attributes for every class except Orso? Why do we use the specially-created attribute user_data for that one class?

I can't remember why it's called user_data rather than anything else. It's used to include/absorb extra fields from the Header. I'd be wary about removing it though.

andyfaff commented 1 year ago

(I wonder if it'd be possible to write yet another decorator that would restore the function signature that would be displayed by help)

e.g.

def inherit_docstring_from(cls):
    pars = []
    for f in fields(cls):
        if not f.default:
            default = f.default
        else:
            default = Parameter.empty
        p = Parameter(f.name, Parameter.POSITIONAL_OR_KEYWORD, annotation=f.type, default=default)
        pars.append(p)

    doc = cls.__doc__
    cls.__doc__ = "Init signature:  " + cls.__name__ + str(Signature(pars)) + "\n" + doc
    return cls

@inherit_docstring_from
@dataclass(init=False)
class Pop(Header):
    """
    My test
    """
    _id: int
    unit: Optional[str] = field(default=None, metadata={"description": "SI unit string"})
    error: Optional[int] = None
    comment: Optional[str] = None

would give something like:

Init signature: Pop(*args, **kwargs)
Docstring:     
Init signature:  Pop(_id: int, unit: Optional[str] = None, error: Optional[int] = None, comment: Optional[str] = None)

My test
Type:           type
Subclasses:     

which would be better than nothing

bmaranville commented 1 year ago

I agree, the lack of meaningful signatures is not great. This problem we're trying to solve, creating classes where core attributes are strongly typed and documented, but where extra attributes can be added (and serialized, and deserialized!), is hard. I have to say it feels weird to tack on random attributes to a defined class here in almost 2024 - while it felt completely normal to me 10 years ago. I think the accepted practice on this is shifting with the mainstream use of type hints and dataclasses.

andyfaff commented 1 year ago

The advantage of being able to cope with extra attributes can be important in versioning orsopy. For example, we could add new attributes in a later orsopy. Files created by that newer version should still be loadable by an older version that copes with extra attributes.

I propose:

aglavic commented 12 months ago

I would be against merging this PR, we should have an extended discussion about this as it is the basis for all orsopy classes. (And it might even break genx or other software the builds on orsopy by extending classes.)

I have a few main objections with this solution:

  1. If we want people to adapt and extend the data format, it should be as simple as possible for them. They should not have to know about our library internals to be able to create a Header derived class. That is given if you can just derive from another class or use the same logic as when creating dataclasses but not the case if you need to write custom init functions whose logic you don't understand as an outsider.
  2. The previous solution concentrated business logic to deal with these keywords in one place, now it is spread over all classes and might make as suceptible to bugs if something is missed in one of these classes. It also makes the code less readable.
  3. Skipping the dataclass init removes some of the dataclass functionality and could again brake, if they change how dataclasses work. For me it also seems that we re-do the main functionality of dataclasses ourselfs, in which case we could directly remove that part and keep the whole logic contained in orsopy.

For me there are two alternative routs that might be more practicle:

  1. Go for the metaclass solution that we had in discussion at the beginning of the project. This way the classes can just be derived withou any need for @dataclass etc.
  2. Follow Andrews suggestion of just shipping orsopy with the original dataclass module from python 3.11 (maybe reneamed) and the modifications we need for orsopy. For me this is the easiest and cleanest solution, all internal code can stay the same and we probably don't want future changes in how types/dataclasses work in this project, anyway.

I'm sorry to slow you down at this point. I'm aware that this is just based on my opinion and if everyone disagrees you can obviously go ahead in this direction. Unfortunately I didn't have much time to invest in the last couple of months and will be on vacation starting tomorrow. If you don't mind to wait so long, we could have a videocall end of January about this.

andyfaff commented 12 months ago

I hope you have a good holiday. Given that you'll be away before I can reply in full, would the genx test suite pick up any issues if we were to go down this path?

bmaranville commented 12 months ago

I think I agree with Artur that we should rethink our approach. It probably merits a longer discussion, when we're all back from the holidays. I think dataclasses provide some helpful features for defining a schema in python, but they are intentionally designed to not support arbitrary init kwargs, since it's kind of the point of a dataclass to define all your fields.

I also don't understand why downstream users couldn't inherit from Header subclasses and add their own fields, instead of tacking attrs to existing classes. For the migration problem (supporting future changes to the schema), I think we'd be much better served by developing migration functions, which is a fairly standard way to handle schema changes. The likelihood that future changes to the ORSO standard structure would be purely additive or subtractive of fields is pretty small, and I think that is the only type of schema change we could try to accomodate with arbitrary kwargs.

I know I am likely wrong about some of my concerns above, but my vote is to hold off on any refactor for now. There's not even e.g. a numba wheel for python 3.12 yet so I don't feel like there's a rush to have complete support for it at the moment.

bmaranville commented 11 months ago

What do you think of a solution like this: (found on stackoverflow)... instead of trying to force extra kwargs into dataclass init functions (which is hard and fragile), how about adding a new method "from_kwargs" for instantiating one of our classes with extra attributes? https://stackoverflow.com/a/55101438

We could use that function in the deserializer when reading .ort files, and then we wouldn't have to modify the __init__. The code for from_kwargs(**kw) would be much simpler, since we're not trying to do all the things dataclass initializers do, just run the built-in init and then add new kw.

aglavic commented 9 months ago

This could be a solution that could also be handled with inheritance for most cases without having to write a custom init for every single case. We will have to discuss how to treat the "comments" case. I'll have a look at the code a gain and will try some prototyping. I think after diving back into the code I can better weigh the various pros and cons, too.

@andyfaff @bmaranville @arm61 Can we find a date for a video call to discuss the strategy? For me the next weeks Mo/Tu look quite good.

aglavic commented 9 months ago

I've made a test implementation, have a look at #120

andyfaff commented 9 months ago

Hi Artur, good to see you back online. I'll have a look at your suggestions in the next few days.

aglavic commented 9 months ago

Resolved by #120