Closed andyfaff closed 9 months ago
HI @andyfaff , coult you explain your idea behind this changes a bit. For me this looks like making the Header classes much harder to extend and adding a lot of code to all classes doing init routines that should actually be handled by the dataclass meta functionality.
What do we gain from using dataclass at all, if we need a custom init function for bascially every class?
Anyway, I'll have a look at it and will also check if it influences functionality anywhere.
Unfortunately the stdlib dataclasses (from lib/dataclasses.py
) aren't designed to accept extra keyword arguments to create attributes that aren't listed on the class. e.g. one can't have HeaderDerivedClass(attr1, extra_attr=5)
, where attr1
is specified in the class definition, but extra_attr
isn't. Because one of our use cases was to allow these unknown attributes we had modify how the dataclasses worked. This was done by creating the orsodataclass
decorator.
orsodataclass
required a knowledge of how dataclasses worked internally, together with the creation of _custom_init_fn
and _create_fn
, as well as several private attributes/functions from the dataclasses module. Unfortunately cp312 altered how some of these functions worked internally. Before cp312 various variables were created in our version of _init_fn
using:
locals = {f"_type_{f.name}": f.type for f in fieldsarg}
However, in cp312 the stdlib _init_fn
creates them like:
locals = {f'__dataclass_type_{f.name}__': f.type for f in fields}
This means that our _custom_init_fn
works with cp<312, but not with cp312 because the rest of lib/dataclasses.py
is incompatible with the variable naming scheme.
It may be possible to make a shim for this particular use case, but it's a fragile solution. I don't think we should be depending on our knowledge of the internals of dataclasses, or on private functions/attributes, because it's inherently subject to change outside our control. Every time a new Python version comes out we have to hope that nothing has changed (e.g. look at orsopy.dataclasses
to see how one has to do various meddling for different Python versions. Speaking of which I could probably delete that file).
So, how is a solution reached? The first solution could be to vendor the lib/dataclasses.py
file from stdlib. In that way we could keep our existing customised dataclass operation. However, I actually find the existing code quite hard to work with, because the monkey patching means you have to be familiar with lib/dataclasses.py
as well as our overlay.
This PR takes a different route. Instead of using an orsodataclass
decorator we just use the dataclass
decorator directly. In order to provide the ability to provide the extra_attr
mentioned above we have to write an __init__
method for each class.
e.g. for orsopy.fileio.base.Value
we add:
def __init__(self, magnitude, unit=None, error=None, *, comment=None, **kwds):
super(Value, self).__init__()
self.magnitude = magnitude
self.unit = unit
self.error = error
for k, v in kwds.items():
setattr(self, k, v)
self.comment = comment
self.__post_init__()
In this way we can add extra kwds without any issues. The stdlib
does allow for dataclasses to provide their own __init__
methods, so operation is quite simple. One just has to make sure that one calls __post_init__
at the end of initialisation. IMO this approach is much simpler than the existing code. It's straightforward to extend/add extra attributes. We just add an extra parameter to the __init__
method, and specify the attribute in the class definition. The drawback is that it does add an amount of boilerplate initialisation.
What do we gain from using dataclass at all, if we need a custom init function for bascially every class?
It's basically the same __init__
method, but with different attribute names. The main reason we still benefit from using dataclasses is the validation carried out in Header.__post_init__
and Header._resolve_type
. This is important when creating instances (no point in creating Value("qwerty")
when it should be Value(1.0)
) of all the Header classes. However, it's even more important when we de-serialise all the yaml. dataclasses can store all their types in their field information, and we use those types to figure out how to marshal the data into the different Header classes.
What about this as the init function for Header, inherited by all the subclasses? It seems to work. Then you don't have to define an __init__
function for every class.
def __init__(self, *args, **kwargs):
self._orso_optionals = []
own_fields = fields(self)
own_field_names = [f.name for f in own_fields]
field_index = 0
for a in args:
# field order is guaranteed
setattr(self, own_field_names[field_index], a)
field_index += 1
for (k, v) in kwargs.items():
setattr(self, k, v)
if not hasattr(self, 'comment'):
self.comment = kwargs.get('comment', None)
self.__post_init__()
(I'm not sure if the special handling for comment is required)
Update: this __init__
function makes all the tests pass (after removing all the other init functions except class Orso and class Header:
class Header:
"""
The super class for all of the items in the orso module.
"""
_orso_optionals: List[str] = []
def __init__(self, *args, **kwargs):
self._orso_optionals = []
own_fields = list(fields(self))
for a in args:
# field order is guaranteed
field = own_fields.pop(0)
setattr(self, field.name, a)
# process the rest of own_fields, in order:
for field in own_fields:
if field.name in kwargs:
field_value = kwargs.pop(field.name)
elif field.default_factory is not MISSING:
field_value = field.default_factory()
elif field.default is not MISSING:
field_value = field.default
else:
raise ValueError(f"missing positional argument \"{field.name}\" for class {self.__class__.__name__}")
setattr(self, field.name, field_value)
# process any remaining kwargs:
for (k, v) in kwargs.items():
setattr(self, k, v)
self.__post_init__()
I am curious why extra kwargs are added as class attributes for every class except Orso? Why do we use the specially-created attribute user_data
for that one class?
Attached is a diff, that if applied to this PR (in its current state) makes it pass all the tests. inherit_init_patch.zip
Update: this init function makes all the tests pass
I tried the patch. There are pros and cons. On the one hand it makes all the classes derived from Header
cleaner by removing all the boiler plate. However, it has an unfortunate side effect of making the help/docstring more opaque - the call signature gets replaced by *args, **kwargs
instead of the actual signature that would be accepted.
>>> Person?
Init signature: Person(*args, **kwargs)
Docstring:
Information about a person, including name, affilation(s), and contact
information.
File: ~/Documents/Andy/programming/orsopy/orsopy/fileio/base.py
Type: type
Subclasses:
There would seem to be a tradeoff here. Personally I would prefer to see the correct signature than remove the boilerplate. What are your thoughts? (I wonder if it'd be possible to write yet another decorator that would restore the function signature that would be displayed by help)
I am curious why extra kwargs are added as class attributes for every class except Orso? Why do we use the specially-created attribute user_data for that one class?
I can't remember why it's called user_data
rather than anything else. It's used to include/absorb extra fields from the Header. I'd be wary about removing it though.
(I wonder if it'd be possible to write yet another decorator that would restore the function signature that would be displayed by help)
e.g.
def inherit_docstring_from(cls):
pars = []
for f in fields(cls):
if not f.default:
default = f.default
else:
default = Parameter.empty
p = Parameter(f.name, Parameter.POSITIONAL_OR_KEYWORD, annotation=f.type, default=default)
pars.append(p)
doc = cls.__doc__
cls.__doc__ = "Init signature: " + cls.__name__ + str(Signature(pars)) + "\n" + doc
return cls
@inherit_docstring_from
@dataclass(init=False)
class Pop(Header):
"""
My test
"""
_id: int
unit: Optional[str] = field(default=None, metadata={"description": "SI unit string"})
error: Optional[int] = None
comment: Optional[str] = None
would give something like:
Init signature: Pop(*args, **kwargs)
Docstring:
Init signature: Pop(_id: int, unit: Optional[str] = None, error: Optional[int] = None, comment: Optional[str] = None)
My test
Type: type
Subclasses:
which would be better than nothing
I agree, the lack of meaningful signatures is not great. This problem we're trying to solve, creating classes where core attributes are strongly typed and documented, but where extra attributes can be added (and serialized, and deserialized!), is hard. I have to say it feels weird to tack on random attributes to a defined class here in almost 2024 - while it felt completely normal to me 10 years ago. I think the accepted practice on this is shifting with the mainstream use of type hints and dataclasses.
The advantage of being able to cope with extra attributes can be important in versioning orsopy. For example, we could add new attributes in a later orsopy. Files created by that newer version should still be loadable by an older version that copes with extra attributes.
I propose:
I would be against merging this PR, we should have an extended discussion about this as it is the basis for all orsopy classes. (And it might even break genx or other software the builds on orsopy by extending classes.)
I have a few main objections with this solution:
For me there are two alternative routs that might be more practicle:
I'm sorry to slow you down at this point. I'm aware that this is just based on my opinion and if everyone disagrees you can obviously go ahead in this direction. Unfortunately I didn't have much time to invest in the last couple of months and will be on vacation starting tomorrow. If you don't mind to wait so long, we could have a videocall end of January about this.
I hope you have a good holiday. Given that you'll be away before I can reply in full, would the genx test suite pick up any issues if we were to go down this path?
I think I agree with Artur that we should rethink our approach. It probably merits a longer discussion, when we're all back from the holidays. I think dataclasses provide some helpful features for defining a schema in python, but they are intentionally designed to not support arbitrary init kwargs, since it's kind of the point of a dataclass to define all your fields.
I also don't understand why downstream users couldn't inherit from Header subclasses and add their own fields, instead of tacking attrs to existing classes. For the migration problem (supporting future changes to the schema), I think we'd be much better served by developing migration functions, which is a fairly standard way to handle schema changes. The likelihood that future changes to the ORSO standard structure would be purely additive or subtractive of fields is pretty small, and I think that is the only type of schema change we could try to accomodate with arbitrary kwargs.
I know I am likely wrong about some of my concerns above, but my vote is to hold off on any refactor for now. There's not even e.g. a numba wheel for python 3.12 yet so I don't feel like there's a rush to have complete support for it at the moment.
What do you think of a solution like this: (found on stackoverflow)... instead of trying to force extra kwargs into dataclass init functions (which is hard and fragile), how about adding a new method "from_kwargs" for instantiating one of our classes with extra attributes? https://stackoverflow.com/a/55101438
We could use that function in the deserializer when reading .ort
files, and then we wouldn't have to modify the __init__
. The code for from_kwargs(**kw)
would be much simpler, since we're not trying to do all the things dataclass initializers do, just run the built-in init and then add new kw.
This could be a solution that could also be handled with inheritance for most cases without having to write a custom init for every single case. We will have to discuss how to treat the "comments" case. I'll have a look at the code a gain and will try some prototyping. I think after diving back into the code I can better weigh the various pros and cons, too.
@andyfaff @bmaranville @arm61 Can we find a date for a video call to discuss the strategy? For me the next weeks Mo/Tu look quite good.
I've made a test implementation, have a look at #120
Hi Artur, good to see you back online. I'll have a look at your suggestions in the next few days.
Resolved by #120
Adds a whole bunch of boilerplate in init functions, but would get orsopy working on 3.12.