Open george-zubrienko opened 11 months ago
@lidatong @s-vitaliy @matt035343 @healthmatrice @artificial-aidan and many others please let us know your thoughts on this one :)
Updated to make code look more like its actual python and not Java lol. Re how we will, if agreed, do this - by creating a new "release" v1 branch and working there so people can test the lib from git commit before we even consider pushing to PyPI.
I probably will hate the metaclass implementation. I used to use lots of metaclasses in one of my project. But in the end we almost ditched all the metaclasses and I never used them ever in other projects. Because it is still a class factory which can create lots of headache for type checkers.
I think mypy is still struggling https://github.com/python/mypy/issues/2653 https://github.com/python/mypy/labels/topic-metaclasses
I probably will hate the metaclass implementation. I used to use lots of metaclasses in one of my project. But in the end we almost ditched all the metaclasses and I never used them ever in other projects. Because it is still a class factory which can create lots of headache for type checkers.
I think mypy is still struggling https://github.com/python/mypy/issues/2653 https://github.com/python/mypy/labels/topic-metaclasses
This is good to know! If mypy has issues with those, then the metaclass approach doesn't seem like a viable alternative.
I probably will hate the metaclass implementation. I used to use lots of metaclasses in one of my project. But in the end we almost ditched all the metaclasses and I never used them ever in other projects. Because it is still a class factory which can create lots of headache for type checkers.
Mostly agreed. Furthermore, if anyone does already use metaclasses, their code would become even more broken because all metaclasses should derive from the same parent which is not always the case.
@george-zubrienko personally I like the idea of going forward, but I think this changeset is too breaking to be a root cause of dependency management mess. The reason being is the fact DCJ is often used in the dependency libraries or frameworks, and since there's no way for pip to provide different versions for each of them, if any of them wants to update, every other one should update too.
IMAO, we should keep the backwards compatibility while changing the core logic behind AND throwing a ton of DepreciationWarnings in the process.
The real improvement I can ask for is to allow user to use precompilation step for (de)serializer (similarly to how DCs constructor work) and use those instead of relying on the meta fields. This change could be breaking and/or not working in existing user cases, thus I suggest hiding it behind either enable optimization Python flag or some DCJ setting (either per-class or global).
frameworks, and since there's no way for pip to provide different versions for each of them, if any of them wants to update, every other one should update
Poetry solves this issue, but if we bump to 1.0 and cut out current functionality, users wont be able to mix those anyway. Worst case people will (they really should though) pin major to 0.x until their lib is ready to upgrade.
IMAO, we should keep the backwards compatibility while changing the core logic behind AND throwing a ton of DepreciationWarnings in the process.
One way could be by making v1 API opt-in instead of forced-use and as you said, get some deprecation warnings in. But then how long v0 API should stay in code, and why can't we rely on people pinning major?
de)serializer (similarly to how DCs constructor work) and use those instead of relying on the meta fields.
I tend to get rid of annotation usage all together to be fair, so in your case this would go to a separate configuration/serializer class like MyTypeSerializer
that will be picked from global scope.
The reason I think hiding constructor logic in the annotation is not a great choice (also applies to how dataclasses are implemented in python) is its poor compatibility with code analysis tools and the fact their order matters when multiple are used, and the fact it mutates whatever goes in instead of adding metadata to the type that can be used at runtime.
I tend to get rid of annotation usage all together to be fair, so in your case this would go to a separate configuration/serializer class like
MyTypeSerializer
that will be picked from global scope.
The second part of my message has nothing to do with annotations.
Imagine this code:
from dataclasses_json import settings
from dataclasses_json import JsonSerializer
settings.enable_optimizations()
JsonSerializer[DataPerson].to_json(d_person)
Which internally checks if there is a compiled serializer and, if not, compiles one so the futher runs would be faster.
JsonSerializer[DataPerson].from_json('{"name": "lidatong"}') # DataPerson(name="lidatong")
How would you achieve such behavior? Usually generics won't provide their own type variables (unless typing
has changed how it works)
Imagine this code:
from dataclasses_json import settings from dataclasses_json import JsonSerializer settings.enable_optimizations() JsonSerializer[DataPerson].to_json(d_person)
Which internally checks if there is a compiled serializer and, if not, compiles one so the futher runs would be faster.
Makes sense, I was thinking of having something like dataclasses_json.initialize().with_serializer(<custom serializer here)
So initialize
does internally checks if there is a compiled serializer and, if not, compiles one so the futher runs would be faster.
and with_serializer
adds user-defined serializers
How would you achieve such behavior? Usually generics won't provide their own type variables (unless
typing
has changed how it works)
You can check for the actual type provided by class-scoped type variable and then try find a configured (compiled) serializer for it. Small correction, code will look like
JsonSerializer[DataPerson]().from_json('{"name": "lidatong"}') # DataPerson(name="lidatong")
Probably also related to #84 and #264
is there any plan to support discriminator? https://github.com/Fatal1ty/mashumaro#discriminator-config-option
is there any plan to support discriminator? https://github.com/Fatal1ty/mashumaro#discriminator-config-option
I don't think we will need that at all with the new API. As I mentioned, class hierarchies should be ser-desered naturally in v1, at least that would be the goal. Re other libraries, I don't think taking functionality from there is a healthy thing. DCJ has a great core and what we seek is to improve that, not port features from another libs, especially if they won't be needed (hopefully).
hey thanks so much @george-zubrienko for writing this up. i see that there is a lot of good thinking here and triggered a nice discussion :) thanks especially for synthesizing the user issues and pain points
in general i would strongly prefer not to introduce such a major breaking change. technically, keeping the library pre-1.0 "reserves that right" but there are enough users of the library and downstream libraries that a breaking change to the core API would be suboptimal (https://github.com/lidatong/dataclasses-json/network/dependents) (reading above basically echoing @USSX-Hares)
if we do decide to do breaking changes, i would really like them to be contained (e.g. refactoring the way unrecognized fields are handled is one thing on my mind) or perhaps gating them behind "optional" features like enabling a more optimized build of the library (i've been floating the idea of re-writing the parser core in C)
however, i do see a lot of value in what you wrote about introducing the top-level from_json
, etc. in addition to the reasons you stated, one existing issue is ser/de types that you don't define (e.g. A depends on B and wants to ser/de classes in B). you can workaround with DataClassJsonMixin.to_json
but it's not obvious and not documented
(side note: can it just be from_json[T](...)
? i haven't played around too much with python typing)
so basically i think we can have our cake and eat it too by introducing additive API changes. so users that want the quality of life improvements that come with top level from_json
etc. can upgrade if they choose. but users that don't want to refactor their codebase can also continue to get improvements and it avoids a Python 2 / 3 maintain both situation
let me know if that makes sense to you? and thanks again this writeup is much appreciated
quick sketch:
to_json(person) # more typing-friendly, does not modify your user type, can be done on third-party types
person.to_json() # existing API, schema pre-compiled at module load time so maybe faster (doesn't do this currently)
Hey thanks for kind words, I do believe something good will come out of this :)
Re from_json[T](...)
would be nice but python doesn't allow generics on methods this way, if I read the typevar docs correctly. You can only do something like this
from collections.abc import Sequence
from typing import TypeVar
T = TypeVar('T') # Declare type variable "T"
def first(l: Sequence[T]) -> T: # Function is generic over the TypeVar "T"
return l[0]
a = first([1,2,3,4]) # 1: int
b = first(["a", "b", "c"]) # "a": str
so basically i think we can have our cake and eat it too by introducing additive API changes. so users that want the quality of life improvements that come with top level from_json etc. can upgrade if they choose. but users that don't want to refactor their codebase can also continue to get improvements and it avoids a Python 2 / 3 maintain both situation
I think it makes sense not only to me, but also to other contributors, as well as users. I tried to lay out different options so people can iterate a bit in their heads and comment on the solution they would like/support most. I think this is exactly what is happening right now, and it is good to see people being more or less aligned on how we should implement changes. I expect a bit more reactions here next week so we have a full picture. TLDR, the very top-level plan would be:
Sort of related, but coming up with a way to do CI testing for typing on the new APIs. Typing has sort of become a requirement in python recently (from what I'm seeing) and making sure that all of the type checkers handle the new APIs would be great.
My brief glance over the current CI is that it just runs mypy on the codebase (which isn't a bad thing), but a set of test files exercising the different corner cases for typing would be ideal.
And maybe this just looks like running the type checkers on the test files š¤·āāļø
Re
from_json[T](...)
would be nice but python doesn't allow generics on methods this way, if I read the typevar docs correctly. You can only do something like this
Actually, we could do something like that by manually defining __getitem__
operation for the method.
My brief glance over the current CI is that it just runs mypy on the codebase (which isn't a bad thing), but a set of test files exercising the different corner cases for typing would be ideal.
@artificial-aidan if you came up with these new test cases it would be great.
@george-zubrienko, something like that:
... I was about to write here, but then noticed a really strange behaviour of PyCharm -- when I call these test code from the file I defined the JsonSerializer
and to_json
, it correctly shows all type errors and all but one deprecation mark, but when from the other file, it goes rogue.
This may be related to this PyCharm issue I've reported a while ago. Can be fixed by explicitly defining the type:
JsonSerializer: JsonSerializerGeneric = JsonSerializerGeneric()
to_json = JsonSerializer
I've found the more proper solution that works with type checkers but requires metaclasses.
Looks great, I'll have a more thorough look later this week. Re metaclasses, I think given the concerns expressed about them, maybe we should consider that as a last resort option, but good that it works :)
Re the to_json[MyClass]
notation, it is a bit more java-ish, though looks good and clean. However, the question is, what does it bring to the table compared to more verbose-ish Python syntax JsonSerializer[MyClass]().to_json
or JsonSerializer[MyClass].to_json
?
Unfortunately, GH doesn't support custom emoji reactions :c.
I prefer option 2 because it's more clean. Alternatively, we can use the following construction (or any similar):
JsonSerializer.to_json(data, model=MyClass, **options)
JsonSerializer.to_json(data, model=MyClass, **options)
I like this one, but I'd also like to hear what other think, so let's see what pops up during the week!
On this message, and messages directly below, vote with š or š . Any variations are allowed. Suggestions, as well as new options, are allowed.
JsonSerializer[MyClass]().to_json(data)
(with parenthesis)
JsonSerializer[MyClass].to_json()
(No parenthesis)
JsonSerializer.to_json(data, MyClass)
(Class passed as argument)
@USSX-Hares, I would prefer Option 2, but why do you need the generic type for converting to JSON at all? I suppose, we should have a generic from_json method that returns T and untyped to_json that accepts any dataclass.
Also https://github.com/lidatong/dataclasses-json/issues/31 should be solved in v1 API more or less - I'm switching pins so this issue gets a bit more attention when people come in :)
@USSX-Hares, To make two first options work, there is needed some magic code that (imo) reduces the transparency of this library.
What about a fourth option:
JsonSerializer(MyClass).to_json(data)
Basically using the constructer to initialize instead of the custom getter?
I'm actually in favor of option 4, as it is more pythonic and involves less magic method juggling.
JsonSerializer.to_json(data)
and JsonSerializer(MyClass).from_json(data)
We'll also need serializer registration as a bonus:
dataclasses_json.serializers.register(...)
In order to make the lib thread-safe, register should be scoped to current thread (no globals), but available for forking in other threads.
I'll start creating issues for implementing v1 API features from Monday :)
A side note: we should allow users to force data type for to_json
method for situation when they need to send data without any extra fields added by children classes.
Also, I still recommend the option 3 since it is still Pythonic and avoids confusion like "should I instantiate JsonSerializer for this particular case?" unless you can both always do that and pass additional arguments to the constructor which were per-call parameters earlier.
Will open a PR for this one this/next week to get some code in https://github.com/lidatong/dataclasses-json/issues/453
More issues to be created soon after :)
All issues labelled with api/v1
and not-currently-assigned are up for grabs once we complete the #453. I'll also put help-wanted on them after that.
@george-zubrienko and the others, I am continuing my little interrogation.
dataclasses_json.CatchAll
.CamelCaseClass
, whose fields should be serialized as camel case, and a class PleasNoMore
which actively declines any extra parameters; one is a field on another.
Great questions :) below are my thoughts
- How the annotation-related features should be used in the new, non-decorated approach? I refer to the things like
dataclasses_json.CatchAll
.
I think most of those features should be part of "base serializer case", so we just build them into the JsonSerializer
class base functionality. For example, the CatchAll
behaviour can be kept as-is, if enabled with a flag on JsonSerializer
constructor, or disabled/reconfigured.
Maybe we can also consider exposing those via toml file config section? Like this:
[tool.dataclasses_json.config]
undefined_fields = INCLUDE
letter_case = Camel
default_serializer = dataclasses_json.serializers.JsonSerializer
...
IMO that would be cool and clear and no need for runtime code config.
- How the settings merging should be done? Let's assume we have class
CamelCaseClass
, whose fields should be serialized as camel case, and a classPleasNoMore
which actively declines any extra parameters; one is a field on another.
Excellent question, so I'd vote for this:
class CamelCaseClass:
no_more: PleasNoMore
class PleaseNoMore:
me_no_camel: int
# assume PleaseNoMore forces default ser, while CameCase wants Camel letter case
a = CamelCaseClass(no_more=PleaseNoMore(me_no_camel=1))
to_json(a) # '{"noMore": { "me_no_camel": 1 }}'
If it does not, we inherit settings from the parent, if not from parent, then from global
- How they should be registered?
I'm actually so in favor of the toml file, I'd suggest instead of python code registration, we move this to toml file like i did in the example. We can have smth like
[tools.dataclasses_json.config]
serializers = ["dataclasses_json.serializers.JsonSerializer", "my_package.dcj_serializers.*:, ...]
Registration reads this and adds them all to "current thread" cache. Note that "current thread" may be a bit of wishful thinking on my end re thread safety, so take that with a bit of salt :)
- How the (de)serializer should be called?
When to_json
/from_json
is invoked by user code, we first take the target type they specify or, if not provided, the passed object type for to_json
case. We go through fields(my_dataclass)
and for each field type we do smth like
serde = dataclasses_json.config.registered_serializers.get(field.type, getattr(my_dataclass, field.name))
If we have such serializer, we apply the respecitve to_json
/from_json
from it, which will again, scan fields, for each field type invoke the registered serializer. Now, if we do not have the registered serializer for the type, we throw a SerializerNotDefinedError
, where we specify the type we cannot find the serDe class for.
Now for the settings like letter etc, maybe this? Then each serializer instance can read them from conf easy?
[tool.dataclasses_json.config] # global section
undefined_fields = INCLUDE
letter_case = Camel
default_serializer = dataclasses_json.serializers.JsonSerializer
serializers = ["dataclasses_json.serializers.JsonSerializer", "my_package.dcj_serializers.*:, ...]
[tool.dataclasses_json.config.my_package.dcj_serializers.MyNewSerializer]
undefined_fields = IGNORE
letter_case = Kebab
...
- When and where these parameters are defined, and what to do when a conflict occurs?
Looking at toml, just follow the levels, if not defined for this serializer, take from global.
- Can we override them per-stack-only?
Yes, if we allow to_json/from_json
to have an additional argument overriding serializer settings
- Should we allow complex user (de)serializer registration? Like allowing registering generic types (which should know over which they are generic), or specific rules or conditions when/how the (de)serializer should run?
That would be hard to implement, but even it was easy, I think stuff like this can lead to user code bases losing in maintainability and readability. Depends on the feature, however. Like, conditions to run or not run the specific ser behaviour can be controlled by class structure/class serializer settings. Generics, on the other hand, idk :)
Alright starting the work on this one finally
Description
Based on some discussions we see in issues around like #433 , #243, #139, #106 and probably more, I can pinpoint a few common pains:
@dataclass_json
or the subclass approach withDataClassJsonMixin
, the latter being similar to whatmashumaro
does. There are also a few things that can only be done from the annotation or global/field level config, most common example beingletter_case=...
.pylint
hates it so in our projects we always use both, which adds boilerplate and confusion, especially for new developers. Now we can also see from both recent and past issues that people in general have to be creative to make their static code analysis not hate the lib. This is a usability problem and I think it must be addressed one way or another.I'd like to propose a breaking API change in 1.0 release to get rid of those issues, and potentially greatly improve user experience.
Possible solution
Consider current quickstart:
Change it to this:
Now, this removes the need to have either annotation OR the subclass and simply relies on the class def supplied by generic/type arg (subject for discussion) and generates the ser-deser boilerplate under the hood. This approach formalizes the return type of all ser-deser operations, thus making static code analysis happy, fixing autocompletion AND removes the mutation of usercode.
Note this also allows to supply encoder-decoder configuration on either global or method level via
dataclasses_json.cfg.global_config.encoders
,dataclasses_json.cfg.global_config.decoders
. For method level we can allow to override decoders/encoders viaencoders=...
anddecoders=...
argument.For class level, we can do the same via
dataclasses_json.cfg.global_config.encoders[DataPerson]
. For class field level this will be a problem. Alternative to all this can be ditchingdataclasses_json.cfg.global_config.encoders/decoders
and instead doing this:Then users can create a simple initialization file for their code where this is all set up once and for the lifetime of the app. Other big plus of this approach is that we essentially lock ser-deser pair for the user and will be able to provide clear and understandable error message if things go sideways.
Alternatives
As an alternative, we can only ditch annotation and instead convert
mixin
to a metaclass, which will allow us to mutate user code as we see fit to achieve proper ser-deser behaviours. This would allow us to achieve similar results by rewriting less code, but will require people to have a bit higher level knowledge of python in order to understand how the lib works and how to use it for their (potentially) complicated use case. Also metaclasses is one of the areas that has seen a breaking change not so long ago.Context
For context, some thoughts from community members on this: