materialsproject / jobflow

jobflow is a library for writing computational workflows.
https://materialsproject.github.io/jobflow
Other
90 stars 25 forks source link

Remove enum_values keywords in serialization #565

Closed FabiPi3 closed 2 weeks ago

FabiPi3 commented 3 months ago

Instead of storing only the value of an Enum in the output, I suggest to store the whole enum class in order to be able to recreate it properly. This will be possible due to the new enum support in monty and after merging https://github.com/materialsvirtuallab/monty/pull/640 this.

utf commented 3 months ago

I'm concerned this feature might cause an issue for downstream packages. Often we use Enums as keys for dictionaries. This feature in monty seems like it will break this behaviour since dictionaries cannot be hashed and therefore cannot be used as keys. Have you tried running the atomate2 tests with this new branch of jobflow?

FabiPi3 commented 3 months ago

@utf Thanks for your reply. I tried running the atomate2 tests locally and I can't see any issues arising when changing to my latest changes in monty and jobflow

utf commented 3 months ago

Thanks for checking @FabiPi3

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.42%. Comparing base (eda2a65) to head (978e66f). Report is 14 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #565 +/- ## ======================================= Coverage 99.42% 99.42% ======================================= Files 21 21 Lines 1564 1564 Branches 425 425 ======================================= Hits 1555 1555 Misses 9 9 ``` | [Files](https://app.codecov.io/gh/materialsproject/jobflow/pull/565?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=materialsproject) | Coverage Δ | | |---|---|---| | [src/jobflow/core/job.py](https://app.codecov.io/gh/materialsproject/jobflow/pull/565?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=materialsproject#diff-c3JjL2pvYmZsb3cvY29yZS9qb2IucHk=) | `100.00% <100.00%> (ø)` | | | [src/jobflow/core/reference.py](https://app.codecov.io/gh/materialsproject/jobflow/pull/565?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=materialsproject#diff-c3JjL2pvYmZsb3cvY29yZS9yZWZlcmVuY2UucHk=) | `100.00% <100.00%> (ø)` | |
FabiPi3 commented 3 months ago

In general, I would guess that your described use case is still covered. Looking into the actual implementation of jsanitize, we see:

    if isinstance(obj, dict):
        return {
            str(k): jsanitize(
                v,
                strict=strict,
                allow_bson=allow_bson,
                enum_values=enum_values,
                recursive_msonable=recursive_msonable,
            )
            for k, v in obj.items()
        }

which means for me that only the values are serialized, the keys are still taken as strings. Hope this helps.

gpetretto commented 3 months ago

Hi @FabiPi3, I am also thinking that there might some consequences to this change. To illustrate this I made an example with different objects and (de)serialization, including pydantic base models, that are often present in jobflow's output.

Here is the code:

from monty.json import MSONable, jsanitize
from enum import Enum
from pydantic import BaseModel

class E(Enum):
    A = "A"

class ME(MSONable, Enum):
    B = "B"

class BM1(BaseModel):
    e: E = E.A

class BM2(BaseModel):
    me: ME = ME.B

d = {"e": E.A, "me": ME.B}

print("Dictionary")
for obj in [{"e": E.A}, {"me": ME.B}]:
    for enum_values in (True, False):
        try:
            print(f"{obj}, {enum_values}:", jsanitize(obj, strict=True, allow_bson=True, enum_values=enum_values))
        except AttributeError:
            print(f"{obj}, {enum_values}: failed")

print("\npydantic")
for cls in [BM1, BM2]:
    for enum_values in (True, False):
        try:
            print(f"{cls}, {enum_values}:", jsanitize(cls(), strict=True, allow_bson=True, enum_values=enum_values))
            try:
                print(cls.model_validate(jsanitize(cls(), strict=True, allow_bson=True, enum_values=enum_values)))
            except Exception as e:
                print(f"{obj}, {enum_values}: pydantic failed: {e}")    
        except AttributeError:
            print(f"{obj}, {enum_values}: jsanitize failed")

Here is the output with the standard monty version:

Dictionary
{'e': <E.A: 'A'>}, True: {'e': 'A'}
{'e': <E.A: 'A'>}, False: failed
{'me': <ME.B: 'B'>}, True: {'me': 'B'}
{'me': <ME.B: 'B'>}, False: {'me': {'@module': '__main__', '@class': 'ME', '@version': None, 'value': 'B'}}

pydantic
<class '__main__.BM1'>, True: {'e': 'A', '@module': '__main__', '@class': 'BM1', '@version': None}
e=<E.A: 'A'>
{'me': <ME.B: 'B'>}, False: jsanitize failed
<class '__main__.BM2'>, True: {'me': 'B', '@module': '__main__', '@class': 'BM2', '@version': None}
{'me': <ME.B: 'B'>}, True: pydantic failed: 1 validation error for BM2
me
  Value error, Must provide ME, the as_dict form, or the proper [type=value_error, input_value='B', input_type=str]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error
<class '__main__.BM2'>, False: {'me': {'@module': '__main__', '@class': 'ME', '@version': None, 'value': 'B'}, '@module': '__main__', '@class': 'BM2', '@version': None}
me=<ME.B: 'B'>

Here is the output of your new monty version:

Dictionary
{'e': <E.A: 'A'>}, True: {'e': 'A'}
{'e': <E.A: 'A'>}, False: {'e': {'value': 'A', '@module': '__main__', '@class': 'E', '@version': None}}
{'me': <ME.B: 'B'>}, True: {'me': 'B'}
{'me': <ME.B: 'B'>}, False: {'me': {'@module': '__main__', '@class': 'ME', '@version': None, 'value': 'B'}}

pydantic
<class '__main__.BM1'>, True: {'e': 'A', '@module': '__main__', '@class': 'BM1', '@version': None}
e=<E.A: 'A'>
<class '__main__.BM1'>, False: {'e': {'value': 'A', '@module': '__main__', '@class': 'E', '@version': None}, '@module': '__main__', '@class': 'BM1', '@version': None}
{'me': <ME.B: 'B'>}, False: pydantic failed: 1 validation error for BM1
e
  Input should be 'A' [type=enum, input_value={'value': 'A', '@module':...: 'E', '@version': None}, input_type=dict]
<class '__main__.BM2'>, True: {'me': 'B', '@module': '__main__', '@class': 'BM2', '@version': None}
{'me': <ME.B: 'B'>}, True: pydantic failed: 1 validation error for BM2
me
  Value error, Must provide ME, the as_dict form, or the proper [type=value_error, input_value='B', input_type=str]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error
<class '__main__.BM2'>, False: {'me': {'@module': '__main__', '@class': 'ME', '@version': None, 'value': 'B'}, '@module': '__main__', '@class': 'BM2', '@version': None}
me=<ME.B: 'B'>

and just to hopefully make it clearer, here is the diff of the two outputs:

3c3
< {'e': <E.A: 'A'>}, False: failed
---
> {'e': <E.A: 'A'>}, False: {'e': {'value': 'A', '@module': '__main__', '@class': 'E', '@version': None}}
10c10,13
< {'me': <ME.B: 'B'>}, False: jsanitize failed
---
> <class '__main__.BM1'>, False: {'e': {'value': 'A', '@module': '__main__', '@class': 'E', '@version': None}, '@module': '__main__', '@class': 'BM1', '@version': None}
> {'me': <ME.B: 'B'>}, False: pydantic failed: 1 validation error for BM1
> e
>   Input should be 'A' [type=enum, input_value={'value': 'A', '@module':...: 'E', '@version': None}, input_type=dict]

Considering that you propose to pass from jsanitize(strict=True, allow_bson=True, enum_values=True) to jsanitize(strict=True, allow_bson=True, enum_values=False), I can see some key differences

1) the jsanitize for {'e': E.A} that failed, now passes and is consistent with what the MontyEncoder does. Which is good.

2) The jsanitized document is also used for storing the output. This means that in all the cases we pass from a serialized version of the form {'e': 'A'} to one of the form {'e': {'value': 'A', '@module': '__main__', '@class': 'E', '@version': None}} in the output database. So, any mongodb query of the type {"e": "A"} would be broken and should be modified in {"e.value": "A"}. I am not sure how many such queries could be out there, but I suppose this could definitely break something.

3) In the case of BM1, going from enum_values=True to enum_values=False in the new version of monty also means that the pydantic model cannot be generated with model_validate from the serialized dictionary. In Jobflow MontyDecoder is used to deserialize, and thus keeps working correctly. So it is not strictly an issue for jobflow, but I am wondering if there are some cases where this could fail (there are in jobflow-remote, but I suppose can be fixed, if needed).

In general there is definitely a benefit to this, but some issues may arise.

I would also suggest to run the tests of emmet (and maybe maggma?) with the new monty version, to check that eveything works fine there as well.

FabiPi3 commented 3 months ago

@gpetretto Thanks for your detailed answer. I cannot really judge how large those issues would be, but in my opinion everything should be fixable.

Concerning the tests of emmet and maggma, everything that worked locally for me also worked with the new monty version. At least I couldn't see any issues arising there with the changes in monty. Hope this helps.

FabiPi3 commented 3 months ago

The pull request in monty was accepted. We could go ahead and merge this now. If I understand, there are no really big issues preventing this. Please comment again about your opinion.

Else I would close this (and adapt my application code). Thanks