CombinedData serialization with monty is broken

rkingsbury commented 3 years ago

Describe the bug I am unable to serialize and deserialize a CombinedData object using loadfn / dumpfn. I believe this occurs because CombinedData inherits from_dict from LammpsData, but CombinedData.init() takes different kwargs.

To Reproduce

Create a CombinedData object called data

from monty.serialization import loadfn, dumpfn

dumpfnn(data, 'some_path.json')
data = loadfn('some_path.json')

will fail with

------------------------------------------------------------------------
TypeError                              Traceback (most recent call last)
<ipython-input-26-b8f560ec17df> in <module>
      1 # data = LammpsData.from_file(pipeline_dir / '2021-05_prelim_lammps_TMA-Cl' / 'md.data')
----> 2 data = loadfn(pipeline_dir / '2021-05_prelim_lammps_TMA-Cl' / 'combined_data.json')

~/miniconda3/envs/md/lib/python3.9/site-packages/monty/serialization.py in loadfn(fn, *args, **kwargs)
     83                 if "cls" not in kwargs:
     84                     kwargs["cls"] = MontyDecoder
---> 85                 return json.load(fp, *args, **kwargs)
     86 
     87             raise TypeError("Invalid format: {}".format(fmt))

~/miniconda3/envs/md/lib/python3.9/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    291     kwarg; otherwise ``JSONDecoder`` is used.
    292     """
--> 293     return loads(fp.read(),
    294         cls=cls, object_hook=object_hook,
    295         parse_float=parse_float, parse_int=parse_int,

~/miniconda3/envs/md/lib/python3.9/json/__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    357     if parse_constant is not None:
    358         kw['parse_constant'] = parse_constant
--> 359     return cls(**kw).decode(s)

~/miniconda3/envs/md/lib/python3.9/site-packages/monty/json.py in decode(self, s)
    381         """
    382         d = json.JSONDecoder.decode(self, s)
--> 383         return self.process_decoded(d)
    384 
    385 

~/miniconda3/envs/md/lib/python3.9/site-packages/monty/json.py in process_decoded(self, d)
    354                     data = {k: v for k, v in d.items() if not k.startswith("@")}
    355                     if hasattr(cls_, "from_dict"):
--> 356                         return cls_.from_dict(data)
    357             elif np is not None and modname == "numpy" and classname == "array":
    358                 if d["dtype"].startswith("complex"):

~/miniconda3/envs/md/code/pymatgen/pymatgen/io/lammps/data.py in from_dict(cls, d)
    917             topology = {k: decode_df(v) for k, v in topology.items()}
    918         items["topology"] = topology
--> 919         return cls(**items)
    920 
    921     def as_dict(self):

TypeError: __init__() got an unexpected keyword argument 'box'

Expected behavior The CombinedData object should be created from the file

Desktop (please complete the following information):

Windows 10 / WSL 2 Ubuntu
Version 2022.0.9

rkingsbury commented 3 years ago

FYI @htz1992213 @orioncohen

rkingsbury commented 3 years ago

Per a side discussion with @mkhorton , we might want to consider adding DataFrame support to monty so that we don't have to implement custom as_dict / from_dict. If we do that, just inheriting from MSONable and making sure that all kwargs are stored as class attributes should be enough. But currently MSONable doesn't support serializing DataFrame. We can discuss. See https://github.com/materialsproject/pymatgen/issues/2138

orionarcher commented 3 years ago

Thanks for bringing this up @rkingsbury. If I am reading this correctly, the method is throwing an error when you call data = loadfn('some_path.json'), right? What would adding DataFrame support to monty involve? Would that be as easy as importing a specialized package?

I support using monty to serialize without using custom as_dict and from_dict methods. My one concern is that serialization is not very persistent if we change the underlying object code. If we want to be able to store permanently in MongoDB, it might be wise to create a more flexible schema for storage.

htz1992213 commented 3 years ago

@rkingsbury Will make a patch overriding superclass method. I might set some method, e.g. from_ff_and_topologies, to be unsupported.

rkingsbury commented 3 years ago

Thanks for bringing this up @rkingsbury. If I am reading this correctly, the method is throwing an error when you call data = loadfn('some_path.json'), right? What would adding DataFrame support to monty involve? Would that be as easy as importing a specialized package?

I think what would be required is to add a type check to MontyEncoder, similar to what is currently done for datetime objects (here).

I'm a little out of my depth here, but from #2138 I know that this works:

df.to_json(default_handler=MontyEncoder().encode)

and loadfn uses MontyEncoder by default (as I understand it), so I think we would just need to add something to loadfn and dumpfn like:

if isinstance(o, pd.DataFrame):
    return o.to_json(default_handler=MontyEncoder().encode)

@mkhorton does that look right to you?

mkhorton commented 3 years ago

Hah, yes! I actually typed out a reply to this saying exactly that:

It would mean adding pandas to this block and then encoding the data frame using df.to_json(default_handler=MontyEncoder().encode). In this way, any data frame can be serialized automatically provided it contains only native or MSON data types, could be stored in Mongo, etc.

but forgot to submit it 🤦🏻‍♂️

Note that this is slightly different to your proposal just in that the code should be in MontyEncoder/MontyDecoder, rather than needing to touch loadfn or dumpfn.

rkingsbury commented 3 years ago

Thanks @mkhorton ! So something like

if modname and modname not in ["bson.objectid", "numpy"]:
     if modname == "pandas" and classname == "DataFrame":
            return= d.to_json(default_handler=MontyEncoder().encode)
     if modname == "pandas" and classname == "Series":
             return= d.to_json(default_handler=MontyEncoder().encode)

?

mkhorton commented 3 years ago

Yes, I believe so, but I haven't tried -- see here too. Note that including the @module and @class is important, and then the pandas json could be under a data key.

It's not clear to me if Series is necessary, or if additional pandas data types would also be useful, but sounds reasonable to me.

htz1992213 commented 3 years ago

@rkingsbury @mkhorton @orioncohen I have made a PR (https://github.com/materialsproject/pymatgen/pull/2191) for serialization using conventional encoding, but please let me know if the newer way works!

JaGeo commented 10 months ago

@rkingsbury this issue can be closed, right?

rkingsbury commented 10 months ago

@rkingsbury this issue can be closed, right?

I think so? It looks like #2191 should have taken care of it, although I have not personally tried serializing the returned dict

materialsproject / pymatgen

CombinedData serialization with monty is broken #2188