openscm / scmdata

Handling of Simple Climate Model data
https://scmdata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Serialisation #243

Open znichollscr opened 1 year ago

znichollscr commented 1 year ago

Is your feature request related to a problem? Please describe.

Sometimes it would be handy to be able to serialise an ScmRun instance into common formats (e.g. yaml, json). While these formats aren't that performant, in some cases the standardisation is more important than the performance.

Describe the solution you'd like

It would be great to have some serialisation/structuring and unstructuring functionality shipped with scmdata. If we moved to making ScmRun an attrs class, we could get it almost for free using cattrs (see below).

Describe alternatives you've considered

I haven't considered many alternatives. I did think about whether there is a better way to do this (e.g. a datapackage-style approach). However, I think it is much better and more explicit to have something like a FileBackedDataStore, composed of an ScmRun object (the data) and a file path (where to serialise the data to and deserialise it from), than to try to magically pick places on disk. Maybe I am wrong about that, and we could instead introduce a default path, e.g. ~/.scmdata_cache.
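To make the FileBackedDataStore idea concrete, here is a minimal sketch using only the standard library. The class layout, the `save`/`load` method names and the use of a plain dict in place of an unstructured ScmRun are all assumptions for illustration, not an existing scmdata API.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict


@dataclass
class FileBackedDataStore:
    """Hypothetical pairing of data with the explicit path it is persisted to."""

    # Stand-in for an unstructured ScmRun (assumption: a JSON-friendly dict)
    data: Dict[str, Any]
    # Where the data is serialised to and deserialised from
    path: Path

    def save(self) -> None:
        # Create the parent directory if needed, then write the data as JSON
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.data))

    @classmethod
    def load(cls, path: Path) -> "FileBackedDataStore":
        # Rebuild the store from whatever is on disk at the given path
        return cls(data=json.loads(path.read_text()), path=path)


with tempfile.TemporaryDirectory() as tmp:
    store = FileBackedDataStore(
        data={"columns": [2020, 2021], "data": [[1.0, 1.1]]},
        path=Path(tmp) / "run.json",
    )
    store.save()
    reloaded = FileBackedDataStore.load(store.path)
```

Because the path is always passed in by the caller, nothing is written to a location the user did not choose; a default such as ~/.scmdata_cache could be layered on top later.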

Additional context

In another project, I have used this code to make ScmRun objects convertible to yaml. It is a hack, but maybe a start.

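from typing import Any, Dict

import numpy as np
import pandas as pd
import scmdata

def remove_np_str(in_v):
    """Recursively convert numpy strings to plain str so yaml can dump them."""
    # Urgh, so slow
    if isinstance(in_v, np.str_):
        return str(in_v)

    if isinstance(in_v, tuple):
        return tuple(remove_np_str(v) for v in in_v)

    if isinstance(in_v, list):
        return [remove_np_str(v) for v in in_v]

    if in_v is None or isinstance(in_v, (int, float, str)):
        return in_v

    # Fail loudly instead of implicitly returning None for unsupported types
    raise NotImplementedError(type(in_v))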

def unstructure_base_scm_run(in_run: scmdata.run.BaseScmRun) -> Dict[str, Any]:
    ts = in_run.timeseries(time_axis="year")

    out = {}
    for k, v in ts.to_dict("tight").items():
        if isinstance(v, list):
            v_conv = [remove_np_str(vv) for vv in v]
        else:
            raise NotImplementedError(v)

        out[k] = v_conv

    return out

# cattrs passes the target type as the second argument; it is unused here
def structure_base_scm_run(values: Dict[str, Any], other) -> scmdata.run.BaseScmRun:
    ts = pd.DataFrame.from_dict(values, orient="tight")
    out = scmdata.run.BaseScmRun(ts)

    return out

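# Assumption: converter_yaml is a cattrs converter preconfigured for PyYAML
from cattrs.preconf.pyyaml import make_converter

converter_yaml = make_converter()
converter_yaml.register_unstructure_hook(scmdata.run.BaseScmRun, unstructure_base_scm_run)
converter_yaml.register_structure_hook(
    scmdata.run.BaseScmRun, structure_base_scm_run
)

lewisjared commented 1 year ago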

I've serialized ScmRun objects as JSON using the following function in the past, i.e. for live.magicc.org. If we do support a JSON format, I'd have a strong preference for it to be consistent with this one.

def _run_to_records_index(data: scmdata.ScmRun, time_axis="year") -> list[dict[str, Any]]:
    ts = data.timeseries(time_axis=time_axis)
    ts = ts.astype("object").where(pd.notnull(ts), None)
    timeseries = ts.to_dict(orient="records")
    headers = ts.index.to_frame()
    headers = (
        headers.astype("object")
        .where(pd.notnull(headers), None)
        .to_dict(orient="records")
    )

    return [{"columns": h, "data": t} for h, t in zip(headers, timeseries)]
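As a rough illustration of the layout this function returns, the sketch below builds one record by hand and round-trips it through the standard `json` module. The metadata values and the `restore_years` helper are made up for the example. The sketch assumes a year time axis, so the keys under "data" are integers; JSON object keys are always strings, so they need converting back after loading.

```python
import json

# Hypothetical record in the "columns"/"data" layout: one entry per timeseries,
# metadata under "columns" and values keyed by year under "data".
records = [
    {
        "columns": {
            "model": "example_model",
            "scenario": "example_scenario",
            "region": "World",
            "variable": "Surface Temperature",
            "unit": "K",
        },
        "data": {2020: 1.2, 2021: None},  # None marks a missing value
    },
]

# allow_nan=False makes json raise if a NaN was not replaced by None,
# which keeps the output valid JSON.
payload = json.dumps(records, allow_nan=False)


def restore_years(raw):
    """Convert the year keys back to int after json.loads."""
    return [
        {
            "columns": record["columns"],
            "data": {int(year): value for year, value in record["data"].items()},
        }
        for record in raw
    ]


restored = restore_years(json.loads(payload))
```

Replacing missing values with None before dumping, as the function above does, is what keeps NaN out of the payload.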