itamarst / eliot

Eliot: the logging system that tells you *why* it happened
https://eliot.readthedocs.io
Apache License 2.0
1.1k stars, 66 forks

Feature Request: Support for Pandas DataFrames #453

Open marwan116 opened 4 years ago

marwan116 commented 4 years ago

Thank you for open-sourcing the eliot logging library.

I have a question about the decision to use JSON to serialize the logs, specifically when it comes to scientific computing. Passing a pandas object as an argument fails with "Object of type DataFrame is not JSON serializable". Had the choice been made to use YAML, this would not have been an issue.

Can you shed some light on the necessity of using JSON vs YAML for eliot's purposes - and what do you think about using YAML instead?

marwan116 commented 4 years ago

As a follow-up: looking at the implementation of to_file, I see to_file(output_file, encoder=EliotJSONEncoder). Would the change be as simple as creating and using an "EliotYAMLEncoder" here?

itamarst commented 4 years ago

YAML doesn't magically enable Pandas DataFrames. The default Python YAML library will (de)serialize arbitrary objects, but that's insecure, at least for deserialization (the safe_* variants won't do that for that reason). So I recommend against it.

Some options:

  1. Eliot does have pluggable serializers for the JSON destination (it's how it serializes NumPy to JSON). I've already considered adding support for Pandas, so I will try to do that sometime soon.
  2. You can also plug in your own serialization system by adding a custom destination, an arbitrary function that can do anything it wants with logged messages (https://eliot.readthedocs.io/en/stable/outputting/output.html#configuring-logging-output). You could write one that opens a file and writes out YAML if you wish.
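
Option 2 can be sketched roughly as follows. This is a minimal illustration, not Eliot code: FileDestination and to_text are hypothetical names, and the only assumption about Eliot is that a destination is a callable receiving each message as a dict. The serializer is pluggable, so a YAML dumper could be passed in place of the JSON stand-in used here to keep the example dependency-free.

```python
import io
import json


class FileDestination:
    """Hypothetical destination: a callable that receives each message dict."""

    def __init__(self, stream, serialize):
        self._stream = stream        # any writable text stream, e.g. an open file
        self._serialize = serialize  # function: message dict -> str

    def __call__(self, message):
        # Write one serialized message per line and flush immediately.
        self._stream.write(self._serialize(message))
        self._stream.write("\n")
        self._stream.flush()


def to_text(message):
    # Stand-in serializer (assumption): JSON, falling back to str() for
    # values JSON cannot represent. A YAML dumper could be used instead.
    return json.dumps(message, default=str, sort_keys=True)


# Demonstration with an in-memory stream instead of a real file.
buffer = io.StringIO()
destination = FileDestination(buffer, to_text)
destination({"message_type": "demo", "value": complex(1, 2)})

# With Eliot installed, the callable would be registered along the lines of:
#   import eliot
#   eliot.add_destinations(destination)
```

Note that output written in a custom format may not be readable by tools that expect Eliot's JSON output.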

marwan116 commented 4 years ago

"YAML doesn't magically enable Pandas DataFrames. The default Python YAML library will (de)serialize arbitrary objects, but that's insecure, at least for deserialization (the safe_* variants won't do that for that reason). So I recommend against it."

Agreed. I usually use the yamlable library to wrap any object that is meant to be serialized by YAML. However, one can argue that when all YAML objects are created locally by the user, this security issue is less of a concern for deserialization.

(Re yamlable: https://smarie.github.io/python-yamlable/ Most of my use cases involving pandas are a class that makes use of a pandas DataFrame, or extends one.)

  1. "Eliot does have pluggable serializers for the JSON destination (it's how it serializes NumPy to JSON). I've already considered adding support for Pandas, so I will try to do that sometime soon." That's great to hear.

  2. "You can also plug in your own serialization system by adding a custom destination, an arbitrary function that can do anything it wants with logged messages: https://eliot.readthedocs.io/en/stable/outputting/output.html#configuring-logging-output You could write one that opens a file and writes out YAML if you wish." Thank you so much for this suggestion. I will attempt to create a custom destination then.

marwan116 commented 4 years ago

Sorry, I recognize this is probably a question better raised on eliot-tree, but if one uses a custom destination to write a YAML file, would eliot-tree also accept a custom deserializer?

itamarst commented 4 years ago

Not sure, it's a different maintainer. FWIW I suggest option #1 is better: it'll Just Work with eliot-tree, and it's not very hard to do. Here's what the NumPy code looks like: https://github.com/itamarst/eliot/blob/master/eliot/json.py#L15

You'd just need to add another if statement or two there that converts a DataFrame/Series to Python objects.
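
A rough sketch of what such an encoder could look like, under stated assumptions: DataFrameAwareEncoder is a hypothetical name, and it subclasses the standard library's json.JSONEncoder so the example runs without Eliot. In practice one would more likely subclass EliotJSONEncoder and pass it as the encoder argument to to_file, as mentioned earlier in the thread. The conversion goes through pandas' own to_json, which is one possible choice rather than the maintainer's stated approach.

```python
import json


class DataFrameAwareEncoder(json.JSONEncoder):
    """Hypothetical encoder (not part of Eliot) that handles pandas objects."""

    def default(self, o):
        # Import lazily so the encoder still works when pandas is absent.
        try:
            import pandas as pd
        except ImportError:
            pd = None

        if pd is not None and isinstance(o, (pd.DataFrame, pd.Series)):
            # Let pandas produce JSON text, then parse it back into plain
            # Python objects (dicts, lists, numbers, strings).
            return json.loads(o.to_json(orient="split"))

        # Anything else falls through to the standard behaviour (TypeError).
        return super().default(o)


# Usage, assuming pandas is installed:
#   import pandas as pd
#   frame = pd.DataFrame({"x": [1, 2], "y": [0.5, 1.5]})
#   json.dumps({"frame": frame}, cls=DataFrameAwareEncoder)
```

Serializing whole DataFrames can make log messages very large, so logging a summary (shape, column names, a few rows) may be preferable for big tables.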