hackalog / easydata

A flexible template for doing reproducible data science in Python.
MIT License

Update to Easydata 2 #225

Closed. hackalog closed this 2 years ago

hackalog commented 3 years ago

Removed many generic targets from the Makefile (fetch_*, unpack_*, transform_*, process_*). The Makefile now supports the much clearer make datasets (make data) and make datasources (make raw).

Introduced notebooks-as-transformers. This is super cool. So long as it creates a DataSource object in the correct location, a notebook can be used as a transformer in the DatasetGraph. See helpers.notebook_as_transformer()

Completely changed the catalog serialization to be more git-friendly, introducing the Catalog object. A Catalog is a serializable, disk-backed, git-friendly, dict-like object for storing a data catalog.
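To give a feel for what "disk-backed, git-friendly, dict-like" can mean in practice, here is a minimal sketch. It is not the Easydata Catalog implementation: the class name JSONDirCatalog, the one-JSON-file-per-key layout, and the example URL are all assumptions made for illustration.

```python
import json
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class JSONDirCatalog(MutableMapping):
    """Hypothetical dict-like catalog: one JSON file per key in a directory."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)

    def _file(self, key):
        return self.path / f"{key}.json"

    def __getitem__(self, key):
        try:
            return json.loads(self._file(key).read_text())
        except FileNotFoundError:
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        # Sorted keys and fixed indentation keep the on-disk text stable,
        # so changing one entry produces a small, readable diff.
        text = json.dumps(value, indent=2, sort_keys=True) + "\n"
        self._file(key).write_text(text)

    def __delitem__(self, key):
        try:
            self._file(key).unlink()
        except FileNotFoundError:
            raise KeyError(key) from None

    def __iter__(self):
        return (p.stem for p in sorted(self.path.glob("*.json")))

    def __len__(self):
        return sum(1 for _ in self.path.glob("*.json"))


def demo_catalog():
    """Write an entry, reopen the directory, and read it back as a dict."""
    with tempfile.TemporaryDirectory() as tmp:
        catalog = JSONDirCatalog(Path(tmp) / "datasets")
        catalog["iris"] = {"url": "https://example.com/iris.csv", "hash": None}
        reopened = JSONDirCatalog(Path(tmp) / "datasets")
        return dict(reopened)
```

Storing each entry in its own file means two branches that add different entries touch different files, which is one way to avoid merge conflicts in a shared catalog.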

Changes to file layout:

Updated the sample notebooks and framework documentation to use the new APIs.

Introduced easydata-specific exceptions:

Removed most of src.workflow, which was a temporary way to paper over API issues. Some of it (the dataset creation helper functions) moved to src.helpers.

Renamed TransformerGraph -> DatasetGraph. Since Datasets are the "nodes" of this hypergraph, it's a more natural way to talk about it.

Cleaned up Dataset generation in the DatasetGraph. In some cases, a dataset needed to be generated twice. This is now fixed.
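As a rough illustration of the "generate each dataset only once" idea, the sketch below walks a small dependency graph with a cache. It is not the DatasetGraph code: the generate function, the graph layout (name mapped to parents and a transformer), and the toy datasets are assumptions for this example only.

```python
def generate(name, graph, cache):
    """Build `name` after its parents, running each transformer at most once."""
    if name not in cache:
        parents, transform = graph[name]
        # A transformer may take several input datasets, which is what
        # makes the structure a hypergraph rather than a plain graph.
        inputs = [generate(parent, graph, cache) for parent in parents]
        cache[name] = transform(*inputs)
    return cache[name]


calls = []  # records how often the shared upstream dataset is built


def make_raw():
    calls.append("raw")
    return [1, 2, 3]


# Hypothetical graph: dataset name -> (parent names, transformer function)
graph = {
    "raw": ((), make_raw),
    "doubled": (("raw",), lambda raw: [2 * x for x in raw]),
    "squared": (("raw",), lambda raw: [x * x for x in raw]),
    "combined": (("doubled", "squared"), lambda d, s: list(zip(d, s))),
}

result = generate("combined", graph, {})
```

Here "raw" feeds two downstream datasets, but the cache ensures make_raw runs a single time while "combined" is being built.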

Deprecated most of the bare methods in src.data. These are now exposed via the Dataset, DatasetGraph, DataSource, and Catalog objects. See the API Changes blog posts for details.

Virtually all the force options in method calls have been renamed. Confusion over the meanings of these flags was a rich source of bugs.

Renamed: create_transformer_pipeline -> serialize_transformer_pipeline

Removed src.log.debug, as it did not work as intended. Set the LOGLEVEL environment variable instead.
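For reference, driving the log level from an environment variable can be done with the standard library alone. This is a sketch, not the code in src.log: the configure_logging function, the logger name, and the fallback to INFO are assumptions.

```python
import logging
import os


def configure_logging(default="INFO"):
    """Return a logger whose level comes from the LOGLEVEL environment variable."""
    level_name = os.environ.get("LOGLEVEL", default).upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        # Unknown level names fall back to INFO rather than raising.
        level = logging.INFO
    logger = logging.getLogger("easydata_example")  # hypothetical logger name
    logger.setLevel(level)
    return logger
```

With something like this in place, running a command with LOGLEVEL=DEBUG set in the shell turns on debug output without any code changes.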

Added a "symlink" unpack method to DataSource objects.
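The idea behind a symlink "unpack" is to expose a raw file in the working directory without copying it, which helps with large files. The sketch below only illustrates that idea. The function unpack_by_symlink and the directory names are hypothetical and are not the DataSource API. Creating symlinks may also need extra privileges on Windows.

```python
import tempfile
from pathlib import Path


def unpack_by_symlink(src, dest_dir):
    """Link `src` into `dest_dir` instead of copying or extracting it."""
    src = Path(src).resolve()
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    link = dest_dir / src.name
    if link.is_symlink():
        link.unlink()  # replace a stale link from an earlier run
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(src)
    return link


def demo_symlink():
    """Create a raw file, link it into another directory, read it via the link."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw" / "big.csv"
        raw.parent.mkdir()
        raw.write_text("a,b\n1,2\n")
        link = unpack_by_symlink(raw, Path(tmp) / "interim")
        return link.is_symlink(), link.read_text()
```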

Restructured src.utils; e.g.

Todo: