Update to Easydata 2

Removed many generic targets from the makefile (fetch_*, unpack_*, transform_*, process_*). The makefile now supports the much clearer make datasets (make data) and make datasources (make raw).

Introduced notebooks-as-transformers. This is super cool. As long as a notebook creates a DataSource object in the correct location, it can be used as a transformer in the DatasetGraph. See helpers.notebook_as_transformer().

Completely changed the catalog serialization to be more git-friendly, introducing the Catalog object. A Catalog is a serializable, disk-backed, git-friendly, dict-like object for storing a data catalog:

- serializable means anything stored in the catalog must be serializable to/from JSON,
- disk-backed means all changes are reflected immediately in the on-disk serialization,
- git-friendly means this on-disk format can be easily maintained in a git repo (with minimal issues around merge conflicts), and
- dict-like means that, programmatically, it acts like a Python dict.

See the Catalog blog post for more.
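
The four properties above can be illustrated with a small sketch. This is not Easydata's Catalog implementation; the class name DirCatalog and the one-JSON-file-per-key layout are assumptions chosen only to show how a dict-like object can be disk-backed and reasonably merge-friendly.

```python
import json
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class DirCatalog(MutableMapping):
    """Hypothetical dict-like catalog: one JSON file per key in a directory.

    Keys are assumed to be valid file names (no path separators).
    """

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)

    def _file(self, key):
        return self.path / f"{key}.json"

    def __getitem__(self, key):
        try:
            return json.loads(self._file(key).read_text())
        except FileNotFoundError:
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        # json.dumps raises TypeError for values that are not JSON-serializable,
        # so nothing is written unless the value can round-trip.
        text = json.dumps(value, indent=2, sort_keys=True)
        # Written immediately: the directory always reflects the current state.
        self._file(key).write_text(text + "\n")

    def __delitem__(self, key):
        try:
            self._file(key).unlink()
        except FileNotFoundError:
            raise KeyError(key) from None

    def __iter__(self):
        return (p.stem for p in sorted(self.path.glob("*.json")))

    def __len__(self):
        return sum(1 for _ in self.path.glob("*.json"))


# Usage: a second object pointed at the same directory sees the same entries.
catalog_dir = tempfile.mkdtemp()
catalog = DirCatalog(catalog_dir)
catalog["iris"] = {"license": "CC0", "rows": 150}
reopened = DirCatalog(catalog_dir)
```

Storing each entry in its own file with sorted keys and stable indentation means two branches that add different entries touch different files, which is one way to keep merge conflicts rare.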

Changes to file layout:

Renamed references->reference. This is for data dictionaries, documentation, manuals, scripts, papers, or other explanatory materials. In particular:
- reference/easydata: Easydata framework and workflow documentation. Formerly framework-docs
- reference/templates: Templates and code snippets for Jupyter
- reference/dataset: resources related to datasets; e.g. dataset creation notebooks and scripts

New entries to src.paths:
- cache_path (Default: data/interim/cache)
- notebook_path (Default: notebooks)
- output_path (Default: reports)
- figures_path (Default: reports/figures)
- template_path (Default: reference/templates)
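
As a rough illustration of how these defaults relate to a project root, here is a sketch that resolves them with pathlib. The names DEFAULT_PATHS and resolve_paths are hypothetical and are not part of src.paths; only the entry names and default values come from the list above.

```python
from pathlib import Path

# Entry names and defaults copied from the list above.
# The dict itself is a hypothetical stand-in, not the src.paths API.
DEFAULT_PATHS = {
    "cache_path": "data/interim/cache",
    "notebook_path": "notebooks",
    "output_path": "reports",
    "figures_path": "reports/figures",
    "template_path": "reference/templates",
}


def resolve_paths(project_root, overrides=None):
    """Return each named path joined onto project_root, applying overrides."""
    root = Path(project_root)
    merged = {**DEFAULT_PATHS, **(overrides or {})}
    return {name: root / relative for name, relative in merged.items()}


paths = resolve_paths("/tmp/project", overrides={"output_path": "out"})
```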

Updated the sample notebooks and framework documentation to use the new APIs.

Introduced easydata-specific exceptions:

- EasydataError: base for all other exceptions
- ValidationError: hash check failed
- ObjectCollision: object already exists in object store (more general than a FileExistsError)
- NotFoundError: object not found in object store (more general than a FileNotFoundError)
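
A minimal sketch of such a hierarchy follows. The class names and their meanings come from the list above; whether the real classes also inherit from built-in exceptions is not stated here, so this sketch derives everything from EasydataError only. The verify and outcome helpers are hypothetical and exist just to show a caller catching the base class.

```python
import hashlib


class EasydataError(Exception):
    """Base for all other exceptions."""


class ValidationError(EasydataError):
    """Hash check failed."""


class ObjectCollision(EasydataError):
    """Object already exists in object store."""


class NotFoundError(EasydataError):
    """Object not found in object store."""


def verify(data: bytes, expected_hex: str) -> None:
    """Hypothetical helper: raise ValidationError on a SHA-256 mismatch."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_hex:
        raise ValidationError(f"expected {expected_hex}, got {actual}")


def outcome(data: bytes, expected_hex: str) -> str:
    """Catch the base class so any framework-specific error is handled."""
    try:
        verify(data, expected_hex)
    except EasydataError as err:
        return type(err).__name__
    return "ok"
```

Catching EasydataError lets calling code separate framework failures from unrelated errors without listing every subclass.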

Removed most of src.workflow, which was a temporary way to paper over API issues. Some of it moved to src.helpers (dataset creation helper functions).

Renamed TransformerGraph -> DatasetGraph. Since Datasets are the "nodes" of this hypergraph, it's a more natural way to talk about it.

Cleaned up Dataset generation in the DatasetGraph. In some cases, a dataset needed to be generated twice. This is now fixed.

Deprecated most of the bare methods in src.data. These are now exposed via the Dataset, DatasetGraph, DataSource, and Catalog objects. See the API Changes blog posts for details.

Virtually all the force options in method calls have been renamed. Confusion over the meanings of these flags was a rich source of bugs.

Renamed: create_transformer_pipeline -> serialize_transformer_pipeline

Removed src.log.debug, as it did not work as intended. Set the LOGLEVEL environment variable instead.
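
For readers unfamiliar with the pattern, a logger can pick up its level from the environment roughly as follows. This is a sketch using only the standard library, not the code in src.log; the function name make_logger, the INFO fallback, and the handling of unrecognized values are assumptions.

```python
import logging
import os


def make_logger(name="easydata-sketch"):
    """Return a logger whose level is taken from the LOGLEVEL variable."""
    level_name = os.environ.get("LOGLEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        # Unrecognized value: fall back to INFO (an assumption of this sketch).
        level = logging.INFO
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger
```

With this approach, running a command as LOGLEVEL=DEBUG make datasets would enable debug output for that invocation only, with no code change.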

Added a "symlink" unpack method to DataSource objects.
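
Conceptually, a symlink-style unpack links the fetched file into the destination directory instead of copying or extracting it. The sketch below shows that idea with pathlib; the function name unpack_symlink and its arguments are hypothetical and do not reflect the DataSource API. It assumes a platform where the process may create symlinks.

```python
from pathlib import Path


def unpack_symlink(source, dest_dir):
    """Link source into dest_dir under the same file name and return the link."""
    source = Path(source).resolve()
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    link = dest_dir / source.name
    if link.is_symlink():
        # Replace a stale link so the call can be repeated safely.
        link.unlink()
    link.symlink_to(source)
    return link
```

Linking avoids duplicating large raw files on disk, at the cost of the destination depending on the source staying in place.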

Restructured src.utils; e.g.:

- Added ipnbname functions to determine the notebook name (when a Jupyter kernel is running)
- Added run_notebook wrapper

Todo:

Replace datset-test.json with one with a better license.

hackalog / easydata

Update to Easydata 2 #225