Removed many generic targets from the makefile (fetch_*, unpack_*, transform_*, process_*).
The makefile now supports the much clearer make datasets (make data) and make datasources (make raw).
Introduced notebooks-as-transformers. This is super cool. So long as a notebook creates a DataSource object in the correct location, it can be used as a transformer in the DatasetGraph. See helpers.notebook_as_transformer().
Completely changed the catalog serialization to be more git-friendly, introducing the Catalog object.
A Catalog is a serializable, disk-backed, git-friendly, dict-like object for storing a data catalog:
serializable means anything stored in the catalog must be serializable to/from JSON,
disk-backed means all changes are reflected immediately in the on-disk serialization,
git-friendly means this on-disk format can be easily maintained in a git repo (with minimal issues around merge conflicts), and
dict-like means that, programmatically, it acts like a Python dict.
See the Catalog blog post for more.
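To make the four properties concrete, here is a minimal sketch of a disk-backed, dict-like store. It is not the Easydata Catalog implementation: the class name MiniCatalog and the one-JSON-file-per-key layout are assumptions chosen for illustration (separate files with sorted keys tend to produce small, readable diffs), and keys are assumed to be simple file-safe strings.

```python
import json
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class MiniCatalog(MutableMapping):
    """Hypothetical dict-like catalog; each entry lives in its own JSON file."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _file(self, key):
        # Assumption: keys are plain strings that are safe to use as file names.
        return self.directory / f"{key}.json"

    def __getitem__(self, key):
        try:
            return json.loads(self._file(key).read_text())
        except FileNotFoundError:
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        # Serialize before touching the disk, so a non-JSON value writes nothing.
        text = json.dumps(value, indent=2, sort_keys=True)
        self._file(key).write_text(text + "\n")

    def __delitem__(self, key):
        try:
            self._file(key).unlink()
        except FileNotFoundError:
            raise KeyError(key) from None

    def __iter__(self):
        return (p.stem for p in sorted(self.directory.glob("*.json")))

    def __len__(self):
        return sum(1 for _ in self.directory.glob("*.json"))


with tempfile.TemporaryDirectory() as tmp:
    catalog = MiniCatalog(tmp)
    catalog["iris"] = {"url": "https://example.com/iris.csv", "rows": 150}
    # A second object pointed at the same directory sees the change immediately.
    reopened = MiniCatalog(tmp)
    loaded = dict(reopened)
```

Because every assignment is written straight to disk, there is no separate save step, and anything that cannot be converted to JSON is rejected at assignment time.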
Changes to file layout:
Renamed references->reference. This is for data dictionaries, documentation, manuals, scripts, papers, or other explanatory materials. In particular:
reference/easydata: Easydata framework and workflow documentation. Formerly framework-docs
reference/templates: Templates and code snippets for Jupyter
reference/dataset: resources related to datasets; e.g. dataset creation notebooks and scripts
Introduced easydata-specific exceptions:
ObjectCollision: object already exists in object store (more general than a FileExistsError)
NotFoundError: object not found in object store (more general than a FileNotFoundError)
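As a rough illustration of how these two exceptions might be raised and handled, the sketch below defines them locally and uses them in a toy in-memory store. The base class (Exception), the ToyStore class, and the overwrite flag are all assumptions made for this example, not the Easydata API.

```python
class ObjectCollision(Exception):
    """Object already exists in the object store."""


class NotFoundError(Exception):
    """Object not found in the object store."""


class ToyStore:
    """Hypothetical in-memory object store, used only to show the exceptions."""

    def __init__(self):
        self._objects = {}

    def add(self, name, obj, overwrite=False):
        # 'overwrite' is an illustrative flag name, not taken from Easydata.
        if name in self._objects and not overwrite:
            raise ObjectCollision(f"{name!r} already exists")
        self._objects[name] = obj

    def get(self, name):
        try:
            return self._objects[name]
        except KeyError:
            raise NotFoundError(f"{name!r} not found") from None


store = ToyStore()
store.add("raw-data", [1, 2, 3])
try:
    store.add("raw-data", [4, 5, 6])
except ObjectCollision as err:
    message = str(err)
```

The point of the more general names is that the same exception applies whether the object is a file, a catalog entry, or something else held in a store.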
Removed most of src.workflow, which was a temporary way to paper over API issues.
Some of it moved to src.helpers (dataset creation helper functions).
Renamed TransformerGraph -> DatasetGraph. Since Datasets are the "nodes" of this hypergraph,
it's a more natural way to talk about it.
Cleaned up Dataset generation in the DatasetGraph. In some cases, a dataset needed to be generated twice. This is now fixed.
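One way to picture the datasets-as-nodes view, and how caching avoids generating a dataset twice, is the toy graph below. It is only a sketch under simplifying assumptions: ToyDatasetGraph and its methods are hypothetical names, each transformer edge has exactly one output, and results are cached in memory rather than on disk.

```python
class ToyDatasetGraph:
    """Hypothetical graph: datasets are nodes, transformers are edges."""

    def __init__(self):
        self.edges = {}  # output dataset name -> (input names, transformer)
        self.cache = {}  # dataset name -> generated value
        self.calls = []  # order in which datasets were actually generated

    def add_edge(self, output, inputs, transformer):
        self.edges[output] = (tuple(inputs), transformer)

    def generate(self, name):
        # Reuse a dataset that has already been generated.
        if name in self.cache:
            return self.cache[name]
        inputs, transformer = self.edges[name]
        values = [self.generate(i) for i in inputs]
        self.calls.append(name)
        self.cache[name] = transformer(*values)
        return self.cache[name]


graph = ToyDatasetGraph()
graph.add_edge("raw", [], lambda: [3, 1, 2])
graph.add_edge("sorted", ["raw"], lambda raw: sorted(raw))
graph.add_edge("doubled", ["raw"], lambda raw: [2 * x for x in raw])
graph.add_edge("report", ["sorted", "doubled"],
               lambda s, d: {"sorted": s, "doubled": d})

report = graph.generate("report")
```

Here "raw" feeds two downstream datasets but appears only once in graph.calls, because the second request is served from the cache.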
Deprecated most of the bare methods in src.data. These are now exposed via the Dataset, DatasetGraph, DataSource, and Catalog objects. See the API Changes blog posts for details.
Virtually all the force options in method calls have been renamed. Confusion over the meanings
of these flags was a rich source of bugs.
New entries to src.paths:
cache_path (Default: data/interim/cache)
notebook_path (Default: notebooks)
output_path (Default: reports)
figures_path (Default: reports/figures)
template_path (Default: reference/templates)
Updated the sample notebooks and framework documentation to use the new APIs.
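The src.paths defaults listed above can be thought of as project-relative paths that are resolved against a project root. The sketch below shows that idea only; the DEFAULT_PATHS dict and the resolve_paths function are hypothetical and do not reflect how src.paths is actually implemented.

```python
from pathlib import Path

# Defaults as listed in the changelog, relative to the project root.
DEFAULT_PATHS = {
    "cache_path": "data/interim/cache",
    "notebook_path": "notebooks",
    "output_path": "reports",
    "figures_path": "reports/figures",
    "template_path": "reference/templates",
}


def resolve_paths(project_root, overrides=None):
    """Return a Path for each entry, applying any overrides to the defaults."""
    root = Path(project_root)
    merged = {**DEFAULT_PATHS, **(overrides or {})}
    return {name: root / rel for name, rel in merged.items()}


paths = resolve_paths("/tmp/project", {"output_path": "out"})
```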
Renamed: create_transformer_pipeline -> serialize_transformer_pipeline
Removed src.log.debug, as it did not work as intended. Set the LOGLEVEL environment variable instead.
Added a "symlink" unpack method to DataSource objects.
Restructured src.utils; e.g.
ipnbname functions to determine notebook name (when Jupyter kernel is running)
run_notebook wrapper
Todo: