Improve API for run_recipe

semio / ddf_utils

Utilities for working with DDF datasets

https://open-numbers.github.io/

MIT License

2 stars 1 forks source link

Improve API for run_recipe #129

Open jheeffer opened 4 years ago

jheeffer commented 4 years ago

Let's think of how to redesign the API of run_recipe. This is still very work-in-progress ; )

Goal is a user friendly API that allows you to:

use a git repo as a dataset, local or remote
define a branch/commit for such git repo
use the remote head of a branch, through local repo to minimize network traffic. Maybe fetch & checkout to other folder. Or pull if it's fine to change local repo.
allow build from a local, private build environment, that's safe from interference in both ways (e.g. other process changes files during build (e.g. git checkout), or not wanting to pull/checkout a shared local git repo to get files you need for commit)

ddf run_recipe -o <output_path> --ddf_dir <dataset_path> -i <recipe_path> --show-tree -d -p --private --private_path <private_path> --remote-head

builds dataset using recipe and other base datasets

-i <recipe_path> defaults to ./etl/recipes/etl.yml or ./etl/recipes/etl.yaml -o <output_path> defaults to ./ --ddf_dir <dataset_path> defaults to ./../../ --private [<private_path>] use private build environment. Copy any base datasets first to private folder --private_path defaults to ./etl/source/ maybe? --remote-head use the head of the origin remote branch for any local git repo dataset. git pull or when private git fetch && export to different folder

This allows being in a dataset's root and just running ddf run_recipe to build the dataset

Base dataset specifier format, used in recipes. Based on pip install. We don't have an index yet, so options are either a local path or git url. If either is relative, it's relative to <dataset_path>

paths:

./open-numbers/ddf--gapminder--wdi
justonefolder

urls (get resources from ./datapackage.json, then download, though does not allow download of assets, etl folder, etc):

https://raw.githubusercontent.com/open-numbers/ddf--gapminder--co2_emission/autogenerated/
http://example.com/dataset/

git urls (git+ allows extending functionality to other applications later. git protocols)

git+ssh://github.com/open-numbers/ddf--gapminder--wdi
git+git@github.com:open-numbers/ddf--gapminder--wdi (scp style syntax for ssh)
git+https://github.com/open-numbers/ddf--gapminder--wdi
git+file:///home/user/git/open-numbers/ddf--gapminder--wdi
git+open-numbers/ddf--gapminder--wdi

Adding a branch name, a commit hash, a tag name or a git ref is possible like so:

git+https://github.com/open-numbers/ddf--gapminder--wdi@master
git+https://github.com/open-numbers/ddf--gapminder--wdi@v1.0
git+https://github.com/open-numbers/ddf--gapminder--wdi@da39a3ee5e6b4b0d3255bfef95601890afd80709
git+https://github.com/open-numbers/ddf--gapminder--wdi@refs/pull/123/head
git+open-numbers/ddf--gapminder--wdi@master

Maybe we don't need --private if we define the base branches/commits in the recipe? I don't recall the exact problems we had in datakitchen that lead us to thinking we'd need it. Should look that up in chat logs.

semio commented 4 years ago

We can make dataset_path as a list, just like the PATH enviornment variable, chef will search dataset names in these paths.

then if chef needs to download a dataset from internet, it download it to some path (etl/source or just a temporary path) and add it to the top of dataset_path to make the dataset available

If I understand correctly, without --private chef will directly modify base datasets, which seems dangerous to me.. I suggest that, if we don't specify version/commit for a local dir, we load it directly, if we specify version/commit, we do checkout/export with or without --private. So --private only affects the base datasets without a commit hash

semio commented 4 years ago

Another suggestion: the output dir can default to etl/build, possibly adding a subfolder with time stamp, to not mix output with old files

jheeffer commented 4 years ago

Feature: Local git repos which `pull/fetch` from remote before build

Two options:

1. Relative paths in recipe

e.g. git+open-numbers/ddf--gapminder--co2 --remote-head use remote branch heads instead of local ones (pull or fetch && export before build)

+ path to repo is unambiguous and explicitly stated in recipe + allows building from index rather than commit when omitted (i.e. allows build from uncommitted changes) - Another user's local repo may have a different remote and thus builds can differ per machine, depending on repo config - Another user's local repo on that path may be a completely different dataset and thus build fails - The path in recipe should be used by all users of that recipe - Different recipes may use different paths for the same dataset, leading to redundant dataset storage

2. Remote urls in recipe

e.g. git+ssh://github.com/open-numbers/ddf--gapminder--co2 --local-vcs-proxy find local repos to use as proxy for the remote vcs (git) repos. If not found clone. Then pull or fetch && export. How to find local repo based on remote url? Either look for all local repos and compare remote url (but what if multiple results?), or create a path out of remote url: github.com/open-numbers/ddf--dataset => ./github.com/open-numbers/ddf--dataset/ using urlparse and tldextract

+ recipe refers to a global url, one source of truth for all + works even if datasets are not yet locally available - finding local repos either forces folder structure (remote url -> path) or can be ambiguous (find repo by remote url) - can't build from uncommitted changes unless we add another option (maybe --no-remote-head. Or default is --no-remote-head and have to add --remote-head when you want to fetch from remote first. Then it turns into a cache instead of a proxy. Maybe use cache DSL: --local-vcs-cache and --cache-revalidate. Though --cache-revalidate doesn't make sense for relative path repos which you want to pull.)

Maybe we should support both options, but prefer remote urls in recipes.

jheeffer commented 4 years ago

Another suggestion: the output dir can default to etl/build, possibly adding a subfolder with time stamp, to not mix output with old files

That would mean you have to copy paste from build folder to root before commit and manually delete old files? ~~I like the way it is now: remove old dataset and replace with new one. Or does it currently not clean and instead just overwrites files? Meaning output can be mixed with old files?~~ It doesn't You can of course set -o to whatever you want. But your suggestion is the default of -o is ./etl/build/[%timestamp%] instead of ./, then?

We can make dataset_path as a list, just like the PATH enviornment variable, chef will search dataset names in these paths.

Sure, that's possible. Sounds like --find_links in python install, which is a repeatable option. Have you had situations where you need multiple dataset paths?

then if chef needs to download a dataset from internet, it download it to some path (etl/source or just a temporary path) and add it to the top of dataset_path to make the dataset available

Sure, though that seems an implementation detail, not API design?

If I understand correctly, without --private chef will directly modify base datasets, which seems dangerous to me..

Yeah definitely dangerous. Can lead to all kinds of problems like the pull's merge not finishing (conflicts or plain error). And if you have changes in your working tree, git pull does not reset them. So I guess you'd also need a git reset --hard, which can be unwelcomely destructive ;).

I suggest that, if we don't specify version/commit for a local dir, we load it directly, if we specify version/commit, we do checkout/export with or without --private. So --private only affects the base datasets without a commit hash

I agree we should be able to build from the current state of the working tree, not just from a commit. Only building from commits makes recipe development unnecessarily hard. I agree the absence of a specifier (branch/tag/ref/hash) can mean the repo as-is. The other option is it means the HEAD of the repo, i.e. ignoring uncommitted changes. But, I'd say you should be able to do so éven if a branch, tag or ref is specified. Maybe even if a commit is specified? Systema Globalis would be based on ddf--open_numbers@master, but I want to try my SG build before committing my changes to ddf--open_numbers, or without having to remove the specifier in the recipe while developing (and then forgetting putting it back). Maybe an option (--vcs-use-working-dir) to use working tree of repos as-is?

I think it's good to have --private as a default then, to prevent errors. Even if you use the working-dir as-is, a copy-paste may be good. For consistency and to prevent changes during build interfering. But I feel you should be able use datasets as they are if you know what you're doing. Private mode takes a lot of time and disk space. Option --no-private: hard reset vcs datasets (or use in-place if --vcs-use-working-dir) and use non-vcs datasets in-place?

What if dataset_path is a subfolder of the dataset root, e.g. the default ./etl/source/? It feels redundant to then still copy paste datasets to another temp folder. On the one hand we can add a rule to not use temp in this case for efficiency's sake. On the other.. adds "magic" and complexity.

Anyway, work in progress, let this marinade : )

jheeffer commented 4 years ago

We also had the idea to build a cache where we have a folder per commit hash. Refs (branch/tags) would be soft links to these folders. Would prevent errors by multiple builds using same shared repo, but still safe time & memory by datasets not having to export all their dependencies to their own temp folder before each build.

But invalidation/purging we didn't figure out yet.

jheeffer commented 4 years ago

I talked with Angie about this. Two points he made:

Allow custom mapping from git url to local folder

He has his own folder structure for datasets but still wants to benefit from using the local git repos as a proxy. To allow for this we could have a global configuration for the ddf cli, which maps git urls to local folders. If ddf can't find a dataset it can tell the user ddf can download it for them or where to download it manually and edit this mapping file. Should be worked out more probably.

This could be a nice feature, but not for a first version.

Build from shared but don't reset implicitly

He would prefer to save space and not copy/paste by default. But he would never want a build to reset his git working tree unexpectedly. So he proposed that when building through local git with uncommitted changes, don't build and show warning to use either --vcs-use-working-dir flag or --private.

This is a good option to consider.

semio commented 4 years ago

We can make dataset_path as a list, just like the PATH enviornment variable, chef will search dataset names in these paths.

Sure, that's possible. Sounds like --find_links in python install, which is a repeatable option. Have you had situations where you need multiple dataset paths?

I haven't had such situations, but I meant that we now have 3 ways to access a dataset:

we download a version of dataset, without a local git repo
we have a local dataset repo, and load it as-is
we checkout a commit from a local repo
(also, people might have different folder structures like Angie does)

Different ways mean different place to store the datasets. So I think we need to support multiple dataset paths.

You can of course set -o to whatever you want. But your suggestion is the default of -o is ./etl/build/[%timestamp%] instead of ./, then?

I was thinking about that the chef and recipe is just like a build system, it is worth to check out existing systems (for example, cargo for Rust and go for Go). Usually bulid systems will have a build command which builds and create intermediate/final artifacts and an install command that installs final artifacts to target dir. That's why I think we should put the output to a build folder.

However I now think our situation is different, as our target dir is under git control (other build systems I checked install to folders like /usr/local/bin), it's easy to roll back. So it's ok to put default to ./.

We will need a build folder to store intermediate outputs if we need to support incremental builds.

Maybe an option (--vcs-use-working-dir) to use working tree of repos as-is?

agreed.

I think it's good to have --private as a default then, to prevent errors. Even if you use the working-dir as-is, a copy-paste may be good.

agreed.

Build from shared but don't reset implicitly

He would prefer to save space and not copy/paste by default. But he would never want a build to reset his git working tree unexpectedly. So he proposed that when building through local git with uncommitted changes, don't build and show warning to use either --vcs-use-working-dir flag or --private

or suggest user to make a commit first.

What if dataset_path is a subfolder of the dataset root, e.g. the default ./etl/source/?

I guess this question should be left to the user, if no copy/paste is needed he can use --no-private

semio commented 4 years ago

Option to output a zipped file containing the dataset, excluding the etl folder

We can upload custom assets when creating a github release, so we can further decrease the network/disk usage by removing the etl folder from release

semio commented 4 years ago

Utilities to manage dependencies

To solve the issue of local paths and recipe dependencies mapping, I suggest to add a datapackages.lock file which contains additional informations about local repos/datasets and this file won't be commit to github as it contains local information

datapackages.lock is a list of objects with following info:

name of dependency
git url
local path
revision

And ddf run_recipe will load dependencies based on datapackages.lock. User can exam and modify datapackages.lock to make sure the dependencies and local dir mapping is correct.

Also adding following commands might be good:

ddf resolve: check if all datasets listed in recipe are available, if not, clone/download it, and then create the datapackages.lock
ddf get: clone/download a dataset, and add it to datapackages.lock
ddf update: update dependencies based on datapackages.lock. If it's a local repo, do git pull, if not, download the target version.

semio commented 4 years ago

In https://github.com/semio/ddf_utils/issues/129#issuecomment-691116208

Maybe we should support both options, but prefer remote urls in recipes.

I am not sure why prefer remote urls in recipe, when we do programming we usually import things with its name right?

semio commented 3 years ago

some notes on how to export with git

here are a few options I tried:

git archive:
- + output not including the .git folder
- + output specific commit without changing the index
- - only output zipped format, thus hurts performance
git checkout-index:
- + output not including the .git folder
- - can only output the index, so can not output other commits without git checkout
git worktree add
- - output specific commit without changing the index
- - output includes a .git file,
- + .git is just a small text file
- - if we remove .git file from the new worktree, we would still see the new worktree in origin repo's worktree list. (need to use git worktree prune to cleanup)
git clone --shared:
- + output specific commit without changing the index
- - output includes the .git folder,
- + the objects are symlinks to the origin repo. So the size of .git folder is very small
- - can only export the branches already checked out once.

I think the worktree one (add a new worktree, then remove .git from new worktree folder and prune in origin repo) is the best option, because it don't require zip/unzip and don't require the target tag/commit to be checked out

All of above can't export the repo as-is, i.e. won't work when there are uncommitted changes. But instead of using git we can just copy all files, so it's not a problem

semio commented 3 years ago

I have added some experimental features in recipe_build_tools branch:

`ddf get` and `ddf install`:

Please check this recording for a demo. https://asciinema.org/a/CYD8QrD3u93uYwhdRhd1skfd7

Notes:

Repositories can be in anywhere, not limited to $DATASET_DIR/repo
the pseudo version string consists of
1. the most recent tag before the commit. If no tags before the commit, it will be v0.0.0
2. the time of the commit
3. the hash of the commit
So, given a commit hash, the pseudo version is a fixed string
If the repo is not under version control, the pseudo version number will be v0.0.0-{current_time}+no_vcs

The ideas mostly came from golang's package management solution, such as https://golang.org/doc/gopath_code.html#GOPATH and https://golang.org/ref/mod#versions

changes in Chef

run_recipe default options: -i etl/recipes/etl.yaml and -o ./. So now you can run ddf run_recipe in the project root.
it reads datasets from $DATASET_DIR/pkg/
- you can now set version in ingredients to use a specific version of a dataset
- if you want to use the repo as is, set the version to @latest
- if you don't set version, the latest installed version (sorted by commit time, excluding the @latest version) will be use
(automatically install the target version ~is not implemented yet ,but should be easy to add~ implemented)

with this we have some basic traceability now, we know which version of dependencies are used in the build.

semio commented 3 years ago

let's see what did we achieved with the latest commit (69661843262876ca98af49c4ef8490d2780ec05c)

In https://github.com/semio/ddf_utils/issues/129#issuecomment-690835612

Relative paths in recipe

supported in recipe and we don't need to add git+ prefix. Chef will detect which backend to use (currently only git supported)

--remote-head not implemented so we need to run git pull manually.

Remote urls in recipe

supported in recipe and chef always use local repo (clone if it doesn't exist). --local-vcs-proxy is not implemented yet

in https://github.com/semio/ddf_utils/issues/129#issuecomment-690869566

We also had the idea to build a cache where we have a folder per commit hash

Now we have $DATASET_DIR/pkg for this propose, just that we need to ddf install each commit. In airflow server we can automate the installing

in https://github.com/semio/ddf_utils/issues/129#issuecomment-691116208

Allow custom mapping from git url to local folder

Chef will load installed packages from $DATASET_DIR/pkg, so we can have local git repos in any folder and run ddf install to install them to $DATASET_DIR/pkg.

This method doesn't have the flexibility of global configurations file, but it does allow people to have different folder structures

Build from shared but don't reset implicitly

Chef now supports the @latest version which is a symlink to the local repo dir, but it won't do reset or raise error when the repo dir is not clean

Things not implemented:

--private and other ddf run_recipe cli options
run git pull/fetch before running recipe
download resources from github directly

@jheeffer please have a try if you have time and see if you like this way of datapackage managing :)

semio / ddf_utils

Improve API for run_recipe #129

Feature: Local git repos which pull/fetch from remote before build

1. Relative paths in recipe

2. Remote urls in recipe

Allow custom mapping from git url to local folder

Build from shared but don't reset implicitly

Option to output a zipped file containing the dataset, excluding the etl folder

Utilities to manage dependencies

ddf get and ddf install:

changes in Chef

Feature: Local git repos which `pull/fetch` from remote before build

`ddf get` and `ddf install`: