Open jheeffer opened 4 years ago
We can make dataset_path
as a list, just like the PATH
enviornment variable, chef will search dataset names in these paths.
then if chef needs to download a dataset from internet, it download it to some path (etl/source or just a temporary path) and add it to the top of dataset_path
to make the dataset available
If I understand correctly, without --private
chef will directly modify base datasets, which seems dangerous to me.. I suggest that, if we don't specify version/commit for a local dir, we load it directly, if we specify version/commit, we do checkout/export with or without --private
. So --private
only affects the base datasets without a commit hash
Another suggestion: the output dir can default to etl/build
, possibly adding a subfolder with time stamp, to not mix output with old files
pull/fetch
from remote before buildTwo options:
e.g. git+open-numbers/ddf--gapminder--co2
--remote-head
use remote branch heads instead of local ones (pull or fetch && export before build)
+ path to repo is unambiguous and explicitly stated in recipe
+ allows building from index rather than commit when omitted (i.e. allows build from uncommitted changes)
- Another user's local repo may have a different remote
and thus builds can differ per machine, depending on repo config
- Another user's local repo on that path may be a completely different dataset and thus build fails
- The path in recipe should be used by all users of that recipe
- Different recipes may use different paths for the same dataset, leading to redundant dataset storage
e.g. git+ssh://github.com/open-numbers/ddf--gapminder--co2
--local-vcs-proxy
find local repos to use as proxy for the remote vcs (git) repos. If not found clone. Then pull or fetch && export.
How to find local repo based on remote
url? Either look for all local repos and compare remote
url (but what if multiple results?), or create a path out of remote
url: github.com/open-numbers/ddf--dataset => ./github.com/open-numbers/ddf--dataset/
using urlparse
and tldextract
+ recipe refers to a global url, one source of truth for all
+ works even if datasets are not yet locally available
- finding local repos either forces folder structure (remote url -> path) or can be ambiguous (find repo by remote url)
- can't build from uncommitted changes unless we add another option (maybe --no-remote-head
. Or default is --no-remote-head
and have to add --remote-head
when you want to fetch from remote first. Then it turns into a cache instead of a proxy. Maybe use cache DSL: --local-vcs-cache
and --cache-revalidate
. Though --cache-revalidate
doesn't make sense for relative path repos which you want to pull.)
Maybe we should support both options, but prefer remote urls in recipes.
Another suggestion: the output dir can default to
etl/build
, possibly adding a subfolder with time stamp, to not mix output with old files
That would mean you have to copy paste from build folder to root before commit and manually delete old files? I like the way it is now: remove old dataset and replace with new one. Or does it currently not clean and instead just overwrites files? Meaning output can be mixed with old files? It doesn't
You can of course set -o
to whatever you want. But your suggestion is the default of -o is ./etl/build/[%timestamp%]
instead of ./
, then?
We can make dataset_path as a list, just like the PATH enviornment variable, chef will search dataset names in these paths.
Sure, that's possible. Sounds like --find_links
in python install, which is a repeatable option. Have you had situations where you need multiple dataset paths?
then if chef needs to download a dataset from internet, it download it to some path (etl/source or just a temporary path) and add it to the top of dataset_path to make the dataset available
Sure, though that seems an implementation detail, not API design?
If I understand correctly, without --private chef will directly modify base datasets, which seems dangerous to me..
Yeah definitely dangerous. Can lead to all kinds of problems like the pull's merge not finishing (conflicts or plain error). And if you have changes in your working tree, git pull
does not reset them. So I guess you'd also need a git reset --hard
, which can be unwelcomely destructive ;).
I suggest that, if we don't specify version/commit for a local dir, we load it directly, if we specify version/commit, we do checkout/export with or without --private. So --private only affects the base datasets without a commit hash
I agree we should be able to build from the current state of the working tree, not just from a commit. Only building from commits makes recipe development unnecessarily hard.
I agree the absence of a specifier (branch/tag/ref/hash) can mean the repo as-is. The other option is it means the HEAD of the repo, i.e. ignoring uncommitted changes.
But, I'd say you should be able to do so éven if a branch, tag or ref is specified. Maybe even if a commit is specified? Systema Globalis would be based on ddf--open_numbers@master
, but I want to try my SG build before committing my changes to ddf--open_numbers
, or without having to remove the specifier in the recipe while developing (and then forgetting putting it back).
Maybe an option (--vcs-use-working-dir
) to use working tree of repos as-is?
I think it's good to have --private
as a default then, to prevent errors. Even if you use the working-dir as-is, a copy-paste may be good. For consistency and to prevent changes during build interfering.
But I feel you should be able use datasets as they are if you know what you're doing. Private mode takes a lot of time and disk space.
Option --no-private
: hard reset vcs datasets (or use in-place if --vcs-use-working-dir
) and use non-vcs datasets in-place?
What if dataset_path is a subfolder of the dataset root, e.g. the default ./etl/source/
? It feels redundant to then still copy paste datasets to another temp folder. On the one hand we can add a rule to not use temp
in this case for efficiency's sake. On the other.. adds "magic" and complexity.
Anyway, work in progress, let this marinade : )
We also had the idea to build a cache where we have a folder per commit hash. Refs (branch/tags) would be soft links to these folders. Would prevent errors by multiple builds using same shared repo, but still safe time & memory by datasets not having to export all their dependencies to their own temp folder before each build.
But invalidation/purging we didn't figure out yet.
I talked with Angie about this. Two points he made:
He has his own folder structure for datasets but still wants to benefit from using the local git repos as a proxy. To allow for this we could have a global configuration for the ddf
cli, which maps git urls to local folders. If ddf
can't find a dataset it can tell the user ddf
can download it for them or where to download it manually and edit this mapping file. Should be worked out more probably.
This could be a nice feature, but not for a first version.
He would prefer to save space and not copy/paste by default. But he would never want a build to reset his git working tree unexpectedly. So he proposed that when building through local git with uncommitted changes, don't build and show warning to use either --vcs-use-working-dir
flag or --private
.
This is a good option to consider.
We can make dataset_path as a list, just like the PATH enviornment variable, chef will search dataset names in these paths.
Sure, that's possible. Sounds like --find_links in python install, which is a repeatable option. Have you had situations where you need multiple dataset paths?
I haven't had such situations, but I meant that we now have 3 ways to access a dataset:
Different ways mean different place to store the datasets. So I think we need to support multiple dataset paths.
You can of course set -o to whatever you want. But your suggestion is the default of -o is ./etl/build/[%timestamp%] instead of ./, then?
I was thinking about that the chef and recipe is just like a build system, it is worth to check out existing systems (for example, cargo
for Rust and go
for Go). Usually bulid systems will have a build
command which builds and create intermediate/final artifacts and an install
command that installs final artifacts to target dir. That's why I think we should put the output to a build folder.
However I now think our situation is different, as our target dir is under git control (other build systems I checked install to folders like /usr/local/bin), it's easy to roll back. So it's ok to put default to ./
.
We will need a build folder to store intermediate outputs if we need to support incremental builds.
Maybe an option (--vcs-use-working-dir) to use working tree of repos as-is?
agreed.
I think it's good to have --private as a default then, to prevent errors. Even if you use the working-dir as-is, a copy-paste may be good.
agreed.
Build from shared but don't reset implicitly
He would prefer to save space and not copy/paste by default. But he would never want a build to reset his git working tree unexpectedly. So he proposed that when building through local git with uncommitted changes, don't build and show warning to use either --vcs-use-working-dir flag or --private
or suggest user to make a commit first.
What if dataset_path is a subfolder of the dataset root, e.g. the default ./etl/source/?
I guess this question should be left to the user, if no copy/paste is needed he can use --no-private
We can upload custom assets when creating a github release, so we can further decrease the network/disk usage by removing the etl folder from release
To solve the issue of local paths and recipe dependencies mapping, I suggest to add a datapackages.lock
file which contains additional informations about local repos/datasets and this file won't be commit to github as it contains local information
datapackages.lock
is a list of objects with following info:
And ddf run_recipe
will load dependencies based on datapackages.lock
. User can exam and modify datapackages.lock
to make sure the dependencies and local dir mapping is correct.
Also adding following commands might be good:
ddf resolve
: check if all datasets listed in recipe are available, if not, clone/download it, and then create the datapackages.lock
ddf get
: clone/download a dataset, and add it to datapackages.lock
ddf update
: update dependencies based on datapackages.lock
. If it's a local repo, do git pull, if not, download the target version. In https://github.com/semio/ddf_utils/issues/129#issuecomment-691116208
Maybe we should support both options, but prefer remote urls in recipes.
I am not sure why prefer remote urls in recipe, when we do programming we usually import things with its name right?
some notes on how to export with git
here are a few options I tried:
git archive
:
git checkout-index
:
git worktree add
git worktree prune
to cleanup)git clone --shared
:
I think the worktree
one (add a new worktree, then remove .git from new worktree folder and prune in origin repo) is the best option, because it don't require zip/unzip and don't require the target tag/commit to be checked out
All of above can't export the repo as-is, i.e. won't work when there are uncommitted changes. But instead of using git we can just copy all files, so it's not a problem
I have added some experimental features in recipe_build_tools
branch:
ddf get
and ddf install
:Please check this recording for a demo. https://asciinema.org/a/CYD8QrD3u93uYwhdRhd1skfd7
Notes:
The ideas mostly came from golang's package management solution, such as https://golang.org/doc/gopath_code.html#GOPATH and https://golang.org/ref/mod#versions
-i etl/recipes/etl.yaml
and -o ./
. So now you can run ddf run_recipe
in the project root.@latest
@latest
version) will be usewith this we have some basic traceability now, we know which version of dependencies are used in the build.
let's see what did we achieved with the latest commit (69661843262876ca98af49c4ef8490d2780ec05c)
In https://github.com/semio/ddf_utils/issues/129#issuecomment-690835612
Relative paths in recipe
supported in recipe and we don't need to add git+
prefix. Chef will detect which backend to use (currently only git supported)
--remote-head
not implemented so we need to run git pull manually.
Remote urls in recipe
supported in recipe and chef always use local repo (clone if it doesn't exist). --local-vcs-proxy
is not implemented yet
in https://github.com/semio/ddf_utils/issues/129#issuecomment-690869566
We also had the idea to build a cache where we have a folder per commit hash
Now we have $DATASET_DIR/pkg
for this propose, just that we need to ddf install
each commit. In airflow server we can automate the installing
in https://github.com/semio/ddf_utils/issues/129#issuecomment-691116208
Allow custom mapping from git url to local folder
Chef will load installed packages from $DATASET_DIR/pkg
, so we can have local git repos in any folder and run ddf install
to install them to $DATASET_DIR/pkg
.
This method doesn't have the flexibility of global configurations file, but it does allow people to have different folder structures
Build from shared but don't reset implicitly
Chef now supports the @latest
version which is a symlink to the local repo dir, but it won't do reset or raise error when the repo dir is not clean
Things not implemented:
--private
and other ddf run_recipe
cli options@jheeffer please have a try if you have time and see if you like this way of datapackage managing :)
Let's think of how to redesign the API of run_recipe. This is still very work-in-progress ; )
Goal is a user friendly API that allows you to:
git checkout
), or not wanting to pull/checkout a shared local git repo to get files you need for commit)builds dataset using recipe and other base datasets
-i <recipe_path>
defaults to./etl/recipes/etl.yml
or./etl/recipes/etl.yaml
-o <output_path>
defaults to./
--ddf_dir <dataset_path>
defaults to./../../
--private [<private_path>]
use private build environment. Copy any base datasets first to private folder--private_path
defaults to./etl/source/
maybe?--remote-head
use the head of theorigin
remote branch for any local git repo dataset.git pull
or when privategit fetch &&
export to different folder
This allows being in a dataset's root and just running
ddf run_recipe
to build the datasetBase dataset specifier format, used in recipes. Based on
pip install
. We don't have an index yet, so options are either a local path or git url. If either is relative, it's relative to<dataset_path>
paths:
urls (get resources from
./datapackage.json
, then download, though does not allow download of assets, etl folder, etc):git urls (
git+
allows extending functionality to other applications later. git protocols)Adding a branch name, a commit hash, a tag name or a git ref is possible like so:
Maybe we don't need
--private
if we define the base branches/commits in the recipe? I don't recall the exact problems we had in datakitchen that lead us to thinking we'd need it. Should look that up in chat logs.