pachyderm / pachyderm

Data-Centric Pipelines and Data Versioning
https://www.pachyderm.com/
Apache License 2.0

Common Git patterns: cloning and pushing #5854

Open RaananHadar opened 3 years ago

RaananHadar commented 3 years ago

What is the goal / desired outcome? A killer capability of Pachyderm is being git for data. Pachyderm can really simplify getting the right data to the user and back while elegantly coping with complexities of cloud technologies. This has many benefits:

  1. Local storage volumes can be expensive in the cloud, but because they are simpler to use, data scientists tend to rely on them more.
  2. Object storage, although not a complex technology, is still left unused by many data scientists.
  3. Object storage also varies between cloud vendors (for example, you don't have S3 on Azure unless you set up a special gateway, such as one with MinIO), and getting your users to use more object storage is usually a good idea.
  4. Many cloud providers offer temporary local volumes for machine learning applications, which makes pulling data from object storage a common pattern.

With that in mind, almost everybody knows git, and there are some fine-grained, day-to-day patterns that git users are accustomed to but that are not available in Pachyderm. Having them would open up many opportunities:

An easy example is cloning and pushing. Assuming that we have an images repo with 3 files:

/AT-AT.png    file 78.7KiB  
/kitten.png   file 102.4KiB 
/liberty.png file 57.27KiB

A user wants to get some data, make a simple change, and push it back when finished.

Today the user needs to do:

pachctl get file -r images@master:/ -o .

I would envision a user doing the following:

pachctl clone images@master .

This is syntactic sugar, but it is simpler and would be used quite frequently. Also, with a possible check against hashes, Pachyderm could be smart enough to pull only the files that changed.
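To make the hash-check idea concrete, here is a minimal sketch of how a client could decide which files to pull. It is not how Pachyderm works internally: the `remote` manifest (repo path mapped to a content hash), the choice of SHA-256, and the function names `local_hashes` and `files_to_pull` are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def local_hashes(root: Path) -> Dict[str, str]:
    """Map every file under root to its SHA-256 hex digest, keyed by a repo-style path."""
    hashes = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = "/" + path.relative_to(root).as_posix()
            hashes[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes


def files_to_pull(remote: Dict[str, str], local: Dict[str, str]) -> List[str]:
    """Return the repo paths that are missing locally or whose hash differs.

    `remote` is a hypothetical manifest of {path: hash} for the commit being cloned.
    """
    return sorted(path for path, digest in remote.items() if local.get(path) != digest)
```

With something like this, a second clone into an already populated directory would only need to download the paths returned by `files_to_pull`, instead of every file in the repo.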

Let's assume that a user made a change (I made a simple one, but it could be a complex one too!). The data now looks like this:

/AT-AT.png    file 78.7KiB  
/kitten.png   file 102.4KiB 
/liberty2.png file 57.27KiB

Now the user is satisfied with the changes and wants to put the data back in the repo as it is, similar to git. They want to create a new commit that only contains files that are pushed from a local directory. This can be done with the following commands:

pachctl start commit images@master
pachctl delete file images@master
pachctl put file -r images@master:/ -f .
pachctl finish commit images@master

Again, not as elegant. I really believe a lot of users would appreciate being able to do:

pachctl push . images@master

And again, Pachyderm can be smart enough to push only the changes.
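As a rough sketch of what "push only the changes" could mean, the snippet below compares a local manifest against the repo's manifest and works out which paths would need a put and which a delete. The function name `plan_push`, the manifest shape, and the short hash strings are placeholders invented for this example; they do not come from Pachyderm.

```python
from typing import Dict, List, Tuple


def plan_push(local: Dict[str, str], remote: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare {path: hash} manifests and return (paths to put, paths to delete)."""
    # Put anything that is new locally or whose content hash differs from the repo.
    to_put = sorted(path for path, digest in local.items() if remote.get(path) != digest)
    # Delete anything the repo has that no longer exists locally.
    to_delete = sorted(path for path in remote if path not in local)
    return to_put, to_delete


# The example from this issue: liberty.png was renamed to liberty2.png locally.
# The hash values are made-up placeholders.
remote = {"/AT-AT.png": "a1", "/kitten.png": "b2", "/liberty.png": "c3"}
local = {"/AT-AT.png": "a1", "/kitten.png": "b2", "/liberty2.png": "c3"}

to_put, to_delete = plan_push(local, remote)
print(to_put)     # ['/liberty2.png']
print(to_delete)  # ['/liberty.png']
```

In this case the new commit would only need one put and one delete, rather than deleting everything and re-uploading all three files.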

Is there an alternative way to do this? Yes: pachctl mount/unmount. Let's discuss it:

Overall, I think this has many advantages over pachctl mount and serves a different use case, and I really believe it would see extremely frequent use.

echohack commented 3 years ago

@RaananHadar Thank you for opening this feedback! I think many of these suggestions are centered around making pachctl more usable. In particular I've seen the pachctl push feedback suggested before, so I'm glad you've captured it here in this issue.

If the team makes a usability pass at pachctl in the future, this feedback will be helpful. Thank you!

mindthevirt commented 3 years ago
Agent Dale Georg linked Freshdesk ticket 260 for this issue.