Pachyderm File System Source Type for KVC

dwhitena commented 6 years ago

As discussed in our recent meeting, https://github.com/kubeflow/kubeflow/issues/151#issuecomment-371628634 requires a way to expose data from Pachyderm to a TFJob. Moreover, this type of data access pattern would be useful for integrating any distributed training framework (e.g., SparkML) or other resource into a Pachyderm pipeline.

In our discussion, we proposed creating a source type for exposing data from the versioned Pachyderm file system (which is backed by an object store). I suggest this format:

apiVersion: kvc.kubeflow.org/v1
kind: VolumeManager
metadata:
  name: kvc-example1
  namespace: <insert-namespace-here>
spec:
  volumeConfigs:
    - id: "vol1"
      replicas: 1
      sourceType: "PFS"
      sourceRepo: <insert input repo name or names here>
      sourceBranch: <insert input repo branch here, e.g., "master">
      accessMode: "ReadWriteOnce"
      capacity: 5Gi
      labels:
        key1: val1
        key2: val2
      options:
        pachSecretName: <insert-secret-name-for-pach-auth-and-host>

This would allow the connector to utilize the Pachyderm client to pull the necessary data into the volume.

balajismaniam commented 6 years ago

@dwhitena Thanks for opening the issue.

As a first cut let's do the following: On addition of CR: The data from the specified repo and branch is downloaded and replicated among different nodes in the cluster. On deletion of the CR: The data is cleaned-up from the nodes.

Later on we can add the ability to update the data on update of a CR. The first implementation for handling updates could be just re-downloading the data from the specified repo and branch. Later on we can optimization to download only the diff if possible and patch the data already available in each of the nodes.

WDYT?

bhack commented 6 years ago

/cc @jlewi

jlewi commented 6 years ago

In TF one of the most common patterns for distributed training is

all workers read the data (e.g. worker processes a different subset)
one worker writes the model/checkpoints

So is the idea with KVC that the entire dataset is replicated across multiple nodes and there is a worker on each node?

In Cloud, I think we would want to put the data on PDs and move the PDs around. Does KVC support that or does KVC assume you are storing it on host nodes?

jlewi commented 6 years ago

Do we need a corresponding issue for a sink?

bhack commented 6 years ago

Also what about frameworks internal cache on fs? Is this an alternative? How this will interact with https://github.com/kubeflow/experimental-kvc/issues/26 if we need to split a large network fs dataset in shards for each worker?

dwhitena commented 6 years ago

Great questions @jlewi. I would be open to various options regarding how the data is spread across the nodes. In a first case (to illustrate things most simply), I think we should show distributing it to every node, and then maybe optimize.

@balajismaniam Your approach sounds good. One thing that I would like to emphasize is that we will likely want to get data back into pachyderm during the "clean-up," we are thinking of doing this in an Argo workflow (that would also manage the KVC I think). Anyway, something to keep in mind.

balajismaniam commented 6 years ago

@jlewi

So is the idea with KVC that the entire dataset is replicated across multiple nodes and there is a worker on each node?

Yes. KVC can be used in this way.

In Cloud, I think we would want to put the data on PDs and move the PDs around. Does KVC support that or does KVC assume you are storing it on host nodes?

KVC can use any of the volume sources available in Kubernetes including GCE PDs. It doesn't have any assumptions based on hosting it in the nodes. We already support NFS for example: https://github.com/kubeflow/experimental-kvc/tree/master/resources/customresources/nfs.

But caching the data locally on host nodes seems to be the most popular reason why people use KVC.

Do we need a corresponding issue for a sink?

Sure. Please open it if it needs to be taken care separately. Otherwise, we can handle this as a part of this issue.

balajismaniam commented 6 years ago

@dwhitena Makes sense. During clean-up, we can get the data back to PFS using KVC.

@dwhitena @jlewi regarding data sharding, currently there is no mechanism natively in KVC to do this. However, if an external service such as PFS takes care of sharding, KVC can be used distribute/make these shards available on the nodes if/when required using any of the volume sources in Kubernetes.

dwhitena commented 6 years ago

@balajismaniam @jlewi Pachyderm has a built in mechanism for data sharding, but that is on the Pachyderm worker level. The tricky thing here is that we are running distributed TF as, basically, on Pachyderm worker. I will need to think about this a bit, but I think putting all the data on all the nodes is a starting point.

@balajismaniam sounds great on the "clean-up" bit. Basically we will just want to gather anything that is written out to a specific directory (e.g., /volume/pfs/out) and commit it back to Pachyderm.

dwhitena commented 6 years ago

@balajismaniam Any updates here? Let me know if we can test anything.

balajismaniam commented 6 years ago

@dwhitena Sorry for the delay in response. The initial version of Pachyderm handler as discussed in https://github.com/IntelAI/vck/issues/22#issuecomment-380260734 has been implemented and merged. We (@Ajay191191 and I) could do a demo this feature next week. Would you have time for a meeting next week?

CC @scttl @nqn

carmine commented 6 years ago

Any update on this?

balajismaniam commented 6 years ago

@carmine The support for pachyderm was added in https://github.com/IntelAI/vck/pull/35.

intel / vck

Pachyderm File System Source Type for KVC #22