kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Support Local Execution of Training Jobs #2231

Open franciscojavierarceo opened 3 weeks ago

franciscojavierarceo commented 3 weeks ago

What would you like to be added?

The Kubeflow Pipelines v2 API supports running and testing pipelines locally without the need for Kubernetes. Ideally, the TrainingClient could also be extended to run locally for both the v1 and the forthcoming v2 APIs.

This is particularly appealing to Data Scientists who may not be as familiar with Kubernetes, or who want to develop and test their training jobs locally for a faster feedback loop.

For comparison, this is what makes Ray's library so easy for data scientists to get started with: their code just works, without having to think much about Kubernetes.
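
To make the ask concrete, here is a rough sketch of the kind of developer experience this could enable. Nothing below exists today: the local_runner argument is hypothetical, and the create_job call is only meant to mirror the shape of the existing SDK.

# Hypothetical sketch only: the TrainingClient has no local mode today, and
# the "local_runner" argument is invented here to illustrate the idea.
from kubeflow.training import TrainingClient

def train_func():
    # A user's ordinary training function, unchanged.
    print("training locally...")

client = TrainingClient(local_runner="subprocess")  # hypothetical argument
client.create_job(
    name="local-smoke-test",
    train_func=train_func,
    num_workers=2,  # would map to local processes instead of Pods
)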

Why is this needed?

Providing a great developer experience for Data Scientists is extremely valuable for growing adoption and catering to our end users.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

andreyvelich commented 2 weeks ago

Thank you for creating this @franciscojavierarceo. I think this is a great idea!

Can you please explain how KFP runs pipelines locally? Do I need a Docker runtime in my local environment to run it, and do I need a local Kind cluster running?

/area sdk

andreyvelich commented 2 weeks ago

/remove-label lifecycle/needs-triage

franciscojavierarceo commented 2 weeks ago

So KFP provides both a local subprocess runner and a Docker runner.

The Docker container approach is pretty straightforward (code below), but I actually prefer the subprocess approach, even though the KFP docs recommend the DockerRunner.

I understand why they recommend the Docker-based approach, but the subprocess runner is just easier for data scientists: you can pass in a list of packages to install into the virtual environment that is created to run the pipeline locally. I think that's probably the lowest-friction way for Data Scientists to get started with Training on Kubeflow (especially those unfamiliar with k8s).
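
For context, this is roughly what KFP v2 local execution looks like from the user's side; a minimal sketch based on the KFP docs, with a toy component standing in for real work:

from kfp import dsl, local

# Execute components in subprocesses inside a fresh virtual environment;
# local.DockerRunner() is the containerized alternative.
local.init(runner=local.SubprocessRunner(use_venv=True))

@dsl.component(packages_to_install=["numpy"])
def add(a: float, b: float) -> float:
    import numpy as np
    return float(np.add(a, b))

# Calling the component runs it immediately on the local machine.
task = add(a=1.0, b=2.0)
assert task.output == 3.0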

I think the Docker approach or the venv approach is probably all we would need as a start. Pipelines has to deal with complex DAG orchestration, whereas Training only needs to worry about executing the train_func, which makes local testing much easier. We'd have to figure out how best to map a local run onto the configuration parameters (e.g., num_workers, resources_per_worker, etc.), but that can be thought through in a spec.

Glad to hear you're supportive of this! I'll talk with folks on the team to investigate creating a spec on the implementation. 👍

Kubeflow Pipelines' Docker runner implementation:

# Excerpt from KFP's local Docker runner:
# https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/local/docker_task_handler.py
from typing import Any, Dict, List


def run_docker_container(
    client: 'docker.DockerClient',
    image: str,
    command: List[str],
    volumes: Dict[str, Any],
) -> int:
    # add_latest_tag_if_not_present is a helper defined in the same module.
    image = add_latest_tag_if_not_present(image=image)
    image_exists = any(
        image in existing_image.tags for existing_image in client.images.list())
    if image_exists:
        print(f'Found image {image!r}\n')
    else:
        print(f'Pulling image {image!r}')
        repository, tag = image.split(':')
        client.images.pull(repository=repository, tag=tag)
        print('Image pull complete\n')
    # Run the container detached so its logs can be streamed below.
    container = client.containers.run(
        image=image,
        command=command,
        detach=True,
        stdout=True,
        stderr=True,
        volumes=volumes,
    )
    for line in container.logs(stream=True):
        # the inner logs should already have trailing \n
        # we do not need to add another
        print(line.decode(), end='')
    # Block until the container exits and return its exit code.
    return container.wait()['StatusCode']
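
For comparison, here is a rough sketch (nothing like this exists in the Training SDK yet) of how a subprocess-based local runner could execute a train_func: since there is no DAG to orchestrate, num_workers could simply fan out to local processes.

# Illustrative sketch only -- not part of the Training SDK. One way a local
# runner could emulate num_workers by running train_func in subprocesses
# instead of Pods.
import multiprocessing as mp
from typing import Any, Callable, Dict, Optional


def run_train_func_locally(
    train_func: Callable,
    train_func_args: Optional[Dict[str, Any]] = None,
    num_workers: int = 1,
) -> None:
    ctx = mp.get_context("spawn")
    workers = []
    for _ in range(num_workers):
        # A real implementation would also set env vars such as RANK and
        # WORLD_SIZE so torch.distributed-style code keeps working locally.
        p = ctx.Process(target=train_func, kwargs=train_func_args or {})
        p.start()
        workers.append(p)
    for p in workers:
        p.join()
        if p.exitcode != 0:
            raise RuntimeError(f"local worker exited with code {p.exitcode}")

# Usage (under an `if __name__ == "__main__":` guard because of "spawn"):
# run_train_func_locally(my_train_func, {"epochs": 1}, num_workers=2)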