kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[feature] Set runtime options in kwargs when using Docker with kfp local #10544

Closed. Nishikoh closed this issue 1 month ago.

Nishikoh commented 4 months ago

Feature Area

/area sdk

What feature would you like to see?

Set runtime options in kwargs when using Docker with kfp local.

What is the use case or pain point?

Currently, no Docker runtime options can be set when running a pipeline locally with the DockerRunner. For example, a machine learning training task can fail because the container's shared memory (shm) is too small. To solve this, we would like to be able to pass runtime options, for example as kwargs; a sketch of what that could look like follows the error log below.

Here is an example error log:

    2024-03-06 05:10:18 | INFO     | yolox.core.trainer:155 - init prefetcher, this might take one minute or less...
    ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
    ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
    ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
    object address  : 0x7f63b5daf700
    object refcount : 2
    object type     : 0x92c320
    object type name: RuntimeError
    object repr     : RuntimeError('DataLoader worker (pid(s) 49) exited unexpectedly')
    lost sys.stderr
    [KFP Executor 2024-03-06 05:10:29,632 INFO]: Wrote executor output ...
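
A minimal sketch of what such an API might look like, assuming extra kwargs on DockerRunner were forwarded to docker-py's containers.run() call. This is the requested behaviour, not something the SDK supports today; the shm_size argument below is hypothetical at the DockerRunner level, though it is a real option of docker-py's containers.run().

from kfp import dsl, local

# Hypothetical API: extra kwargs on DockerRunner would be forwarded to
# docker-py's containers.run(), which already accepts options such as shm_size.
local.init(runner=local.DockerRunner(shm_size="2g"))

@dsl.component
def train():
    # With a larger shm, DataLoader workers would have enough shared memory
    # to run the local behaviour check.
    ...

train()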

Is there a workaround currently?

I am not aware of one.


Love this idea? Give it a 👍.

rimolive commented 4 months ago

Can you elaborate? I understand you would like to test pipelines locally, but depending on the resources required to train that model you may need a production-ready Kubernetes cluster. If you can provide code snippets, configuration files, or other details, it will help us understand the problem.

Nishikoh commented 4 months ago

The example above is object detection with YOLOX. In production I will train on Vertex AI Pipelines, but first I want to make sure the components work as intended locally. Since the local execution is only a behaviour check, I run it with a small dataset and a small number of epochs; the full dataset and epoch count will run on Vertex AI Pipelines in production.

The YOLOX case above is hard to reduce to a code snippet, so here is a simpler example that requires CUDA at runtime. It needs the Docker runtime to be GPU-aware.

from kfp import dsl, local

# Execute components locally in Docker containers
local.init(runner=local.DockerRunner())

@dsl.container_component
def gpu_processing():
    # CUDA sample image that adds two vectors on the GPU
    return dsl.ContainerSpec(
        image="gcr.io/google_containers/cuda-vector-add:v0.1",
    )

# Calling the component runs the container immediately under the local runner
task = gpu_processing()

When I run it, CUDA is not detected and it fails with the following error:

02:21:39.615 - INFO - Executing task 'gpu-processing'
02:21:39.615 - INFO - Streamed logs:

    Pulling image 'gcr.io/google_containers/cuda-vector-add:v0.1'
    Image pull complete

    Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
    [Vector addition of 50000 elements]

I expect the following result:

10:43:49.816 - INFO - Executing task 'gpu-processing'
10:43:49.816 - INFO - Streamed logs:

    Found image 'gcr.io/google_containers/cuda-vector-add:v0.1'

    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
10:43:51.690 - INFO - Task 'gpu-processing' finished with status SUCCESS
10:43:51.691 - INFO - Task 'gpu-processing' has no outputs

If the user could pass Docker runtime options (for example a GPU device request or a larger shm size), the run would produce the expected result.
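
For illustration, here is a sketch of what the requested kwargs might look like for the GPU case. The DockerRunner arguments below are hypothetical; the point is that docker-py's containers.run() already accepts device_requests and shm_size, so the local runner would only need to pass them through.

import docker
from kfp import local

# Hypothetical kwargs, passed through to docker.DockerClient.containers.run().
# DeviceRequest(count=-1, capabilities=[["gpu"]]) is docker-py's equivalent of
# `docker run --gpus all`.
local.init(
    runner=local.DockerRunner(
        device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
        shm_size="2g",
    )
)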

rimolive commented 3 months ago

Are you following this method to execute the pipeline? https://www.kubeflow.org/docs/components/pipelines/v2/local-execution/

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 month ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.