kubeflow-kale / kale

Kubeflow’s superfood for Data Scientists
http://kubeflow-kale.github.io
Apache License 2.0
628 stars, 129 forks

A trivial question about Kale (or Argo) resource allocation #368

Open: coldtomatojuice opened this issue 3 years ago

coldtomatojuice commented 3 years ago

Hello, thank you for your great work.

I ran into an issue when I deployed my own pipeline with Kale. It's a small GAN trained on 10k images of 256x256. When I simply executed the Jupyter notebook, it worked fine; the notebook server on Kubeflow has 4 CPUs and 8Gi of memory allocated.

But when I ran the same code as a pipeline with Kale, the pod running the train function was killed with OOM (out of memory).
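A rough, back-of-the-envelope estimate helps explain why the same code can fit in the notebook's 8Gi but not in a smaller pod. This sketch assumes (it is not stated in the issue) that the 10k images are 3-channel RGB and are all loaded into memory at once; the helper name `dataset_bytes` is hypothetical.

```python
GIB = 1024 ** 3


def dataset_bytes(num_images: int, height: int, width: int,
                  channels: int = 3, bytes_per_value: int = 1) -> int:
    """Raw in-memory size of an image dataset held as one dense array.

    Assumption (hypothetical): every image is fully decoded and kept in memory,
    so the model, framework, and Python runtime come on top of this number.
    """
    return num_images * height * width * channels * bytes_per_value


# 10k RGB images of 256x256 stored as uint8 (1 byte per value)
as_uint8 = dataset_bytes(10_000, 256, 256, bytes_per_value=1)
# the same images converted to float32 (4 bytes per value), common before training
as_float32 = dataset_bytes(10_000, 256, 256, bytes_per_value=4)

print(f"uint8:   {as_uint8 / GIB:.2f} GiB")    # uint8:   1.83 GiB
print(f"float32: {as_float32 / GIB:.2f} GiB")  # float32: 7.32 GiB
```

Under these assumptions the uint8 data alone nearly fills a 2Gi limit, and a float32 copy exceeds it while still fitting within the notebook's 8Gi.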

I found that only 128Mi of memory is requested for the pod in which the train function runs (its memory limit is 2Gi):

Limits:
  cpu: 1
  memory: 2Gi
Requests:
  cpu: 100m
  memory: 128Mi
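The request (128Mi) is what the scheduler reserves for the pod; the container is OOM-killed when it exceeds its memory limit (2Gi here) or when the node itself runs out of memory. The sketch below converts the quantity strings above into numbers so the request and limit can be compared. `parse_quantity` is a hypothetical helper that only handles plain numbers and the common suffixes, not the full Kubernetes quantity grammar.

```python
from decimal import Decimal

# Binary (power-of-two) and decimal (power-of-ten) suffixes used in Kubernetes quantities.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}
_DECIMAL = {"m": Decimal("0.001"), "k": 10**3, "M": 10**6, "G": 10**9,
            "T": 10**12, "P": 10**15, "E": 10**18}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as '128Mi', '2Gi', '100m' or '1' to a Decimal.

    Simplified sketch: binary suffixes are checked first so 'Gi' is not
    mistaken for the decimal 'G'.
    """
    text = text.strip()
    for suffix, factor in _BINARY.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    for suffix, factor in _DECIMAL.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * Decimal(factor)
    return Decimal(text)


# Values reported for the train step's pod
limits = {"cpu": "1", "memory": "2Gi"}
requests = {"cpu": "100m", "memory": "128Mi"}

MIB = 2**20
print(parse_quantity(requests["memory"]) / MIB, "MiB requested")  # 128 MiB requested
print(parse_quantity(limits["memory"]) / MIB, "MiB limit")        # 2048 MiB limit
```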

Can I change the amount of memory allocated to the pod before the pipeline runs? Is there any way to set the resources allocated to each pipeline pod with Kale?

coldtomatojuice commented 3 years ago

So I found the reason the resources are allocated as described above: a Kubernetes LimitRange in the namespace is configured to set these values.
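For reference, a LimitRange that would produce exactly the values observed above might look like the following. This is a sketch reconstructed from the reported numbers, not the actual object from the cluster; the object name and namespace are hypothetical. It is written as a Python dict and printed as JSON, a format `kubectl apply -f` also accepts.

```python
import json

# Hypothetical LimitRange reconstructed from the observed pod values.
# `default` fills in limits a container leaves unset;
# `defaultRequest` fills in requests a container leaves unset.
limit_range = {
    "apiVersion": "v1",
    "kind": "LimitRange",
    "metadata": {
        "name": "default-container-resources",  # hypothetical name
        "namespace": "kubeflow-user",            # hypothetical namespace
    },
    "spec": {
        "limits": [
            {
                "type": "Container",
                "default": {"cpu": "1", "memory": "2Gi"},
                "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
            }
        ]
    },
}

print(json.dumps(limit_range, indent=2))
```

The real object can be inspected with `kubectl get limitrange -n <namespace> -o yaml`; raising `default` and `defaultRequest` there changes the fallback for every step in the namespace at once.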

Since the LimitRange sets default limit and request values for pods, all Kale steps run with the same fixed resources. Therefore, I still can't allocate different amounts of CPU and memory to each pipeline component.
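LimitRange defaults only fill in values a container leaves unset, so a container that declares its own `resources` is not pinned to them. The sketch below is a simplified model of that behavior (it ignores the LimitRange `min`, `max`, and ratio checks); `apply_limit_range_defaults` and `_qty` are hypothetical helpers, not a Kubernetes or Kale API. It also models one pitfall: raising only the memory request above the default limit gives request > limit, which Kubernetes rejects as invalid, so both should be set together.

```python
from decimal import Decimal

# Hypothetical namespace defaults, matching the values observed in the issue.
DEFAULT_LIMITS = {"cpu": "1", "memory": "2Gi"}
DEFAULT_REQUESTS = {"cpu": "100m", "memory": "128Mi"}

_FACTORS = {"Mi": Decimal(2**20), "Gi": Decimal(2**30), "m": Decimal("0.001")}


def _qty(text):
    """Tiny quantity parser covering only the suffixes used in this example."""
    for suffix, factor in _FACTORS.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def apply_limit_range_defaults(resources, default_limits, default_requests):
    """Simplified model of how a LimitRange fills in a container's resources."""
    limits = dict(resources.get("limits", {}))
    requests = dict(resources.get("requests", {}))
    # A request left unset first falls back to the container's own explicit limit.
    for name, value in limits.items():
        requests.setdefault(name, value)
    # Anything still unset is taken from the namespace defaults.
    for name, value in default_limits.items():
        limits.setdefault(name, value)
    for name, value in default_requests.items():
        requests.setdefault(name, value)
    # A pod whose request exceeds its limit is rejected as invalid.
    for name in requests.keys() & limits.keys():
        if _qty(requests[name]) > _qty(limits[name]):
            raise ValueError(f"{name} request {requests[name]} exceeds limit {limits[name]}")
    return {"requests": requests, "limits": limits}


# A step that declares nothing gets the namespace defaults (what the issue observed).
print(apply_limit_range_defaults({}, DEFAULT_LIMITS, DEFAULT_REQUESTS))
# A step that declares its own memory keeps it; only cpu falls back to the defaults.
train = {"requests": {"memory": "8Gi"}, "limits": {"memory": "8Gi"}}
print(apply_limit_range_defaults(train, DEFAULT_LIMITS, DEFAULT_REQUESTS))
```

Under this model there are two ways out: raise the namespace defaults, or have the pipeline set explicit requests and limits on the training step's container.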