kubeflow / arena

A CLI for Kubeflow.
Apache License 2.0
728 stars 178 forks

How to set dshm size for training? #1044

Closed (Andrew-Su-0718 closed this 1 month ago)

Andrew-Su-0718 commented 6 months ago

When I submit a PyTorchJob with arena, I couldn't find any parameter for the shared memory size, which is very important for PyTorch training.

The size is fixed at 2Gi:

...
    - mountPath: /dev/shm
      name: dshm
...
...
  - emptyDir:
      medium: Memory
      sizeLimit: 2Gi
    name: dshm
...
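
To confirm what limit the training container actually sees, the size of /dev/shm can be read from inside the pod. The following is a minimal sketch using only the Python standard library; the helper name shm_size_gib is hypothetical and not part of arena. Depending on the Kubernetes version and configuration, the reported total may reflect the emptyDir sizeLimit or the node's default tmpfs size.

```python
import os
import shutil


def shm_size_gib(path="/dev/shm"):
    """Return (total, free) space of the filesystem at `path`, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.free / gib


if __name__ == "__main__":
    # Run this inside the training container, e.g. via `kubectl exec`.
    if os.path.exists("/dev/shm"):
        total, free = shm_size_gib()
        print(f"/dev/shm total: {total:.2f} GiB, free: {free:.2f} GiB")
    else:
        print("/dev/shm is not present on this system")
```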

Does anyone know how to set the dshm size?

Andrew-Su-0718 commented 6 months ago

OK, I found a workaround. Edit the file /charts/pytorchjob/values.yaml and change:

shmSize: 2Gi

to

shmSize: 64Gi # or any value you want
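
The same edit can be scripted if it has to be repeated across installs. This is only a sketch, assuming shmSize is a top-level key in values.yaml as shown above; the function name set_shm_size is hypothetical and not part of arena, and the path to values.yaml is left to the caller.

```python
import re


def set_shm_size(values_text, new_size):
    """Return the values.yaml text with the top-level shmSize value replaced."""
    # Match only an unindented "shmSize:" key and capture the key plus spacing.
    pattern = re.compile(r"^(shmSize:\s*)\S+", flags=re.MULTILINE)
    # A callable replacement avoids ambiguity between the group reference
    # and a new size that starts with digits.
    new_text, count = pattern.subn(lambda m: m.group(1) + new_size, values_text)
    if count == 0:
        raise ValueError("no top-level shmSize key found")
    return new_text


if __name__ == "__main__":
    sample = "shmSize: 2Gi\n"
    print(set_shm_size(sample, "64Gi"), end="")  # prints: shmSize: 64Gi
```

To apply it, read the chart's values.yaml into a string, pass it through set_shm_size, and write the result back before submitting the job.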

yanshui177 commented 2 months ago

Same issue

Syulin7 commented 2 months ago

/assign