kenwoodjw / kubeflow-install

Kubeflow is very resource-hungry; is there a minimal installation? #3

Open wangpf09 opened 1 year ago

wangpf09 commented 1 year ago

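As the title says: I am starting it on my local machine for testing, resources are limited, and the deployment will not come up. If I only need the most basic PyTorch distributed training, which components are required? Any guidance would be appreciated.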

kenwoodjw commented 1 year ago

For a single machine I would not recommend Kubeflow. Is your distributed training single-node multi-GPU, or multi-node multi-GPU?

wangpf09 commented 1 year ago

Multi-node multi-GPU. This setup is only for testing; it is not the same as the production one.

kenwoodjw commented 1 year ago

Kubeflow depends on Kubernetes (k8s) and needs at least three nodes.

wangpf09 commented 1 year ago

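Hi, I have now got Kubeflow deployed, but when I run the official MNIST example it just hangs with no response. Could you help take a look? The PyTorchJob YAML is as follows: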

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist-ddp-gpu
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: chineseocr-registry.cn-chengdu.cr.aliyuncs.com/yihecode/yihecode:pytorch.mnist.ddp.gpu.v3
              name: pytorch
              resources:
                limits:
                  cpu: '1'
                  memory: 4Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /mnt/kubeflow-gcfs
                  name: kubeflow-gcfs
          volumes:
            - name: kubeflow-gcfs
              persistentVolumeClaim:
                claimName: kubeflow-gcfs
                readOnly: false
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: chineseocr-registry.cn-chengdu.cr.aliyuncs.com/yihecode/yihecode:pytorch.mnist.ddp.gpu.v3
              name: pytorch
              resources:
                limits:
                  cpu: '1'
                  memory: 4Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /mnt/kubeflow-gcfs
                  name: kubeflow-gcfs
          volumes:
            - name: kubeflow-gcfs
              persistentVolumeClaim:
                claimName: kubeflow-gcfs
                readOnly: false
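
For reference, a rough sketch of what the replica counts in this spec imply for the rendezvous. It assumes the usual training-operator behaviour of running one process per pod, with the master as rank 0 and the workers numbered after it; the helper names (`world_size`, `rank_layout`, `gpus_required`) are hypothetical and not part of Kubeflow or PyTorch. If fewer GPUs are allocatable than the job requests, some pods may stay Pending, and the pods that did start would keep waiting for the missing ranks.

```python
from typing import Dict, List, Tuple

# Replica counts and GPU limit copied from the PyTorchJob spec above.
REPLICAS: Dict[str, int] = {"Master": 1, "Worker": 2}
GPUS_PER_POD = 1  # nvidia.com/gpu limit on each container


def world_size(replicas: Dict[str, int]) -> int:
    """Number of processes the process group waits for (one per pod)."""
    return sum(replicas.values())


def rank_layout(replicas: Dict[str, int]) -> List[Tuple[str, int]]:
    """Assumed rank layout: master is rank 0, workers follow in index order."""
    layout: List[Tuple[str, int]] = []
    rank = 0
    for role in ("Master", "Worker"):
        for index in range(replicas.get(role, 0)):
            layout.append((f"{role.lower()}-{index}", rank))
            rank += 1
    return layout


def gpus_required(replicas: Dict[str, int], gpus_per_pod: int) -> int:
    """Total GPUs the cluster must be able to allocate for all pods to start."""
    return world_size(replicas) * gpus_per_pod


if __name__ == "__main__":
    print("world size:", world_size(REPLICAS))
    print("rank layout:", rank_layout(REPLICAS))
    print("GPUs required:", gpus_required(REPLICAS, GPUS_PER_POD))
```

With this spec the job needs three pods and three GPUs in total, so a useful first check is whether all three pods actually reach the Running state.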

kenwoodjw commented 1 year ago

Judging by the error, this is a timeout; see https://gemfury.com/neilisaac/python:torch/-/content/distributed/distributed_c10d.py. It is not a Kubeflow problem.

wangpf09 commented 1 year ago

It does look like a timeout. I am running an official pytorch-mnist test project, so I am not sure whether the code itself is at fault.
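
One way to narrow down a rendezvous timeout like this is to check, from inside a worker pod, whether the master's address and port are reachable at all. The sketch below uses only the Python standard library. It assumes the pod has `MASTER_ADDR` and `MASTER_PORT` set in its environment, and the fallback port 23456 is an assumed default rather than something confirmed in this thread; `can_reach_master` is a hypothetical helper name.

```python
import os
import socket


def can_reach_master(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


if __name__ == "__main__":
    # Assumed environment variables; the fallbacks are for local experiments only.
    host = os.environ.get("MASTER_ADDR", "localhost")
    port = int(os.environ.get("MASTER_PORT", "23456"))
    status = "reachable" if can_reach_master(host, port) else "NOT reachable"
    print(f"{host}:{port} is {status}")
```

If the master is not reachable, the cause is more likely to be scheduling or networking (for example, a pod stuck in Pending, or DNS for the master service not resolving) than the training script. If it is reachable and the job still times out, the `timeout` argument of `torch.distributed.init_process_group` and the script's choice of backend would be the next things to look at.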