Open wangpf09 opened 1 year ago
If you're on a single machine, I wouldn't recommend Kubeflow. Is your distributed training single-machine multi-GPU or multi-machine multi-GPU?
Multi-machine multi-GPU. This setup is only for testing, though, so it's not the same as what we'll use in production.
Kubeflow depends on Kubernetes and needs at least three nodes.
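If you only need PyTorchJob itself rather than the full Kubeflow platform, a lighter route is to install just the Kubeflow Training Operator (the controller that provides the PyTorchJob CRD) onto an existing cluster, even a small one. A minimal sketch, assuming the standalone kustomize overlay published in the kubeflow/training-operator repo; the exact path and version ref may differ, so check that repo's docs for your release:

# Install only the training operator (PyTorchJob CRD + controller); overlay path is an assumption,
# verify it against the kubeflow/training-operator documentation for your version.
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"

# Confirm the CRD exists and the controller pod is running (namespace is typically "kubeflow")
kubectl get crd pytorchjobs.kubeflow.org
kubectl get pods -n kubeflow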
Hi, I've got Kubeflow deployed now, but when I run the official MNIST example nothing happens. Could you help me take a look? The PyTorchJob YAML is as follows:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist-ddp-gpu
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: chineseocr-registry.cn-chengdu.cr.aliyuncs.com/yihecode/yihecode:pytorch.mnist.ddp.gpu.v3
              name: pytorch
              resources:
                limits:
                  cpu: '1'
                  memory: 4Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /mnt/kubeflow-gcfs
                  name: kubeflow-gcfs
          volumes:
            - name: kubeflow-gcfs
              persistentVolumeClaim:
                claimName: kubeflow-gcfs
                readOnly: false
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: chineseocr-registry.cn-chengdu.cr.aliyuncs.com/yihecode/yihecode:pytorch.mnist.ddp.gpu.v3
              name: pytorch
              resources:
                limits:
                  cpu: '1'
                  memory: 4Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /mnt/kubeflow-gcfs
                  name: kubeflow-gcfs
          volumes:
            - name: kubeflow-gcfs
              persistentVolumeClaim:
                claimName: kubeflow-gcfs
                readOnly: false
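When a PyTorchJob appears to do nothing, the first step is usually to see what state the job and its pods are actually in. A minimal debugging sketch, assuming the job above was created in the kubeflow-user-example-com namespace; the pod label used for filtering is an assumption and may differ between training-operator versions:

# Check the job status and the events recorded by the training operator
kubectl -n kubeflow-user-example-com get pytorchjobs
kubectl -n kubeflow-user-example-com describe pytorchjob pytorch-mnist-ddp-gpu

# Check whether the Master/Worker pods were scheduled at all; Pending usually means the
# nvidia.com/gpu, cpu/memory requests, or the kubeflow-gcfs PVC cannot be satisfied.
# (Label name is an assumption; older operator versions use a different job-name label.)
kubectl -n kubeflow-user-example-com get pods -l training.kubeflow.org/job-name=pytorch-mnist-ddp-gpu
kubectl -n kubeflow-user-example-com describe pod <pod-name>
kubectl -n kubeflow-user-example-com logs <pod-name>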
From the error it's a timeout issue (https://gemfury.com/neilisaac/python:torch/-/content/distributed/distributed_c10d.py), not a Kubeflow problem.
It does look like a timeout. I'm running the official pytorch-mnist test project, so I'm not sure whether the code itself is the problem.
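A timeout inside distributed_c10d.py usually means the rendezvous in init_process_group never completes: a worker cannot reach the master (pods still Pending, service/DNS not resolving, etc.), and the call raises once the default timeout expires. A minimal sketch of what the training entry point can print and tune, assuming the standard env:// initialization that the PyTorchJob controller sets up; the specifics are illustrative, not the official example's code:

import os
from datetime import timedelta

import torch.distributed as dist

# The PyTorchJob controller injects MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE into
# each pod; printing them confirms the workers can see the values they need to rendezvous.
for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
    print(var, "=", os.environ.get(var))

# A longer timeout only buys time for slow scheduling or image pulls; if a worker can
# never reach the master, the rendezvous still fails once this expires.
dist.init_process_group(
    backend="nccl",          # use "gloo" for the CPU variant of the example
    init_method="env://",
    timeout=timedelta(minutes=30),
)

If the printed MASTER_ADDR never resolves from a worker pod, the problem is on the cluster side (scheduling or networking) rather than in the training code.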
As the title says: I'm testing on my local machine with limited resources and can't get the full deployment running. If all I need is the most basic PyTorch distributed training, which components are required? Please enlighten me.