petuum / adaptdl

Resource-adaptive cluster scheduler for deep learning training.
https://adaptdl.readthedocs.io/
Apache License 2.0

hello_world can not run #112

Open czq693497091 opened 2 years ago

czq693497091 commented 2 years ago

I installed Kubernetes (v1.18.2) on a local cluster and used Helm (v2.17.0) to install adaptdl and adaptdl-sched successfully:

root@k8s-master:/home/czq/Pollux/adaptdl_v2/examples/mnist# kubectl get pod -A | grep adaptdl
adaptdl adaptdl-registry-697884b65-wf4w6 1/1 Running 0 17h
adaptdl jazzed-koala-adaptdl-sched-85d75fdb5d-9lvzq 3/3 Running 6 17h
adaptdl jazzed-koala-validator-98f8fcf7c-jj959 1/1 Running 0 17h
adaptdl peeking-ostrich-adaptdl-sched-667c78f9fb-fr2zj 3/3 Running 4 17h

I wrote the hello_world project exactly as described in the introduction, with the following structure:

└── hello_world
    ├── adaptdljob.yaml
    ├── Dockerfile
    └── hello_world.py

I ran "adaptdl submit hello_world" and got the following output:

/usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.26.8) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using AdaptDL insecure registry.
Sending build context to Docker daemon 4.096kB
Step 1/4 : FROM python:3.7-slim
 ---> d3c9ad326043
Step 2/4 : RUN python3 -m pip install -i https://pypi.tuna.tsinghua.edu.cn/simple adaptdl
 ---> Using cache
 ---> 05dae174d67e
Step 3/4 : COPY hello_world.py /root/hello_world.py
 ---> Using cache
 ---> 10d12170490d
Step 4/4 : ENV PYTHONUNBUFFERED=true
 ---> Using cache
 ---> bc04efd29920
Successfully built bc04efd29920
Successfully tagged localhost:59283/adaptdl-submit:latest
Using default tag: latest
The push refers to repository [localhost:59283/adaptdl-submit]
2cab9519a560: Layer already exists
16f13637494a: Layer already exists
25ad0307b4c1: Layer already exists
874b45955cb1: Layer already exists
85c923303735: Layer already exists
d0fa20bfdce7: Layer already exists
2edcec3590a4: Layer already exists
latest: digest: sha256:7346ece45037f13481a30a50907418bbd460035f488a1aab3cfb0f8ebdf35644 size: 1790
W0126 21:25:38.652722 75926 helpers.go:535] --dry-run is deprecated and can be replaced with --dry-run=client.
Unsupported storageclass from available storageclasses []

I then ran "adaptdl ls", but the job is not listed:

root@k8s-master:/home/czq/Pollux/adaptdl_v2/examples/HelloWorld# adaptdl ls
/usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.26.8) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
No adaptdljobs
Name Status Start(UTC) Runtime Rplc Rtrt

How can I fix this so that the job runs correctly?

aurickq commented 2 years ago

Unsupported storageclass from available storageclasses []

It looks like your K8s might not have any storageclasses installed. AdaptDL requires a shared filesystem which can be used to store checkpoints and other information when a job is restarted. Once you have a storageclass for a shared filesystem installed, you can pass it into the submit command with --checkpoint-storage-class=....
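To make the check above concrete, here is a minimal Python sketch, not part of AdaptDL itself. It assumes you have saved the output of `kubectl get storageclass -o json` as a string, lists the storage class names in it, and builds the matching submit command. The helper names `storage_class_names` and `submit_args` and the class name `nfs-shared` are hypothetical; only the `--checkpoint-storage-class` flag comes from this thread.

```python
import json
from typing import List


def storage_class_names(kubectl_json: str) -> List[str]:
    """Names of the StorageClasses in `kubectl get storageclass -o json` output."""
    items = json.loads(kubectl_json).get("items", [])
    return [item["metadata"]["name"] for item in items]


def submit_args(job_dir: str, storage_class: str) -> List[str]:
    """Argument list for `adaptdl submit` with an explicit checkpoint storage class."""
    return ["adaptdl", "submit", job_dir,
            f"--checkpoint-storage-class={storage_class}"]


# A cluster with no storage classes, matching "available storageclasses []" above.
empty = '{"apiVersion": "v1", "kind": "List", "items": []}'
# Hypothetical cluster with one class; "nfs-shared" is a made-up name.
sample = ('{"apiVersion": "v1", "kind": "List", '
          '"items": [{"metadata": {"name": "nfs-shared"}}]}')

print(storage_class_names(empty))   # []
print(storage_class_names(sample))  # ['nfs-shared']
print(" ".join(submit_args("hello_world", "nfs-shared")))
```

If the first list is empty on your cluster, a storage class for a shared filesystem has to be installed before the submit command can succeed.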

SHu0421 commented 2 years ago

Hello @aurickq, I also can't run hello_world, but I ran into a different problem:

The push refers to repository [docker.io/cindybrain/adaptdl-submit]
42247853ddec: Layer already exists 
7bb07b4b650b: Layer already exists 
6f3a145bdf9a: Layer already exists 
07752303aace: Layer already exists 
a598855c21e0: Layer already exists 
2c7950a1245f: Layer already exists 
9c1b6dd6c1e6: Layer already exists 
latest: digest: sha256:cc13db8e078711414917d022979ca54f73e48167d32f33a59cd3eb38830df392 size: 1790
W0502 17:37:55.091458  148014 helpers.go:535] --dry-run is deprecated and can be replaced with --dry-run=client.
Failure (InternalError): Internal error occurred: failed calling webhook "adaptdl-validator.adaptdl.svc.cluster.local": Post https://adaptdl-validator.adaptdl.svc:443/validate?timeout=10s: dial tcp 10.152.183.186:443: connect: connection refused

Because I am not familiar with ValidatingWebhookConfiguration, I don't know how to solve it.

aurickq commented 2 years ago

@SHu0421 This error could be caused by a variety of reasons. You can start by checking kubectl -n <adaptdl namespace> get all (replacing <adaptdl namespace> with the namespace in which you installed the adaptdl scheduler).
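As a rough aid for reading that output, the sketch below summarizes pod readiness from the JSON form of the same query (`kubectl -n <adaptdl namespace> get pods -o json`). It is only an illustration under assumptions: the helper names `pod_health` and `not_ready` and the pod names in the sample are hypothetical, and a pod that is not ready is just one possible cause of the "connection refused" webhook error.

```python
import json
from typing import List, Tuple

PodRow = Tuple[str, int, int, int]  # (name, ready containers, total containers, restarts)


def pod_health(kubectl_json: str) -> List[PodRow]:
    """Summarize each pod in `kubectl get pods -o json` output."""
    rows = []
    for pod in json.loads(kubectl_json).get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        ready = sum(1 for s in statuses if s.get("ready"))
        restarts = sum(s.get("restartCount", 0) for s in statuses)
        rows.append((pod["metadata"]["name"], ready, len(statuses), restarts))
    return rows


def not_ready(rows: List[PodRow]) -> List[str]:
    """Names of pods with no reported containers or with a container that is not ready."""
    return [name for name, ready, total, _ in rows if total == 0 or ready < total]


# Hypothetical sample data; the pod names are made up for illustration.
sample = json.dumps({"items": [
    {"metadata": {"name": "example-adaptdl-sched"},
     "status": {"containerStatuses": [
         {"ready": True, "restartCount": 2},
         {"ready": True, "restartCount": 0}]}},
    {"metadata": {"name": "example-validator"},
     "status": {"containerStatuses": [
         {"ready": False, "restartCount": 5}]}},
]})

rows = pod_health(sample)
print(rows)
print(not_ready(rows))  # ['example-validator']
```

A validator pod that shows up in the second list would be the first place to look, since the failing webhook call is addressed to the validator service.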

gudiandian commented 2 years ago

@SHu0421 Hi, have you solved the problem?

aurickq commented 2 years ago

@gudiandian It sounds like this is related to the problem you are having in https://github.com/petuum/adaptdl/issues/124.

SHu0421 commented 2 years ago

@gudiandian I switched from microk8s to a standard k8s instance (with three nodes), and the problem did not occur again. By the way, I used the insecure registry rather than an external registry.

gudiandian commented 2 years ago

Unfortunately, I am already using standard k8s. Thank you for your reply.