iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
http://cml.dev
Apache License 2.0

cml cloud runner doesn't receive runner token #227

Closed MaxHuerlimann closed 3 years ago

MaxHuerlimann commented 4 years ago

Hi everybody, I saw some other issues like this, but the instructions there didn't seem to help.

We are trying to set up a CML pipeline using Azure VMs for the heavy lifting. We have adapted the docker-machine command accordingly:

docker-machine create -d azure \
      --azure-subscription-id $(az account show --query "id" -o tsv) \
      --azure-client-id ${AZURE_SP_APP_ID} \
      --azure-client-secret ${AZURE_SP_PASSWORD} \
      --azure-location westeurope \
      --azure-ssh-user cml_runner \
      --azure-size $MACHINE_SIZE \
      --azure-resource-group $RESOURCE_GROUP \
      --azure-vnet $VNET \
      --azure-subnet $SUBNET \
      --azure-open-port 6006 \
      $MACHINE

This works, and so does the subsequent setup of the NVIDIA drivers (following the same steps as in your blog post). But when we then try to run the docker container, the runner is never registered in GitLab. We use this command:

docker run --name runner --gpus all -d \
        -v /docker_machine/machine:/root/.docker/machine \
        -e DOCKER_MACHINE=$MACHINE \
        -e repo_token=$repo_token \
        -e RUNNER_LABELS=$RUNNER_LABELS \
        -e RUNNER_REPO=$CI_PROJECT_URL \
        -e RUNNER_IDLE_TIMEOUT=600 \
        dvcorg/cml-py3

When I tested the setup locally on my machine and ran the docker container in the foreground, I got this output:

Unregistering runner
Runtime platform                                    arch=amd64 os=linux pid=17 revision=86ad88ea version=13.3.0
Running in system-mode.                            

Shutting down docker machine
Error: RUNNER_TOKEN is needed to start the runner. Are you setting a runner?
    at run (/cml/bin/cml-cloud-runner-entrypoint.js:86:11)

So I guess the call for the runner token fails somehow. Is it possible that Azure blocks the request to the GitLab API, or are we missing something obvious here? We checked the repo_token's permissions, so those should be okay. Can some other repository settings interfere here?

If you need more info about our system, feel free to ask.

DavidGOrtega commented 4 years ago

Hi!

Does your repo_token have workflow privileges? A normal GITHUB_TOKEN won't work, since your $repo_token needs the mentioned privileges to generate the RUNNER_TOKEN.

MLOps Tutorial #4: GitHub Actions with your own GPUs

It must have the repo and workflow scopes.

MaxHuerlimann commented 4 years ago

We are using GitLab. We gave it api, read_repository and write_repository permissions, as described in the documentation.

DavidGOrtega commented 4 years ago

And of course you set it up as a CI/CD environment variable, right?

DavidGOrtega commented 4 years ago

Off-topic: I have to warn you that on Azure not all resources are cleaned up completely, and you may have to free the IPs yourself; see #122.

DavidGOrtega commented 4 years ago

Could you check whether the token is accessible in the workflow with a simple echo?

Another question: are you hosting your own GitLab?
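
Since GitLab masks the variable in the job log, an echo only shows that something was printed. A presence check can be more telling. This is a minimal sketch, assuming the token is exposed to the job as repo_token as in this thread; the function name check_token is made up for illustration.

```shell
# Report whether repo_token is set, without ever printing its value.
check_token() {
  if [ -z "${repo_token:-}" ]; then
    echo "repo_token is NOT set"
    return 1
  fi
  # Only the length is shown, so nothing secret reaches the job log.
  echo "repo_token is set (${#repo_token} characters)"
}
```

Calling `check_token || exit 1` at the top of the job script would make the job stop early with a clear message when the variable is missing.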

DavidGOrtega commented 4 years ago

When I tested the setup locally on my machine and run the docker container in the foreground, I get this output:

Can you try

curl --header "Private-Token: ${repo_token}" "${CI_API_V4_URL}/projects/<YOUR_GITLAB_NAME>/<YOUR_PROJECT>"

You should get a JSON response with a runners_token property.

MaxHuerlimann commented 4 years ago

Yes, the repo_token is in the list of environment variables defined for a group of repositories. When I echo it, it comes out masked, so at least it is being read from the variables, and we checked its value there.

No, we aren't hosting our own gitlab.

But I just realized that my local setup couldn't have worked: I wasn't aware that CML internally uses GitLab-provided environment variables, and I didn't go through the effort of actually setting up a local gitlab-runner. So I am not sure whether the initial error message is relevant. In the pipeline it just fails silently, even when the docker run command is run in the foreground.

Both in the pipeline and locally (with the environment variables replaced by the actual values), the curl returns {"error":"404 Not Found"}

MaxHuerlimann commented 4 years ago

If I use the repository ID, such as

curl --header "Private-Token: ${repo_token}" "${CI_API_V4_URL}/projects/<GITLAB_REPO_ID>" 

I receive the desired JSON object with the runners_token property, but if I put in my URL path, let's say ${CI_API_V4_URL}/projects/my-company/my-group-1/my-group-2/my-repo, I don't receive the JSON, just the 404 error.

MaxHuerlimann commented 4 years ago

It also works if I URL encode my repository url.
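
For reference, the encoding step can be sketched in plain shell. This assumes the project path contains only characters that are already URL-safe apart from the slashes (letters, digits, dashes, underscores, dots), so only "/" is replaced. The helper names encode_project_path and project_api_url are hypothetical, and https://gitlab.com/api/v4 stands in for ${CI_API_V4_URL}.

```shell
# Replace every "/" in a namespaced project path with "%2F".
# Assumption: no other character in the path needs encoding.
encode_project_path() {
  printf '%s\n' "$1" | sed 's|/|%2F|g'
}

# Build the project endpoint from an API base URL ($1) and a project path ($2).
project_api_url() {
  printf '%s/projects/%s\n' "$1" "$(encode_project_path "$2")"
}

project_api_url "https://gitlab.com/api/v4" "my-company/my-group-1/my-group-2/my-repo"
# prints: https://gitlab.com/api/v4/projects/my-company%2Fmy-group-1%2Fmy-group-2%2Fmy-repo
```

The resulting URL can then be passed to the same curl call with the Private-Token header.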

DavidGOrtega commented 4 years ago

In your private environment you can do

docker run --rm --name runner \
    -e RUNNER_LABELS=gpu \
    -e RUNNER_IDLE_TIMEOUT=120 \
    -e repo_token=XXXXXXXXXXXXX \
    -e RUNNER_REPO=https://gitlab.com/DavidGOrtega/3_tensorboard/ \
    dvcorg/cml:latest

Just so we don't mix up what's going on locally with what's going on in the pipeline: can you please run a curl and then the docker command in the foreground inside the pipeline, and check the output? (Just remove the detach option.)

curl $CI_PROJECT_URL
docker run --name runner --gpus all \
        -v /docker_machine/machine:/root/.docker/machine \
        -e DOCKER_MACHINE=$MACHINE \
        -e repo_token=$repo_token \
        -e RUNNER_LABELS=$RUNNER_LABELS \
        -e RUNNER_REPO=$CI_PROJECT_URL \
        -e RUNNER_IDLE_TIMEOUT=10 \
        dvcorg/cml-py3

Also... could you share your project URL? The code is expecting something like

${CI_API_V4_URL}/projects/${owner}%2F${repo}

Is it formed that way? I ask because of this:

${CI_API_V4_URL}/projects/my-company/my-group-1/my-group-2/my-repo I don't receive the json, just the error 404.
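
The two-segment owner/repo assumption breaks for repositories nested in groups. One way around it, sketched here under the assumption that RUNNER_REPO is a plain https URL like the ones in this thread, is to treat everything after the host as the project path. repo_path_from_url is a hypothetical helper, not CML's actual code.

```shell
# Extract the full namespaced project path from a repository URL.
# Strips the scheme and host, then a trailing "/" and a trailing ".git".
repo_path_from_url() {
  printf '%s\n' "$1" | sed -e 's|^[a-z]*://[^/]*/||' -e 's|/$||' -e 's|\.git$||'
}

repo_path_from_url "https://gitlab.com/DavidGOrtega/3_tensorboard/"
# prints: DavidGOrtega/3_tensorboard
repo_path_from_url "https://gitlab.com/my-company/my-group-1/my-group-2/my-repo"
# prints: my-company/my-group-1/my-group-2/my-repo
```

The extracted path still has to be URL-encoded before it is placed after /projects/ in the API URL.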

MaxHuerlimann commented 4 years ago

I will try the proposals when I am back on my computer.

Our URL is of the form of my example. The repo is contained in two groups, therefore not of the form the code expects. I just checked the code and that probably is the problem. Would it maybe be easier to use the repository ID? The API accepts that too and there wouldn't be any problems in the case of repos contained in groups like ours.
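
Inside a GitLab CI job the numeric ID is available as the predefined variable CI_PROJECT_ID, and the full namespaced path as CI_PROJECT_PATH. A sketch of preferring the ID and falling back to the encoded path could look like this; project_ref is a hypothetical helper and says nothing about how CML itself resolves the project.

```shell
# Print a value usable as :id in ${CI_API_V4_URL}/projects/:id.
# Prefers the numeric project ID; otherwise falls back to the
# namespaced path with "/" encoded as "%2F".
project_ref() {
  if [ -n "${CI_PROJECT_ID:-}" ]; then
    printf '%s\n' "$CI_PROJECT_ID"
  else
    printf '%s\n' "${CI_PROJECT_PATH:-}" | sed 's|/|%2F|g'
  fi
}

# Possible usage inside a job (not executed here):
# curl --header "Private-Token: ${repo_token}" "${CI_API_V4_URL}/projects/$(project_ref)"
```

The ID route avoids the encoding question entirely, which is why it is robust for repositories nested in several groups.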

DavidGOrtega commented 4 years ago

Our URL is of the form of my example.

So that would be the error then.

Would it maybe be easier to use the repository ID?

That would definitely be a solution. However, we are following GitLab's API specs.

I'm confirming it and raising a ticket to fix it, thanks a lot for all the support. 😃

MaxHuerlimann commented 4 years ago

Sounds good! Do you still need me to run your proposals?

DavidGOrtega commented 3 years ago

@MaxHuerlimann sorry for the late reply. I'm working on this today; it would be awesome if you could check it during the PR.

MaxHuerlimann commented 3 years ago

@DavidGOrtega Yes, just let me know when it is ready

DavidGOrtega commented 3 years ago

@MaxHuerlimann it's ready, can you please check?

MaxHuerlimann commented 3 years ago

It is still running, but in a first quick test it seemed to be working. Thanks a lot for the help and the development of the tool! :)