iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
http://cml.dev
Apache License 2.0

Request: provide versioned docker images in ghcr/dockerhub #1086

Closed merryHunter closed 2 years ago

merryHunter commented 2 years ago

Hi! Currently, the base CML Docker images are rebuilt from the latest code and pushed every day to e.g. docker://ghcr.io/iterative/cml:0-dvc2-base1 or https://hub.docker.com/r/iterativeai/cml/tags. That means there is no way to roll back to a previous version. Unfortunately, recent changes affected our cloud training pipelines and we had to adjust them.

In my opinion, it would be beneficial to have stable, fixed-version Docker images. That would ensure that, once we pin one of them, later changes cannot update or break it.
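Until versioned tags exist, one possible workaround is to pin an image by its content digest rather than by a moving tag. The sketch below is not part of the original report; `image_digest` is a hypothetical helper name, and it assumes the Docker CLI is available and the image has already been pulled.

```shell
#!/bin/sh
# Hypothetical helper: print the immutable digest reference
# (repository@sha256:...) of an image that has already been pulled.
image_digest() {
  docker image inspect --format '{{index .RepoDigests 0}}' "$1"
}

# Example use (commented out so that sourcing this file has no side effects):
# docker pull ghcr.io/iterative/cml:0-dvc2-base1
# image_digest ghcr.io/iterative/cml:0-dvc2-base1
```

The printed reference can then be used wherever the tag was used. It keeps resolving to the same image only as long as the registry retains that manifest, so this is a stopgap rather than a substitute for real version tags.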

0x2b3bfa0 commented 2 years ago

Related

0x2b3bfa0 commented 2 years ago

@merryHunter, are you using GitHub Actions? In that case, you can pin an exact CML version by using the following setup step instead of a container:

- uses: iterative/setup-cml@v1
  with:
    version: 0.14.0

Additionally, this will allow you to use any container image or just remove it altogether.

merryHunter commented 2 years ago

Hi @0x2b3bfa0, thanks for the references to the issues! No, we are using GitLab CI, which is why we cannot use that setup step at the moment.

0x2b3bfa0 commented 2 years ago

Then, you can try installing one of our binary releases:

curl --location https://github.com/iterative/cml/releases/download/v0.16.1/cml-linux-x64 --output /usr/bin/cml && chmod a+x /usr/bin/cml
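
For a CI job that should stay on one release, the same download can be wrapped so that the version is stated in a single place. This is a minimal sketch rather than an official installer: `cml_release_url` and `install_cml` are hypothetical names, and it assumes other releases publish an asset named `cml-linux-x64` under the same URL pattern as v0.16.1.

```shell
#!/bin/sh
# Hypothetical helper: build the download URL for a given release tag.
# Assumes every release follows the URL pattern of the v0.16.1 asset.
cml_release_url() {
  echo "https://github.com/iterative/cml/releases/download/$1/cml-linux-x64"
}

# Hypothetical helper: download a pinned CML binary and make it executable.
# Usage: install_cml VERSION [DESTINATION]
install_cml() {
  version=$1
  destination=${2:-/usr/bin/cml}
  # --fail: exit non-zero on HTTP errors instead of saving the error page.
  # --location: follow the redirect that GitHub issues for release assets.
  curl --fail --location --silent --show-error \
    "$(cml_release_url "$version")" --output "$destination" &&
    chmod a+x "$destination"
}

# Example use (commented out so that sourcing this file has no side effects):
# install_cml v0.16.1
```

In GitLab CI, a call like `install_cml v0.16.1` could run from the job's `before_script`, so the image itself no longer decides which CML version is used.
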
merryHunter commented 2 years ago

I am not aware of the recent changes, but to give a bit more context on the problem we faced: we have a CI pipeline in GitLab where a cml-runner launches training on AWS with a startup script that mounts EFS to access the data. For no apparent reason, our startup script started to fail silently while the cml runner job succeeded. After debugging and looking into the script logs on the EC2 instance, we saw errors like E: Could not get lock /var/lib/dpkg/lock – open (11: Resource temporarily unavailable) E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?, because the script contains an apt-get update instruction. We fixed it by adding a sleep command at the beginning, which gives the other process time to finish installing its software, but still, nothing prevents something similar from happening in the future with the current approach to tagging Docker images.

We also have problems passing down the env variable 'DOCKER_SHM_SIZE=4g', but that's another issue.
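
A fixed sleep only helps if the competing process finishes within that window. A retry loop is one alternative for the dpkg lock error described above. The sketch below is an illustration, not what the reporter or the maintainers actually deployed; `retry` is a hypothetical helper, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: re-run a command until it succeeds or attempts run out.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example use in a startup script (commented out so that sourcing this file
# has no side effects): try apt-get update up to 30 times, 10 seconds apart.
# retry 30 10 apt-get update
```

Unlike a silent failure, the helper returns a non-zero status once the attempts are exhausted, so the startup script can stop instead of continuing without its packages.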

0x2b3bfa0 commented 2 years ago

Thank you for the detailed description of the issue. 🙏

Pinning CML might not suffice to solve this issue, as it depends internally on https://github.com/iterative/terraform-provider-iterative (unpinned) to provision cloud instances. Moreover, machine images aren't pinned either.

0x2b3bfa0 commented 2 years ago

The provided startup script runs synchronously. Therefore, your issue can only (?) be caused by an ongoing automatic upgrade. 🤔

merryHunter commented 2 years ago

Exactly, that's the software upgrade problem we identified. However, it only shows that, since CML depends on TPI (which is a new tool, btw) and TPI can change in unexpected ways, it would be really great to have at least major releases tagged on Docker Hub. I have read the threads and I see that choosing a tag naming scheme is a hard decision, yet the problem is there.

dacbd commented 2 years ago

There are hidden cml runner options you can use to pin the TPI and CML versions for your created instance.

cml runner ... \
    --cml-version="v0.15.2" \
    --tpi-version="= 0.10.18" \
...

casperdcl commented 2 years ago

fixed it by adding sleep command in the beginning

@merryHunter we just did the same in https://github.com/iterative/terraform-provider-iterative/pull/621 (what cml runner uses under-the-hood) so you don't have to :)

merryHunter commented 2 years ago

@casperdcl @0x2b3bfa0 that's an amazing patch! :) Glad our issue helped identify the problem. Then let's close this issue.

0x2b3bfa0 commented 2 years ago

For those who are looking for proper, production-grade container images: there aren't any.

See also

dacbd commented 2 years ago

cough --tpi-version

On Mon, Jul 4, 2022, 12:57 Helio Machado @.***> wrote:

Thank you for the detailed description of the issue. 🙏

Pinning CML might not suffice to solve this issue, as it depends internally on https://github.com/iterative/terraform-provider-iterative (unpinned) to provision cloud instances. Moreover machine images aren't pinned either.
