Closed merryHunter closed 2 years ago
@merryHunter, are you using GitHub Actions? In that case, you can pin an exact CML version by using the following setup step instead of a container:
- uses: iterative/setup-cml@v1
  with:
    version: 0.14.0
Additionally, this will allow you to use any container image or just remove it altogether.
Hi @0x2b3bfa0, thanks for the references to the issues! No, we are using GitLab CI, which is why we cannot use that option at the moment.
Then, you can try installing one of our binary releases:
curl --location https://github.com/iterative/cml/releases/download/v0.16.1/cml-linux-x64 --output /usr/bin/cml && chmod a+x $_
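For a GitLab CI job, the one-liner above could be wrapped so that the pinned version lives in a single variable. This is only a sketch: the helper names `cml_release_url` and `install_cml` and the `CML_VERSION` variable are made up for illustration, and the URL pattern is assumed to match the release link quoted above for other versions too.

```shell
# Version to pin; 0.16.1 is the release mentioned in the comment above.
CML_VERSION="${CML_VERSION:-0.16.1}"

# Build the download URL for the Linux x64 binary of a given release.
# Hypothetical helper; the pattern is copied from the URL quoted above.
cml_release_url() {
  local version="$1"
  printf 'https://github.com/iterative/cml/releases/download/v%s/cml-linux-x64\n' "$version"
}

# Download the pinned binary and make it executable.
# Hypothetical helper; the target path defaults to /usr/bin/cml as above.
install_cml() {
  local version="$1"
  local target="${2:-/usr/bin/cml}"
  # --fail turns HTTP errors into a non-zero exit code;
  # --location follows the redirect that GitHub release downloads use.
  curl --fail --location --silent --show-error \
    "$(cml_release_url "$version")" --output "$target"
  chmod a+x "$target"
}
```

In a `before_script`, this would be called as `install_cml "$CML_VERSION"`; bumping the version then becomes a one-line change in the job variables.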
I am not aware of the recent changes, but to give a bit more context on the problem we faced: we have a CI pipeline in GitLab where a cml-runner launches training on AWS with a startup script that mounts EFS to access the data. For no apparent reason, our startup script started to fail silently while the cml runner job was reported as successful. After debugging and looking into the script logs on the EC2 instance, we saw errors like E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable) E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?, as we had an apt-get update instruction. We fixed it by adding a sleep command at the beginning, which gives the other process time to finish installing its software, but nothing prevents something similar from happening in the future with the current approach to tagging Docker images.
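A fixed sleep works, but it guesses how long the other process will hold the dpkg lock. A slightly more robust variant, sketched here with a hypothetical `retry` helper that is not part of CML or TPI, retries the command until it succeeds or a limit is reached:

```shell
# Retry a command until it succeeds or the attempts run out.
# Hypothetical helper for a startup script.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: '$*' failed after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example for the startup script (not executed here):
#   retry 30 10 apt-get update
```

With `retry 30 10 apt-get update`, the script would wait up to roughly five minutes for the lock to be released and, unlike the silent failure described above, exit with an error if it never is.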
We also have problems with passing down env variable 'DOCKER_SHM_SIZE=4g', but that's another issue.
Thank you for the detailed description of the issue.
Pinning CML might not suffice to solve this issue, as it depends internally on https://github.com/iterative/terraform-provider-iterative (unpinned) to provision cloud instances. Moreover, machine images aren't pinned either.
The provided startup script runs synchronously. Therefore, your issue can only (?) be caused by an ongoing automatic upgrade.
Exactly, that's the software upgrade problem we identified. However, it also means that since CML depends on TPI (which is a new tool, by the way) and TPI can change in unexpected ways, it would be really great to have at least major releases tagged on Docker Hub. I have read the threads and I see that choosing a tag naming scheme is a hard decision, yet the problem is there.
There is a hidden cml option you can use to pin a tpi/cml version for your created instance.
cml runner ... \
--cml-version="v0.15.2" \
--tpi-version="= 0.10.18" \
...
"fixed it by adding sleep command in the beginning"
@merryHunter we just did the same in https://github.com/iterative/terraform-provider-iterative/pull/621 (what cml runner uses under the hood) so you don't have to :)
@casperdcl @0x2b3bfa0 that's an amazing patch! :) Glad our issue helped identify the problem. Then let's close this issue.
For those who are looking for proper, production-grade container images: there aren't any.
If you use GitHub Actions, use iterative/setup-cml as suggested on https://github.com/iterative/cml/issues/1086#issuecomment-1174123131
If you use GitLab CI/CD or Bitbucket Pipelines, install a standalone binary as suggested on https://github.com/iterative/cml/issues/1086#issuecomment-1174157306
cough --tpi-version
Hi! Currently, the base CML Docker images are rebuilt from the latest code and pushed every day to e.g. docker://ghcr.io/iterative/cml:0-dvc2-base1 or https://hub.docker.com/r/iterativeai/cml/tags. That means there is no way to roll back to a previous version. Unfortunately, recent changes affected our cloud training pipelines and we had to adjust them.
In my opinion, it would be beneficial to have stable, fixed-version Docker images. That would ensure that once we pull one of them, nothing can be updated or broken underneath us.