
Use of kernel 5.4 in base AWS image #1438

Closed: OLSecret closed this issue 6 months ago

OLSecret commented 6 months ago

Summary / Background

I'm running a g5.12xlarge (4 GPUs) with the Hugging Face Trainer and Accelerate, and I'm hitting an old-kernel issue: version 5.5 is needed, but the image provides 5.4.

Scope

Accelerate fails with:

Map: 100%|██████████| 4505/4505 [00:01<00:00, 3794.45 examples/s]
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
You are adding a <class 'transformers.integrations.integration_utils.WandbCallback'> to the callbacks of this Trainer, but there is already one. The current list of callbacks is
:DefaultFlowCallback
WandbCallback

  0%|          | 0/9950 [00:00<?, ?it/s]Bus error (core dumped)
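
For reference, you can confirm which kernel the runner is actually on by printing it at the start of the training step (the version string in the comment is only illustrative):

    uname -r    # prints something like 5.4.0-XXXX-aws on the stock AWS base image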

I start the instance like so:

...
      - name: Deploy runner on EC2
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.CML_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.CML_AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner launch \
              --cloud=aws \
              --cloud-region=us-west-2 \
              --cloud-gpu=v100 \
              --cloud-hdd-size=125 \
              --cloud-type=g5.12xlarge \
              --labels=cml-gpu
  run:
    needs: launch-runner
    runs-on: [ cml-gpu ]
    container:
      image: docker://iterativeai/cml:0-dvc2-base1-gpu
      options: --gpus all  --network=host
    timeout-minutes: 2800 # ~2 days
    permissions: write-all
    steps:
      - uses: actions/setup-node@v1
        with:
          node-version: '16'
      - uses: actions/checkout@v3
      - name: Set up Python 3.10
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Train models
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          OPENAI_SECRET_KEY: ${{ secrets.OPENAI_SK }}
        run: |
          apt update && apt install -y libc6 zip python-packaging git
          nvidia-smi
          pip install --upgrade pip
          pip install packaging torch==2.1.1 --index-url https://download.pytorch.org/whl/cu118
          pip install -r backend/requirements_training.txt
          pip install git+https://github.com/huggingface/accelerate
          pip install git+https://github.com/huggingface/transformers
          echo "# CML report" >> train_report.md
          wandb login ${{ secrets.WANDB_KEY }}
          cml comment update --watch train_report.md &
          python -m backend.management.commands.train_models_experimental
...

How can we get AWS to boot a base image with a fresher kernel underneath the CML container?

0x2b3bfa0 commented 6 months ago

Hello, @OLSecret! Consider using the undocumented cml runner launch --cloud-image option to choose a more recent machine image. This option accepts any valid AWS AMI identifier for the --cloud-region you've chosen.
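
For illustration, a sketch of what that could look like (the AMI ID below is a placeholder, not a real image; substitute any us-west-2 AMI that ships kernel 5.5 or newer, e.g. a recent Ubuntu 22.04 one):

    cml runner launch \
        --cloud=aws \
        --cloud-region=us-west-2 \
        --cloud-type=g5.12xlarge \
        --cloud-hdd-size=125 \
        --cloud-image=ami-0123456789abcdef0 \
        --labels=cml-gpu

If you don't have an AMI ID handy, a query like this one finds the most recent Canonical Ubuntu 22.04 image in the region (099720109477 is Canonical's public AWS owner ID):

    aws ec2 describe-images \
        --region us-west-2 \
        --owners 099720109477 \
        --filters 'Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*' \
        --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
        --output text

Keep in mind that --cloud-image is undocumented, so its behavior may change between CML releases.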

OLSecret commented 6 months ago

Nice, thank you. I will try that.