pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai
BSD 3-Clause "New" or "Revised" License

Improve versions update for docker building #1879

Open vfdev-5 opened 3 years ago

vfdev-5 commented 3 years ago

cc @trsvchn @ydcjeff

ydcjeff commented 3 years ago

@vfdev-5 any idea how we can import YAML inside YAML? I found out that CircleCI YAML is just vanilla YAML, so we can't do it the normal way.

Two options I have found so far.

I haven't tried option 1, but I have tried option 2. Directory structure:

```
.circleci/src
├── commands
│   ├── install_dependencies.yml
│   ├── install_latest_nvidia.yml
│   ├── pull_pytorch_stable_devel_image.yml
│   ├── pull_pytorch_stable_image.yml
│   ├── run_pytorch_container.yml
│   └── run_pytorch_devel_container.yml
├── config.yml
├── executors
│   ├── one_gpu.yml
│   ├── one_gpu_windows.yml
│   └── two_gpus.yml
└── jobs
    ├── build_publish_docker_images.yml
    ├── one_gpu_tests.yml
    ├── one_gpu_windows_tests.yml
    ├── two_gpus_check_dist_cifar10_example.yml
    ├── two_gpus_hvd_tests.yml
    └── two_gpus_tests.yml

3 directories, 16 files
```

The folder names correspond to the keys we defined in .circleci/config.yml (jobs contains the jobs we will run; the same goes for commands and executors). What I don't like is that this creates many files, each containing only a small number of commands. What do you think, though? @vfdev-5 @trsvchn
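For reference, the CircleCI CLI can assemble such a split tree back into a single file. A minimal sketch, assuming the layout above follows the `circleci config pack` convention (the command is part of the official CLI, but that it is what produced the output below is my assumption):

```bash
# Pack the .circleci/src tree into a single config.yml;
# directory and file names become the top-level YAML keys
circleci config pack .circleci/src > .circleci/config.yml

# Validate the generated config before committing it
circleci config validate .circleci/config.yml
```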

What's inside the above config.yml:

```yaml
version: 2.1

parameters:
  pytorch_stable_image:
    type: string
    # https://hub.docker.com/r/pytorch/pytorch/tags
    default: "pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime"
  pytorch_stable_image_devel:
    type: string
    # https://hub.docker.com/r/pytorch/pytorch/tags
    default: "pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel"
  workingdir:
    type: string
    default: "/tmp/ignite"
  should_build_docker_images:
    type: boolean
    default: false
  should_publish_docker_images:
    type: boolean
    default: false
  build_docker_image_pytorch_version:
    type: string
    default: "1.8.1-cuda11.1-cudnn8"
  build_docker_image_hvd_version:
    type: string
    default: "v0.21.3"
  build_docker_image_msdp_version:
    type: string
    default: "v0.3.10"

workflows:
  version: 2
  gpu_tests:
    unless: << pipeline.parameters.should_build_docker_images >>
    jobs:
      - one_gpu_tests
      - one_gpu_windows_tests
      - two_gpus_tests
      - two_gpus_check_dist_cifar10_example
      - two_gpus_hvd_tests
  docker_images:
    when: << pipeline.parameters.should_build_docker_images >>
    jobs:
      - build_publish_docker_images
```

Here's the option 2 output:

```yaml
commands:
  install_dependencies:
    steps:
      - run:
          command: |
            docker exec -it pthd pip install -r requirements-dev.txt
            export install_apex_cmd='pip install -v --disable-pip-version-check --no-cache-dir git+https://github.com/NVIDIA/apex'
            export install_git_apex_cmd="apt-get update && apt-get install -y --no-install-recommends git && ${install_apex_cmd}"
            docker exec -it pthd /bin/bash -c "$install_git_apex_cmd"
            export install_ignite_cmd='python setup.py install'
            docker exec -it pthd /bin/bash -c "$install_ignite_cmd"
          name: Install dependencies
  install_latest_nvidia:
    steps:
      - run:
          command: |
            sudo apt-get purge nvidia* && sudo apt-get autoremove
            sudo apt-get update && sudo apt-get install -y --no-install-recommends nvidia-455 cuda-drivers-455
            # Install nvidia-container-runtime
            sudo apt-get install -y nvidia-container-runtime
            # Reload driver : https://stackoverflow.com/a/45319156/6309199
            # lsof | grep nvidia -> kill Xvfb
            sudo lsof | grep "/usr/bin/Xvfb" | head -1 | awk '{print $2}' | xargs -I {} sudo kill -9 {}
            # lsmod | grep nvidia
            sudo rmmod nvidia_uvm && sudo rmmod nvidia_drm && sudo rmmod nvidia_modeset && sudo rmmod nvidia
            # reload driver
            nvidia-smi
          name: Install latest NVidia-driver and CUDA
  pull_pytorch_stable_devel_image:
    steps:
      - run:
          command: |
            docker pull << pipeline.parameters.pytorch_stable_image_devel >>
          name: Pull PyTorch Stable Develop Image
  pull_pytorch_stable_image:
    steps:
      - run:
          command: |
            docker pull << pipeline.parameters.pytorch_stable_image >>
          name: Pull PyTorch Stable Image
  run_pytorch_container:
    steps:
      - run:
          command: |
            docker run --gpus=all --rm -itd --shm-size 16G -v ${wd}:/ignite -w /ignite --name pthd << pipeline.parameters.pytorch_stable_image >>
            docker exec -it pthd nvidia-smi
            docker exec -it pthd ls
          environment:
            wd: << pipeline.parameters.workingdir >>
          name: Start Pytorch container
  run_pytorch_devel_container:
    steps:
      - run:
          command: |
            docker run --gpus=all --rm -itd --shm-size 16G -v ${wd}:/ignite -w /ignite --name pthd << pipeline.parameters.pytorch_stable_image_devel >>
            docker exec -it pthd nvidia-smi
            docker exec -it pthd ls
          environment:
            wd: << pipeline.parameters.workingdir >>
          name: Start Pytorch dev container
executors:
  one_gpu:
    machine:
      docker_layer_caching: true
      image: ubuntu-1604-cuda-11.1:202012-01
    resource_class: gpu.small
  one_gpu_windows:
    machine:
      image: windows-server-2019-nvidia:stable
    resource_class: windows.gpu.nvidia.medium
    shell: bash.exe
  two_gpus:
    machine:
      docker_layer_caching: true
      image: ubuntu-1604-cuda-11.1:202012-01
    resource_class: gpu.medium
jobs:
  build_publish_docker_images:
    docker:
      - image: cimg/python:3.8.8
    resource_class: 2xlarge
    steps:
      - checkout
      - setup_remote_docker:
          docker_layer_caching: true
          version: 19.03.14
      - run:
          command: |
            pip --version
            pip install docker
          name: Install deps
      - run:
          command: |
            cd docker
            export PTH_VERSION=<< pipeline.parameters.build_docker_image_pytorch_version >>
            export HVD_VERSION=<< pipeline.parameters.build_docker_image_hvd_version >>
            bash build.sh hvd hvd-base
            bash build.sh hvd hvd-vision
            bash build.sh hvd hvd-nlp
            bash build.sh hvd hvd-apex
            bash build.sh hvd hvd-apex-vision
            bash build.sh hvd hvd-apex-nlp
          name: Build all Horovod flavoured PyTorch-Ignite images
      - run:
          command: |
            cd docker
            export PTH_VERSION=<< pipeline.parameters.build_docker_image_pytorch_version >>
            bash build.sh main base
            bash build.sh main vision
            bash build.sh main nlp
            bash build.sh main apex
            bash build.sh main apex-vision
            bash build.sh main apex-nlp
          name: Build all PyTorch-Ignite images
      - run:
          command: |
            cd docker
            export PTH_VERSION=<< pipeline.parameters.build_docker_image_pytorch_version >>
            export MSDP_VERSION=<< pipeline.parameters.build_docker_image_msdp_version >>
            bash build.sh msdp msdp-apex
            bash build.sh msdp msdp-apex-vision
            bash build.sh msdp msdp-apex-nlp
          name: Build all MS DeepSpeed flavoured PyTorch-Ignite images
      - run:
          command: docker images | grep pytorchignite
          name: List built images
      - when:
          condition: << pipeline.parameters.should_publish_docker_images >>
          steps:
            - run:
                command: |
                  cd docker
                  sh ./push_all.sh
                name: Push all PyTorch-Ignite Docker images
    working_directory: << pipeline.parameters.workingdir >>
  one_gpu_tests:
    executor: one_gpu
    steps:
      - checkout
      - run:
          command: |
            bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
          name: Trigger job if modified
      - pull_pytorch_stable_image
      - run_pytorch_container
      - install_dependencies
      - run:
          command: |4
              # pytest on cuda
              export test_cmd='bash tests/run_gpu_tests.sh'
              docker exec -it pthd /bin/bash -c "${test_cmd}"

              # MNIST tests
              # 0) download MNIST
              # https://github.com/pytorch/ignite/issues/1737
              export raw_mnist_dir='./MNIST/raw'
              export download_mnist_cmd="git clone https://github.com/pytorch-ignite/download-mnist-github-action.git $raw_mnist_dir"
              docker exec -it pthd /bin/bash -c "$download_mnist_cmd"
              export mnist0_cmd="CUDA_VISIBLE_DEVICES=0 python $raw_mnist_dir/run.py ."
              docker exec -it pthd /bin/bash -c "$mnist0_cmd"

              # 1) mnist.py
              export minst1_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist.py --epochs=1'
              docker exec -it pthd /bin/bash -c "$minst1_cmd"

              # 2) mnist_with_visdom.py
              export visdom_script_cmd='python -c "from visdom.server import download_scripts; download_scripts()"'
              export visdom_cmd='python -m visdom.server'
              docker exec -d pthd /bin/bash -c "$visdom_script_cmd && $visdom_cmd"
              export sleep_cmd='sleep 10'
              export mnist2_cmd='python examples/mnist/mnist_with_visdom.py --epochs=1'
              docker exec -it pthd /bin/bash -c "$sleep_cmd && $mnist2_cmd"

              # 3.1) mnist_with_tensorboard.py with tbX
              export mnist3_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist_with_tensorboard.py --epochs=1'
              docker exec -it pthd /bin/bash -c "$mnist3_cmd"
              # uninstall tensorboardX
              export pip_cmd='pip uninstall -y tensorboardX'
              docker exec -it pthd /bin/bash -c "$pip_cmd"
              # 3.2) mnist_with_tensorboard.py with native torch tb
              docker exec -it pthd /bin/bash -c "$mnist3_cmd"

              # 4) mnist_save_resume_engine.py
              # save
              export mnist4_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist_save_resume_engine.py --epochs=2 --crash_iteration 1100'
              docker exec -it pthd /bin/bash -c "$mnist4_cmd"
              # resume
              export mnist4_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist_save_resume_engine.py --epochs=2 --resume_from=/tmp/mnist_save_resume/checkpoint_1.pt'
              docker exec -it pthd /bin/bash -c "$mnist4_cmd"
          name: Run GPU Unit Tests and Examples
      - run:
          command: |
            bash <(curl -s https://codecov.io/bash) -Z -F gpu
          name: Codecov upload
    working_directory: << pipeline.parameters.workingdir >>
  one_gpu_windows_tests:
    executor: one_gpu_windows
    steps:
      - checkout
      - run:
          command: |
            bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
          name: Trigger job if modified
      - run:
          command: |
            conda --version
            conda install -y pytorch torchvision cudatoolkit=11.1 -c pytorch -c conda-forge
            pip install -r requirements-dev.txt
            pip install .
          name: Install dependencies
      - run:
          command: |
            # pytest on cuda
            SKIP_DISTRIB_TESTS=1 bash tests/run_gpu_tests.sh
          name: Run GPU Unit Tests
    working_directory: << pipeline.parameters.workingdir >>
  two_gpus_check_dist_cifar10_example:
    executor: two_gpus
    steps:
      - checkout
      - run:
          command: |
            bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
          name: Trigger job if modified
      - pull_pytorch_stable_image
      - run_pytorch_container
      - install_dependencies
      - run:
          command: |
            docker exec -it pthd pip install fire
          name: Install additional example dependencies
      - run:
          command: |
            export example_path="examples/contrib/cifar10"
            # initial run
            export stop_cmd="--stop_iteration=500"
            export test_cmd="CI=1 python ${example_path}/main.py run --checkpoint_every=200"
            docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
            # resume
            export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-None-1_stop-on-500/training_checkpoint_400.pt"
            docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
          name: Run without backend
      - run:
          command: |
            export example_path="examples/contrib/cifar10"
            # initial run
            export stop_cmd="--stop_iteration=500"
            export test_cmd="CI=1 python -u -m torch.distributed.launch --nproc_per_node=2 --use_env ${example_path}/main.py run --backend=nccl --checkpoint_every=200"
            docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
            # resume
            export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-500/training_checkpoint_400.pt"
            docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
          name: Run with NCCL backend using torch dist launch
      - run:
          command: |
            export example_path="examples/contrib/cifar10"
            # initial run
            export stop_cmd="--stop_iteration=500"
            export test_cmd="CI=1 python -u ${example_path}/main.py run --backend=nccl --nproc_per_node=2 --checkpoint_every=200"
            docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
            # resume
            export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-500/training_checkpoint_400.pt"
            docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
          name: Run with NCCL backend using spawn
    working_directory: << pipeline.parameters.workingdir >>
  two_gpus_hvd_tests:
    executor: two_gpus
    steps:
      - checkout
      - run:
          command: |
            bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
          name: Trigger job if modified
      - pull_pytorch_stable_devel_image
      - run_pytorch_devel_container
      - install_dependencies
      - run:
          command: |4
              # Following https://github.com/horovod/horovod/blob/master/Dockerfile.test.gpu
              # and https://github.com/horovod/horovod/issues/1944#issuecomment-628192778
              docker exec -it pthd /bin/bash -c "apt-get update && apt-get install -y git"
              docker exec -it pthd /bin/bash -c "git clone --recursive https://github.com/horovod/horovod.git /horovod && cd /horovod && python setup.py sdist"
              docker exec -it pthd /bin/bash -c "conda install -y cmake nccl=2.8 -c conda-forge"
              docker exec -it pthd /bin/bash -c 'cd /horovod && HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_LINK=SHARED HOROVOD_WITHOUT_MPI=1 HOROVOD_WITH_PYTORCH=1 pip install -v $(ls /horovod/dist/horovod-*.tar.gz) && ldconfig'
              docker exec -it pthd horovodrun --check-build
          name: Install Horovod with NCCL GPU ops
      - run:
          command: |
            export test_cmd='bash tests/run_gpu_tests.sh'
            docker exec -it pthd /bin/bash -c "${test_cmd}"
            # no CUDA devices Horovod tests
            export test_cmd='CUDA_VISIBLE_DEVICES="" pytest --cov ignite --cov-append --cov-report term-missing --cov-report xml -vvv tests/ -m distributed'
            docker exec -it pthd /bin/bash -c "${test_cmd}"
          name: Run 1 Node 2 GPUs Unit Tests
      - run:
          command: |
            bash <(curl -s https://codecov.io/bash) -Z -F gpu-2-hvd
          name: Codecov upload
      - run:
          command: |
            docker exec -it pthd pip install fire
            export example_path="examples/contrib/cifar10"
            # initial run
            export stop_cmd="--stop_iteration=500"
            export test_cmd="cd ${example_path} && CI=1 horovodrun -np 2 python -u main.py run --backend=horovod --checkpoint_every=200"
            docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
            # resume
            export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-horovod-2_stop-on-500/training_checkpoint_400.pt"
            docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
          name: Check CIFAR10 using horovodrun
      - run:
          command: |
            export example_path="examples/contrib/cifar10"
            # initial run
            export stop_cmd="--stop_iteration=500"
            export test_cmd="cd ${example_path} && CI=1 python -u main.py run --backend=horovod --nproc_per_node=2 --checkpoint_every=200"
            docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
            # resume
            export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-horovod-2_stop-on-500/training_checkpoint_400.pt"
            docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
          name: Check CIFAR10 using spawn
    working_directory: << pipeline.parameters.workingdir >>
  two_gpus_tests:
    executor: two_gpus
    steps:
      - checkout
      - run:
          command: |
            bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
          name: Trigger job if modified
      - pull_pytorch_stable_image
      - run_pytorch_container
      - install_dependencies
      - run:
          command: |
            export test_cmd='bash tests/run_gpu_tests.sh 2'
            docker exec -it pthd /bin/bash -c "${test_cmd}"
          name: Run 1 Node 2 GPUs Unit Tests
      - run:
          command: |
            bash <(curl -s https://codecov.io/bash) -Z -F gpu-2
          name: Codecov upload
    working_directory: << pipeline.parameters.workingdir >>
parameters:
  build_docker_image_hvd_version:
    default: v0.21.3
    type: string
  build_docker_image_msdp_version:
    default: v0.3.10
    type: string
  build_docker_image_pytorch_version:
    default: 1.8.1-cuda11.1-cudnn8
    type: string
  pytorch_stable_image:
    default: pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
    type: string
  pytorch_stable_image_devel:
    default: pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel
    type: string
  should_build_docker_images:
    default: false
    type: boolean
  should_publish_docker_images:
    default: false
    type: boolean
  workingdir:
    default: /tmp/ignite
    type: string
version: 2.1
workflows:
  docker_images:
    jobs:
      - build_publish_docker_images
    when: << pipeline.parameters.should_build_docker_images >>
  gpu_tests:
    jobs:
      - one_gpu_tests
      - one_gpu_windows_tests
      - two_gpus_tests
      - two_gpus_check_dist_cifar10_example
      - two_gpus_hvd_tests
    unless: << pipeline.parameters.should_build_docker_images >>
  version: 2
```
trsvchn commented 3 years ago

> any idea how we can import YAML inside YAML?

I think only GitLab has that feature.

GitLab has a similar feature: the `include` keyword includes workflow templates, and in addition the `extends` keyword can be used to share small bits of YAML within the same file.

vfdev-5 commented 3 years ago

@ydcjeff thanks for providing these options! Yes, there are pros and cons to all those approaches. Maybe a third approach is to read the Docker values with something like:

python -c "import yaml; f=open('.circleci/config.yml'); d=yaml.safe_load(f); print(d['parameters']['build_docker_image_pytorch_version']['default'])"
vfdev-5 commented 3 years ago

@trsvchn or @ydcjeff would one of you like to take on this issue? I'd like to build new Docker images this week.

EDIT: We can probably do that manually for now, until the issue is solved.

trsvchn commented 3 years ago

@vfdev-5 I have another idea, but I have zero experience with CircleCI. Can we do something like this?

Simply use a Makefile with the versions defined:

```makefile
# Makefile

BUILD_DOCKER_IMAGE_PYTORCH_VERSION = 1.8.1-cuda11.1-cudnn8
BUILD_DOCKER_IMAGE_HVD_VERSION = v0.21.3
BUILD_DOCKER_IMAGE_MSDP_VERSION = v0.3.10

get_build_docker_image_pytorch_version:
	@echo $(BUILD_DOCKER_IMAGE_PYTORCH_VERSION)

get_build_docker_image_hvd_version:
	@echo $(BUILD_DOCKER_IMAGE_HVD_VERSION)

get_build_docker_image_msdp_version:
	@echo $(BUILD_DOCKER_IMAGE_MSDP_VERSION)
```

Then use it inside the CircleCI config (if possible):

```
# to get the pytorch version
build_docker_image_pytorch_version = make get_build_docker_image_pytorch_version
...
```

And the same for GHA:

```bash
export PTH_VERSION=`make get_build_docker_image_pytorch_version`
```
vfdev-5 commented 3 years ago

Yes, we can do something like that, but I'm not a fan of adding yet another scripting language on top of bash and Python... We could consider https://github.com/pydoit/doit or plain Python for that if needed.

trsvchn commented 3 years ago

Yeah, I agree, Makefile is not a very obvious tool. There is also "the strangely familiar workflow utility" from Ken Reitz: https://github.com/kenreitz42/bake

vfdev-5 commented 3 years ago

No, let's keep things without new deps

trsvchn commented 3 years ago

> @ydcjeff thanks for providing these options! Yes, there are pros and cons to all those approaches. Maybe a third approach is to read the Docker values with something like:
>
> ```bash
> python -c "import yaml; f=open('.circleci/config.yml'); d=yaml.safe_load(f); print(d['parameters']['build_docker_image_pytorch_version']['default'])"
> ```

Another idea is to add these lines to a new docker.cfg INI file; then there is no need for PyYAML, and the values are plain strings:

```ini
[DEFAULT]
build_docker_image_pytorch_version = 1.8.1-cuda11.1-cudnn8
build_docker_image_hvd_version = v0.21.3
build_docker_image_msdp_version = v0.3.10
```

Then:

python -c "import configparser; print(configparser.ConfigParser().read('docker.cfg')['DEFAULT']['build_docker_image_pytorch_version'])"
vfdev-5 commented 3 years ago

Sounds good @trsvchn