This repository contains Dockerfiles, scripts, YAML files, Helm charts, etc. used to scale out AI containers with versions of TensorFlow and PyTorch that have been optimized for Intel platforms. Scaling is done with Python, Docker, Kubernetes, Kubeflow, cnvrg.io, Helm, and other container orchestration frameworks for use in the cloud and on-premises.
Description
SSH setup has been moved to the multinode base image, so it can be removed from the workflow Dockerfile. Also, so that the torch-ccl setvars.sh applies, the k8s PyTorchJob no longer overwrites the container's entrypoint command.
Changes Made
Updated the HF workflow pytorchjob.yaml to put the torchrun/python command in `args` instead of `command`
Removed the explicit CCL_ATL_TRANSPORT setting (it defaults to mpi). I tested both transports, and the default (mpi) performed best
Removed the SSH install/setup from the HF workflow Dockerfile, since that's now done in the multinode IPEX base container
No doc update is needed, because the location of the torchrun/python command in the k8s spec is abstracted away by the Helm chart. The Helm values, etc. stayed the same.
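The PyTorchJob change described above can be sketched as follows. This is a hypothetical fragment, not the chart's actual output: the job name, image, replica count, and training arguments are all illustrative.

```yaml
# Sketch of a PyTorchJob worker spec after the change (illustrative values).
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: hf-finetune          # hypothetical name
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: pytorch
              image: hf-workflow:latest   # hypothetical image
              # No `command:` here: the image's entrypoint runs as built,
              # so it can source the torch-ccl setvars.sh before training.
              args:
                - torchrun
                - --nproc_per_node=1
                - finetune.py             # hypothetical script
```

In Kubernetes, `command` replaces the image's ENTRYPOINT while `args` replaces only CMD, which is why moving the launch line into `args` preserves the entrypoint's environment setup.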
[x] The code follows the project's coding standards.
[x] No Intel Internal IP is present within the changes.
[ ] The documentation has been updated to reflect any changes in functionality.
Validation
I tested this with Llama 2 and the medical meadow flashcard dataset to verify training/eval using the CCL backend, with a base container from Sharvil and a rebuilt workflow container.
[x] I have tested any changes in container groups locally with test_runner.py with all existing tests passing, and I have added new tests where applicable.