Closed dmsuehir closed 5 days ago
@tylertitsworth Reposted from a branch
Opened PR 238
Description
SSH setup has been moved to the multinode base container, so it can be removed from the workflow Dockerfile. Also, so that the torch ccl setvars.sh gets applied, I've switched the k8s PyTorchJob to not override the container's entrypoint command.
Changes Made
Updated the HF workflow pytorchjob.yaml to put the torchrun/python command in `args` instead of `command`
Removed the CCL_ATL_TRANSPORT setting (this defaults to mpi). I tested it both ways, and it performed best with the default (mpi).
Removed ssh installs/setup from the HF workflow Dockerfile, since that's now being done in the multinode IPEX base container
No doc update is needed because the location of the torchrun/python command in the k8s spec is abstracted out by the helm chart. The helm values, etc. stayed the same.
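For reviewers less familiar with the `command` vs. `args` distinction, here is a rough sketch of the container section of a PyTorchJob pod template. In Kubernetes, `command` replaces the image's ENTRYPOINT, while `args` only replaces its CMD, so leaving `command` unset keeps the image entrypoint in place. The image name, script path, and torchrun flag below are placeholders for illustration, not the actual values from pytorchjob.yaml, and the assumption that the image entrypoint is what sources setvars.sh is mine.

```yaml
# Illustrative fragment only; not copied from the real pytorchjob.yaml.
containers:
  - name: pytorch
    image: example.registry/hf-workflow:latest   # placeholder image name
    # No `command:` key, so the image's ENTRYPOINT is preserved
    # (assumed here to be the script that sources the torch ccl setvars.sh).
    args:
      - torchrun
      - --nproc_per_node=1        # placeholder value
      - /workspace/train.py       # placeholder script path
```

With `command:` set instead, the torchrun call would replace the entrypoint and the setvars.sh environment would not be applied.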
[x] The code follows the project's coding standards.
[x] No Intel Internal IP is present within the changes.
[ ] The documentation has been updated to reflect any changes in functionality.
Validation
I tested this with Llama 2 and the medical meadow flashcard dataset to verify training/eval using the CCL backend, with a base container from Sharvil and a rebuilt workflow container.
test_runner.py with all existing tests passing, and I have added new tests where applicable.