dmsuehir commented 2 weeks ago

Description

SSH setup has been moved to the multinode base container, so it can be removed from the workflow Dockerfile. Also, so that the torch ccl setvars.sh gets applied, I've switched the k8s PyTorchJob to not overwrite the container entrypoint.

Changes Made

Updated the HF workflow pytorchjob.yaml to put the torchrun/python command in args instead of command
Removed the CCL_ATL_TRANSPORT setting, so it falls back to its default (mpi). I tested it both ways, and the default (mpi) performed best
Removed ssh installs/setup from the HF workflow Dockerfile, since that's now being done in the multinode IPEX base container
No doc update is needed because the location of the torchrun/python command in the k8s spec is abstracted out by the helm chart. The helm values, etc. stayed the same.
[x] The code follows the project's coding standards.
[x] No Intel Internal IP is present within the changes.
[ ] The documentation has been updated to reflect any changes in functionality.
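
For reviewers less familiar with the distinction: in a Kubernetes container spec, `command` replaces the image's ENTRYPOINT, while `args` replaces only its CMD and is passed to the entrypoint. A minimal sketch of the resulting shape is below. It is not copied from the chart: the job name, image, replica count, script name, and torchrun flags are placeholders, and it assumes the base image's entrypoint sources setvars.sh and then runs whatever arguments it receives.

```yaml
# Illustrative fragment only; every name and value here is a placeholder.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: example-finetune              # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 2                     # hypothetical worker count
      template:
        spec:
          containers:
            - name: pytorch
              image: example/hf-workflow:latest   # hypothetical image
              # No `command:` key here. Setting it would replace the image
              # ENTRYPOINT, so the entrypoint script (assumed to source
              # setvars.sh) would never run.
              args:                   # passed as arguments to the ENTRYPOINT
                - torchrun
                - --nnodes=2
                - --nproc_per_node=1
                - finetune.py         # hypothetical script name
```

With the previous layout, the same torchrun/python invocation sat under `command`, which bypassed the entrypoint entirely.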

Validation

I tested this with Llama 2 and the medical meadow flashcard dataset to verify training/eval using the CCL backend with a base container from Sharvil and a rebuilt workflow container.

[x] I have tested any changes in container groups locally with test_runner.py with all existing tests passing, and I have added new tests where applicable.

dmsuehir commented 2 weeks ago

@tylertitsworth Reposted from a branch

github-actions[bot] commented 2 weeks ago

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

github-advanced-security[bot] commented 2 weeks ago

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

dmsuehir commented 5 days ago

Opened PR 238

intel / ai-containers

Update HF LLM fine tuning workflow to remove SSH setup and update PyTorchJob to not overwrite the container entrypoint #203