This repository contains Dockerfiles, scripts, YAML files, Helm charts, etc. used to scale out AI containers with versions of TensorFlow and PyTorch that have been optimized for Intel platforms. Scaling is done with Python, Docker, Kubernetes, Kubeflow, cnvrg.io, Helm, and other container orchestration frameworks for use in the cloud and on-premises.
Description
SSH setup has been moved to the multinode base image, so it can be removed from the workflow Dockerfile. Also, so that the torch-ccl setvars.sh applies, the k8s PyTorchJob no longer overwrites the container's entrypoint command.
Changes Made
Updated the HF workflow pytorchjob.yaml to put the torchrun/python command in `args` instead of `command`
Removed the explicit CCL_ATL_TRANSPORT setting (it defaults to mpi). I tested both transports, and the default (mpi) performed best
Removed the SSH install/setup from the HF workflow Dockerfile, since that's now done in the multinode IPEX base container
No doc update is needed, because the location of the torchrun/python command in the k8s spec is abstracted away by the Helm chart. The Helm values, etc. stayed the same.
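The PyTorchJob change described above can be sketched as follows. This is a hypothetical fragment, not the chart's actual output: the job name, image, replica count, and training arguments are all illustrative.

```yaml
# Sketch of a PyTorchJob worker spec after the change (illustrative values).
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: hf-finetune          # hypothetical name
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: pytorch
              image: hf-workflow:latest   # hypothetical image
              # No `command:` here: the image's entrypoint runs as built,
              # so it can source the torch-ccl setvars.sh before training.
              args:
                - torchrun
                - --nproc_per_node=1
                - finetune.py             # hypothetical script
```

In Kubernetes, `command` replaces the image's ENTRYPOINT while `args` replaces only CMD, which is why moving the launch line into `args` preserves the entrypoint's environment setup.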
[x] The code follows the project's coding standards.
[x] No Intel Internal IP is present within the changes.
[ ] The documentation has been updated to reflect any changes in functionality.
Validation
I tested this with Llama 2 and the medical meadow flashcard dataset to verify training/eval using the CCL backend, with a base container from Sharvil and a rebuilt workflow container.
[x] I have tested any changes in container groups locally with test_runner.py with all existing tests passing, and I have added new tests where applicable.