intel / ai-containers

This repository contains Dockerfiles, scripts, YAML files, Helm charts, etc. used to scale out AI containers with versions of TensorFlow and PyTorch that have been optimized for Intel platforms. Scaling is done with Python, Docker, Kubernetes, Kubeflow, cnvrg.io, Helm, and other container orchestration frameworks for use in the cloud and on-premise.
https://intel.github.io/ai-containers/
Apache License 2.0

Update HF LLM fine tuning workflow to remove SSH setup and update PyTorchJob to not overwrite the container entrypoint #189

Closed dmsuehir closed 2 weeks ago

dmsuehir commented 2 weeks ago

Description

SSH setup has been moved to the multinode base, so it can be removed from the workflow Dockerfile. Also, so that the torch ccl setvars.sh is applied, I've switched the k8s PyTorchJob to no longer overwrite the container entrypoint command.
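
For readers unfamiliar with the entrypoint behavior: in Kubernetes, a container's `command` field replaces the image ENTRYPOINT, while `args` replaces only the image CMD. The sketch below illustrates that distinction with a plain Python dict standing in for the container section of a PyTorchJob spec. The image name, script path, and helper function are hypothetical placeholders, not values from this PR, and it assumes the image's entrypoint is what applies `setvars.sh`.

```python
# Hypothetical sketch: the image name, script path, and this helper are
# placeholders for illustration, not taken from the PR's actual manifests.

def build_container_spec(image, train_args, keep_entrypoint=True):
    """Return a Kubernetes container spec as a plain dict.

    Kubernetes semantics: `command` replaces the image ENTRYPOINT,
    `args` replaces the image CMD. Omitting `command` leaves the
    ENTRYPOINT in place.
    """
    spec = {"name": "pytorch", "image": image}
    if keep_entrypoint:
        # Only `args` is set, so the image ENTRYPOINT still runs and
        # receives these values as its arguments.
        spec["args"] = list(train_args)
    else:
        # Setting `command` bypasses the image ENTRYPOINT entirely, so any
        # environment setup done there would be skipped.
        spec["command"] = list(train_args)
    return spec


train_args = ["python", "/workspace/train.py"]  # placeholder training command
preserved = build_container_spec("example/llm-finetune:latest", train_args)
overridden = build_container_spec(
    "example/llm-finetune:latest", train_args, keep_entrypoint=False
)
```

With `keep_entrypoint=True` the spec carries no `command` key, which is the shape that leaves the image's own entrypoint in control of environment setup before the training command runs.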

Changes Made

Validation

I tested this with Llama 2 and the medical meadow flashcard dataset to verify training/eval using the CCL backend with a base container from Sharvil and a rebuilt workflow container.

tylertitsworth commented 2 weeks ago

Hi @dmsuehir, can you make a branch instead, following our new contributor documentation?

dmsuehir commented 2 weeks ago

Closing this PR from my fork. I opened a new PR #203 from a branch.