To make sure our Hugging Face DLCs are well tested, we need to create "integration" tests that run different kinds of training using the containers. Those tests should run automatically or on-demand. We can use GitHub Actions as CI for running the tests and Python + Docker to implement the integration tests.
Until #3 is implemented, we can use existing containers from, e.g., `transformers` to run the tests. For the test scripts, I think we can use the existing `examples/` from `transformers`, `peft`, or `trl`. We could structure the `tests/` folder into:
- `local/` (run on a local machine GPU)
- `vertex/` (run on Vertex AI)
- `gke/` (run on GKE)
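One lightweight way to wire those three folders together, assuming we use pytest (a sketch; the marker names here are made up):

```ini
# pytest.ini (hypothetical): register one marker per target environment,
# so CI can select e.g. `pytest -m local` or `pytest -m vertex`.
[pytest]
markers =
    local: integration tests that need a local GPU
    vertex: integration tests that run on Vertex AI
    gke: integration tests that run on GKE
```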
Example for a test:
1. build a container
2. start the container on a GPU
3. run a training using the container (a few steps)
4. validate the results
5. stop the container
6. repeat steps 1-5 with the other tests
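The steps above could be sketched roughly like this, driving the Docker CLI from Python (image names and commands are placeholders; a real test would use one of our DLC images and an `examples/` script from transformers/peft/trl):

```python
import shlex
import subprocess


def build_docker_cmd(image, command, gpus=True):
    """Build the `docker run` invocation for a single training test."""
    cmd = ["docker", "run", "--rm"]
    if gpus:
        # Expose all host GPUs to the container.
        cmd += ["--gpus", "all"]
    cmd.append(image)
    cmd += shlex.split(command)
    return cmd


def run_training_test(image, command, timeout=1800):
    """Run a few training steps in the container and validate the result.

    A zero exit code is the minimal validation; a real test would also
    inspect the training logs or saved artifacts.
    """
    result = subprocess.run(
        build_docker_cmd(image, command),
        capture_output=True, text=True, timeout=timeout,
    )
    return result.returncode == 0
```

`--rm` takes care of stopping and removing the container once the short run finishes, so each test leaves the machine clean for the next one.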
In addition to "local" tests running on GPU instances, we should also run validation tests for GKE and Vertex AI.
- [ ] Implement strong CI tests, which run several tests, including training smaller models like BERT and bigger models like Llama
- [ ] Test and validate PEFT
- [ ] Distributed training
- [ ] Flash Attention support
- [ ] Tests running directly on Vertex AI or GKE using the Vertex SDK
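For the Vertex AI item, a test could submit a short custom training job with the DLC image via the `google-cloud-aiplatform` SDK. A minimal sketch, assuming credentials are configured; the display name and machine/accelerator types are assumptions that the real tests would parametrize per container:

```python
def submit_vertex_training_test(image_uri, project, location="us-central1"):
    """Hypothetical helper: run a short training job on Vertex AI as a
    validation test for one of our DLC images."""
    # Imported lazily so the module can be loaded without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    job = aiplatform.CustomContainerTrainingJob(
        display_name="dlc-integration-test",  # hypothetical name
        container_uri=image_uri,
    )
    # Run only a few steps; the container entrypoint is expected to exit 0,
    # otherwise `run` raises and the test fails.
    job.run(
        machine_type="a2-highgpu-1g",
        accelerator_type="NVIDIA_TESLA_A100",
        accelerator_count=1,
        replica_count=1,
    )
    return job
```

The same shape should work for GKE by submitting a Kubernetes Job that runs the container instead.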