openshift-psap / topsail

Test Orchestrator for Performance and Scalability of AI pLatforms
Apache License 2.0

[Fine-tuning] Test QLoRA preset #540

Closed: kpouget closed this 1 month ago

kpouget commented 1 month ago

Most of this PR is based on @albertoperdomo2's earlier and ongoing work in https://github.com/openshift-psap/topsail/pull/519

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the `lgtm` label, please ask for approval from kpouget. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/openshift-psap/topsail/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
topsail-bot[bot] commented 1 month ago

Jenkins Job #1471

:red_circle: Test of 'rhoai test test_ci' failed after 00 hours 00 minutes 37 seconds. :red_circle:

• Link to the test results.

• Link to the reports index.

Test configuration:

```
# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS=ibm_qlora_models
PR_POSITIONAL_ARG_0=fine_tuning-perf-ci
PR_POSITIONAL_ARG_1=ibm_qlora_models
```

• Link to the Rebuild page.

[Failure indicator](https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/RHODS/job/topsail/1471/artifact/run/f23-h33-000-6018r.rdu2.scalelab.redhat.com//002_test_ci/FAILURES/view/):

```
/logs/artifacts/002_test_ci/000__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/002_test_ci/000__matbenchmarking/qlora/000__qlora/000__prepare_namespace/FAILURE | KeyError: 'rhoai/mistral-7b-v0.3-gptq'
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/prepare_finetuning.py", line 207, in prepare_namespace
    download_data_sources(test_settings)
  File "/opt/topsail/src/projects/fine_tuning/testing/prepare_finetuning.py", line 182, in download_data_sources
    download_from_registry(source_name)
  File "/opt/topsail/src/projects/fine_tuning/testing/prepare_finetuning.py", line 176, in download_from_registry
    image=sources[source_name].get("download_pod_image_key", None),
KeyError: 'rhoai/mistral-7b-v0.3-gptq'
```

[...]

[Test ran on the internal Perflab CI]
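The KeyError above fires because the model preset references a data source name that is absent from the configured `sources` mapping, so the subscript fails before `.get()` is ever reached. A minimal sketch of that failure mode and a defensive lookup follows; the `sources` contents and the helper name are illustrative, not TOPSAIL's actual configuration.

```python
# Illustrative sources mapping; the real one lives in TOPSAIL's test config.
sources = {
    "rhoai/some-registered-model": {"download_pod_image_key": "fine-tuning"},
}

def download_pod_image(source_name):
    # sources[source_name] raises a bare KeyError when the preset names a
    # source missing from the config; checking first lets the error also
    # list the known entries, which makes the CI log self-explanatory.
    if source_name not in sources:
        raise KeyError(f"unknown data source {source_name!r}; "
                       f"known sources: {sorted(sources)}")
    return sources[source_name].get("download_pod_image_key", None)
```

With a guard like this, a typo or missing preset entry fails with a message naming the valid choices instead of a bare key.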

topsail-bot[bot] commented 1 month ago

Jenkins Job #1476

:red_circle: Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 05 seconds. :red_circle:

• Link to the test results.

• Link to the reports index.

Test configuration:

```
# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS=ibm_qlora_models
PR_POSITIONAL_ARG_0=fine_tuning-perf-ci
PR_POSITIONAL_ARG_1=ibm_qlora_models
```

• Link to the Rebuild page.

[Failure indicator](https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/RHODS/job/topsail/1476/artifact/run/f23-h33-000-6018r.rdu2.scalelab.redhat.com//000_prepare_ci/FAILURES/view/):

```
/logs/artifacts/000_prepare_ci/000__prepare2/000__prepare_namespace/FAILURE | TypeError: do_download() got an unexpected keyword argument 'image'
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/prepare_finetuning.py", line 207, in prepare_namespace
    download_data_sources(test_settings)
  File "/opt/topsail/src/projects/fine_tuning/testing/prepare_finetuning.py", line 184, in download_data_sources
    download_from_source(source_name)
  File "/opt/topsail/src/projects/fine_tuning/testing/prepare_finetuning.py", line 152, in download_from_source
    do_download(
TypeError: do_download() got an unexpected keyword argument 'image'
```

[Test ran on the internal Perflab CI]
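This TypeError is the classic shape of a call site updated ahead of its callee: `download_from_source` now forwards an `image=` keyword, but `do_download`'s signature was not extended to accept it. A minimal sketch under assumed signatures (the parameter names here are illustrative, not TOPSAIL's actual helpers):

```python
def do_download_old(source, storage_dir):
    # Pre-change signature: calling this with image=... raises
    # TypeError: got an unexpected keyword argument 'image'.
    return (source, storage_dir)

def do_download(source, storage_dir, image=None):
    # Accepting image with a None default keeps existing call sites
    # working while letting the updated caller forward the new keyword.
    return (source, storage_dir, image)
```

Defaulting the new parameter keeps the change backward compatible, so callers that predate the keyword need no edits.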

topsail-bot[bot] commented 1 month ago

Jenkins Job #1479

:red_circle: Test of 'rhoai test test_ci' failed after 08 hours 05 minutes 32 seconds. :red_circle:

• Link to the test results.

• Link to the reports index.

Test configuration:

```
# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS=ibm_qlora_models
PR_POSITIONAL_ARG_0=fine_tuning-perf-ci
PR_POSITIONAL_ARG_1=ibm_qlora_models
```

• Link to the Rebuild page.

[Failure indicator](https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/RHODS/job/topsail/1479/artifact/run/f23-h33-000-6018r.rdu2.scalelab.redhat.com//000_test_ci/FAILURES/view/):

```
/logs/artifacts/000_test_ci/000__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/004__qlora/002__test_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'model_name': 'mixtral-8x7b-instruct-v0.1-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 8, 'dataset_replication': 0.5, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py'} --> 2
/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/004__qlora/002__test_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/004__qlora/002__test_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'model_name': 'mixtral-8x7b-instruct-v0.1-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 8, 'dataset_replication': 0.5, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 199, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
```

[...]

[Test ran on the internal Perflab CI]
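The CalledProcessError here is not a Python bug in the harness itself: the toolbox command exited with status 2, and the `run` wrapper surfaces that through `subprocess.run` (presumably invoked with `check=True`, given the exception). A minimal sketch of that propagation, with the shell command standing in for the real `./run_toolbox.py` invocation:

```python
import subprocess

# Stand-in for the failing toolbox command; `exit 2` mimics the
# "returned non-zero exit status 2" seen in the CI log.
try:
    subprocess.run("exit 2", shell=True, check=True)
except subprocess.CalledProcessError as e:
    # check=True converts the non-zero exit status into an exception,
    # which is what the matrix-benchmark FAILURE file records.
    print(e.returncode)
```

So the traceback pinpoints where the failure surfaced, while the root cause lives in the fine-tuning job's own logs.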

topsail-bot[bot] commented 1 month ago

Jenkins Job #1480

:red_circle: Test of 'rhoai test test_ci' failed after 06 hours 58 minutes 29 seconds. :red_circle:

• Link to the test results.

• Link to the reports index.

Test configuration:

```
# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS=ibm_qlora_models
PR_POSITIONAL_ARG_0=fine_tuning-perf-ci
PR_POSITIONAL_ARG_1=ibm_qlora_models
```

• Link to the Rebuild page.

[Failure indicator](https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/RHODS/job/topsail/1480/artifact/run/f23-h33-000-6018r.rdu2.scalelab.redhat.com//000_test_ci/FAILURES/view/):

```
/logs/artifacts/000_test_ci/000__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/001__qlora/002__test_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'model_name': 'llama-3.1-405b-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 8, 'dataset_replication': 0.5, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py'} --> 2
/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/001__qlora/002__test_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/001__qlora/002__test_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'model_name': 'llama-3.1-405b-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 8, 'dataset_replication': 0.5, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 199, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
```

[...]

[Test ran on the internal Perflab CI]

topsail-bot[bot] commented 1 month ago

Jenkins Job #1481

:red_circle: Test of 'rhoai test test_ci' failed after 04 hours 26 minutes 19 seconds. :red_circle:

• Link to the test results.

• Link to the reports index.

Test configuration:

```
# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS=ibm_qlora_models
PR_POSITIONAL_ARG_0=fine_tuning-perf-ci
PR_POSITIONAL_ARG_1=ibm_qlora_models
```

• Link to the Rebuild page.

[Failure indicator](https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/RHODS/job/topsail/1481/artifact/run/f23-h33-000-6018r.rdu2.scalelab.redhat.com//000_test_ci/FAILURES/view/):

```
/logs/artifacts/000_test_ci/000__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/005__qlora/002__test_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'model_name': 'llama-3.1-405b-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 8, 'dataset_replication': 0.1, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py'} --> 2
/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/005__qlora/002__test_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/005__qlora/002__test_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'model_name': 'llama-3.1-405b-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 8, 'dataset_replication': 0.1, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 199, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
```

[...]

[Test ran on the internal Perflab CI]

topsail-bot[bot] commented 1 month ago

Jenkins Job #1482

:green_circle: Test of 'rhoai test test_ci' succeeded after 00 hours 42 minutes 32 seconds. :green_circle:

• Link to the test results.

• Link to the reports index.

Test configuration:

```
# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS=ibm_qlora_models
PR_POSITIONAL_ARG_0=fine_tuning-perf-ci
PR_POSITIONAL_ARG_1=ibm_qlora_models
```

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]