Closed kpouget closed 1 month ago
Jenkins Job #1567
:green_circle: Test of 'rhoai test test_ci' succeeded after 04 hours 17 minutes 23 seconds. :green_circle:
• Link to the test results.
• Link to the reports index.
Test configuration:
# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS=ibm_lora_qlora_models
PR_POSITIONAL_ARG_0=fine_tuning-perf-ci
PR_POSITIONAL_ARG_1=ibm_lora_qlora_models
• Link to the Rebuild page.
[Test ran on the internal Perflab CI]
Jenkins Job #1568
:red_circle: Test of 'rhoai test test_ci' failed after 03 hours 08 minutes 35 seconds. :red_circle:
• Link to the test results.
• Link to the reports index.
Test configuration:
# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS=ibm_lora_qlora_models
PR_POSITIONAL_ARG_0=fine_tuning-perf-ci
PR_POSITIONAL_ARG_1=ibm_lora_qlora_models
• Link to the Rebuild page.
[Failure indicator](https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/RHODS/job/topsail/1568/artifact/run/f23-h33-000-6018r.rdu2.scalelab.redhat.com//000_test_ci/FAILURES/view/):
/logs/artifacts/000_test_ci/000__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/004__qlora/002__test_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'model_name': 'mistral-7b-v0.3-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.5, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 1024, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 48, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py'} --> 2
/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/004__qlora/002__test_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/000__matbenchmarking/qlora/004__qlora/002__test_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'model_name': 'mistral-7b-v0.3-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.5, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 1024, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 48, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py'}"' returned non-zero exit status 2.
Traceback (most recent call last):
File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 199, in _run_test
run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
proc = subprocess.run(command, **args)
[...]
[Test ran on the internal Perflab CI]
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from kpouget. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.