Open mikewadhera opened 11 months ago
Thanks so much for providing scripts for Llama2 on Sagemaker.
I'm running the code from: https://www.philschmid.de/sagemaker-llama2-qlora
When fitting the model I get a RuntimeError:
ErrorMessage "RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
0%| | 0/276 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token <REDACTED> --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2"
2023-07-28 01:56:33,101 sagemaker-training-toolkit ERROR Encountered exit_code 1
2023-07-28 01:56:48 Uploading - Uploading generated training model
2023-07-28 01:56:48 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-18-26cb6fd8d084> in <cell line: 5>()
  3
  4 # starting the train job with our uploaded datasets as input
----> 5 huggingface_estimator.fit(data, wait=True)

5 frames
/usr/local/lib/python3.10/dist-packages/sagemaker/session.py in _check_job_status(job, desc, status_key_name)
  6734 actual_status=status,
  6735 )
-> 6736 raise exceptions.UnexpectedStatusException(
  6737 message=message,
  6738 allowed_statuses=["Completed", "Stopped"],

UnexpectedStatusException: Error for Training job huggingface-qlora-2023-07-28-01-46-14-2023-07-28-01-46-20-353: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
0%| | 0/276 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token <REDACTED> --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2", exit code: 1
Did you make any modifications to the code? Did you change the instance or any versions in requirements.txt? Is anything else different from the blog post?
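To make the version question easier to answer, here is a minimal sketch that prints the installed versions of the libraries most likely to matter. The package list is an assumption and not taken from the blog post, so adjust it to match your requirements.txt. It only uses the standard library `importlib.metadata`.

```python
from importlib import metadata

# Hypothetical package list (an assumption); edit it to match your requirements.txt.
PACKAGES = ["transformers", "peft", "accelerate", "bitsandbytes", "torch", "sagemaker"]


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to the string "not installed".
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version}")
```

Run from the notebook, this reports the local environment only. To see what the training job itself installed, the same function would have to be called inside the training container, for example near the top of run_clm.py.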