philschmid / sagemaker-huggingface-llama-2-samples

RuntimeError: cannot reshape tensor #8

Open mikewadhera opened 11 months ago

mikewadhera commented 11 months ago

Thanks so much for providing scripts for Llama 2 on SageMaker.

I'm running the code from: https://www.philschmid.de/sagemaker-llama2-qlora

When fitting the model I get a RuntimeError:

ErrorMessage "RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
 0%|          | 0/276 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token <REDACTED> --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2"
2023-07-28 01:56:33,101 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2023-07-28 01:56:48 Uploading - Uploading generated training model
2023-07-28 01:56:48 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-18-26cb6fd8d084> in <cell line: 5>()
      3 
      4 # starting the train job with our uploaded datasets as input
----> 5 huggingface_estimator.fit(data, wait=True)

5 frames
/usr/local/lib/python3.10/dist-packages/sagemaker/session.py in _check_job_status(job, desc, status_key_name)
   6734                 actual_status=status,
   6735             )
-> 6736         raise exceptions.UnexpectedStatusException(
   6737             message=message,
   6738             allowed_statuses=["Completed", "Stopped"],

UnexpectedStatusException: Error for Training job huggingface-qlora-2023-07-28-01-46-14-2023-07-28-01-46-20-353: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
 0%|          | 0/276 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token <REDACTED> --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2", exit code: 1

philschmid commented 11 months ago

Did you make any modifications to the code? Did you change the instance type, or any of the versions in requirements.txt? Is anything else different from the original setup?
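
If it helps with the version question, here is a minimal sketch that prints the installed versions of a few packages so they can be compared with the pins in requirements.txt. The package list and the `installed_versions` helper are assumptions for illustration, not taken from this repository; adjust the list to match your own requirements.txt.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical package list (an assumption); replace it with the entries
# from the requirements.txt you are actually using.
PACKAGES = ["transformers", "peft", "accelerate", "bitsandbytes", "torch", "sagemaker"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print one line per package, in a form that is easy to paste into the issue.
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Note that this reports the environment it runs in. Run in the notebook, it shows the local SDK versions; the training container installs its own packages from requirements.txt, so those versions would need to be read from the training job logs.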