h2oai / h2o-llmstudio

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs. Documentation: https://docs.h2o.ai/h2o-llmstudio/
https://h2o.ai
Apache License 2.0

[BUG] Frequent failure to push checkpoint to Hugging Face #778

Closed: tmostak closed this issue 2 months ago

tmostak commented 2 months ago

πŸ› Bug

I've seen this issue in the past, but now I've had failures 5 times in a row trying to push a model (Llama 3 70B) I trained with LoRA to Hugging Face. The push always fails with the error `Your proposed upload is smaller than the minimum allowed size`, after at least some of the safetensors files have already been uploaded to Hugging Face successfully.

As mentioned above, this used to happen to me about one time in three, but since yesterday it has occurred over and over, making it impossible to get my trained model onto Hugging Face.

Thanks in advance for your help.

To Reproduce

Try pushing a checkpoint of a large model to Hugging Face. I am uploading Llama 3 70B trained with LoRA, using the cpu_shard setting. Below are the logs:

```
2024-07-23 12:31:09,477 - INFO: Initializing client True
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-23 12:31:10,477 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29])]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-23 12:31:11,453 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29])]
2024-07-23 12:31:11,468 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id 128001.
2024-07-23 12:31:11,468 - INFO: Setting pretraining_tp of model config to 1.
2024-07-23 12:31:11,490 - INFO: Using bfloat16 for backbone
2024-07-23 12:31:11,490 - INFO: Using Flash Attention 2.
2024/07/23 12:45:37 # {"client":"fb52093e-4329-46ae-a696-5982cd3952c7","state":"DISCONNECT","t":"ws_disconnect"}
2024/07/23 12:46:01 # {"client":"a3d67d5a-1de3-4722-b740-c915a0fe61b6","state":"DISCONNECT","t":"ws_disconnect"}
2024-07-23 13:11:46,164 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
2024-07-23 13:13:47,987 - INFO: Trainable parameters count: 6627000320
2024-07-23 13:13:47,987 - INFO: Total parameters count: 77180706816
2024-07-23 13:13:47,987 - INFO: Trainable %: 8.5863%
2024-07-23 13:16:05,354 - INFO: Weights loaded from: /home/ubuntu/h2o-llmstudio/output/user/heavyiq-llama-3-70b-16k-combo-v61-5-no-cte-judge-3584-tokens-lora-r-512-a-1024-lr-1-1e-5/checkpoint.pth
2024-07-23 13:16:59,284 - INFO: Merging LORA layers with base model.
2024-07-23 13:16:59,555 - INFO: Enough space available for saving model weights. Required space: 138607.63MB, Available space: 17078424.52MB.
Token has not been saved to git credential helper. Pass add_to_git_credential=True if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ubuntu/.cache/huggingface/token
Login successful
model-00030-of-00030.safetensors: 100% 2.10G/2.10G [00:42<00:00, 50.0MB/s]
model-00022-of-00030.safetensors: 100% 4.66G/4.66G [01:29<00:00, 52.1MB/s]
model-00005-of-00030.safetensors: 100% 4.66G/4.66G [01:33<00:00, 49.9MB/s]
model-00018-of-00030.safetensors: 100% 5.00G/5.00G [01:39<00:00, 50.4MB/s]
model-00009-of-00030.safetensors: 100% 4.97G/4.97G [01:43<00:00, 48.2MB/s]
model-00023-of-00030.safetensors: 100% 5.00G/5.00G [01:44<00:00, 47.8MB/s]
model-00002-of-00030.safetensors:  83% 3.89G/4.66G [01:11<00:17, 43.4MB/s]
HTTP Error 500 thrown while requesting PUT https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com/repos/6b/fa/6bfa07cdb8bf9a3ba2855419c22ebb5a4c20017e3a6936f75b08e3656c46cb53/2d2e0bbc5dbd1e2cdd3c2395e53251ce7470cf5760533afb130f051f6f2302c9?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA2JU7TKAQLC2QXPN7%2F20240723%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240723T132539Z&X-Amz-Expires=86400&X-Amz-Signature=4be65d9cf19db19a64c12c7a54313fb242255b284d422114f21af1e266542b28&X-Amz-SignedHeaders=host&partNumber=225&uploadId=sjpYGTy_KvrXiPiatCQ5Ifvo4XjLStAjQPLjAUJPoi9VTC0Nal6IxEkamVmZAV0SUXa5SX.ZPCvehTE8XYDmiOLJzt3RLoGqzzaoLe0fCdoKZwkyTFcy4H5tcDz4S1pL&x-id=UploadPart
2024-07-23 13:28:33,493 - WARNING: HTTP Error 500 thrown while requesting PUT https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com/repos/6b/fa/6bfa07cdb8bf9a3ba2855419c22ebb5a4c20017e3a6936f75b08e3656c46cb53/2d2e0bbc5dbd1e2cdd3c2395e53251ce7470cf5760533afb130f051f6f2302c9?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA2JU7TKAQLC2QXPN7%2F20240723%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240723T132539Z&X-Amz-Expires=86400&X-Amz-Signature=4be65d9cf19db19a64c12c7a54313fb242255b284d422114f21af1e266542b28&X-Amz-SignedHeaders=host&partNumber=225&uploadId=sjpYGTy_KvrXiPiatCQ5Ifvo4XjLStAjQPLjAUJPoi9VTC0Nal6IxEkamVmZAV0SUXa5SX.ZPCvehTE8XYDmiOLJzt3RLoGqzzaoLe0fCdoKZwkyTFcy4H5tcDz4S1pL&x-id=UploadPart
Retrying in 1s [Retry 1/5].
2024-07-23 13:28:33,494 - WARNING: Retrying in 1s [Retry 1/5].
model-00002-of-00030.safetensors: 100% 4.66G/4.66G [01:26<00:00, 53.8MB/s]
model-00014-of-00030.safetensors: 100% 4.97G/4.97G [01:39<00:00, 50.1MB/s]
model-00011-of-00030.safetensors: 100% 4.66G/4.66G [01:37<00:00, 47.8MB/s]
model-00013-of-00030.safetensors: 100% 5.00G/5.00G [01:35<00:00, 52.5MB/s]
Upload 30 LFS files:  30% 9/30 [03:18<07:43, 22.09s/it]
model-00006-of-00030.safetensors: 100% 4.66G/4.66G [01:35<00:00, 48.6MB/s]
model-00020-of-00030.safetensors: 100% 4.66G/4.66G [01:31<00:00, 51.1MB/s]
model-00015-of-00030.safetensors: 100% 4.66G/4.66G [01:24<00:00, 55.0MB/s]
model-00012-of-00030.safetensors: 100% 4.66G/4.66G [01:37<00:00, 47.9MB/s]
model-00025-of-00030.safetensors: 100% 4.66G/4.66G [01:36<00:00, 48.1MB/s]
2024-07-23 13:31:00,958 - ERROR: Unknown exception
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/complete_multipart?uploadId=sjpYGTy_KvrXiPiatCQ5Ifvo4XjLStAjQPLjAUJPoi9VTC0Nal6IxEkamVmZAV0SUXa5SX.ZPCvehTE8XYDmiOLJzt3RLoGqzzaoLe0fCdoKZwkyTFcy4H5tcDz4S1pL&bucket=hf-hub-lfs-us-east-1&prefix=repos%2F6b%2Ffa%2F6bfa07cdb8bf9a3ba2855419c22ebb5a4c20017e3a6936f75b08e3656c46cb53&expiration=Wed%2C+24+Jul+2024+13%3A25%3A39+GMT&signature=99919e38c7392e45f36adc29e992396a252b30147d51959597db45a7ef3d8f55

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 401, in _wrapped_lfs_upload
    lfs_upload(operation=operation, lfs_batch_action=batch_action, token=token)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/lfs.py", line 228, in lfs_upload
    _upload_multi_part(operation=operation, header=header, chunk_size=chunk_size, upload_url=upload_action["href"])
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/lfs.py", line 334, in _upload_multi_part
    hf_raise_for_status(completion_res)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 358, in hf_raise_for_status
    raise BadRequestError(message, response=response) from e
huggingface_hub.utils._errors.BadRequestError: (Request ID: Root=1-669fb01c-6c5307b74ffc56134e74ba3e;a9a2d887-decd-4987-be09-52458e216931)

Bad request: Your proposed upload is smaller than the minimum allowed size

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/handlers.py", line 358, in handle
    await experiment_push_to_huggingface_dialog(q)
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/sections/experiment.py", line 2012, in experiment_push_to_huggingface_dialog
    publish_model_to_hugging_face(
  File "/home/ubuntu/h2o-llmstudio/./llm_studio/app_utils/hugging_face_utils.py", line 267, in publish_model_to_hugging_face
    model.backbone.push_to_hub(
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2635, in push_to_hub
    return super().push_to_hub(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/transformers/utils/hub.py", line 894, in push_to_hub
    return self._upload_modified_files(
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/transformers/utils/hub.py", line 758, in _upload_modified_files
    return create_commit(
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1227, in _inner
    return fn(self, *args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3762, in create_commit
    self.preupload_lfs_files(
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 4262, in preupload_lfs_files
    _upload_lfs_files(
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 416, in _upload_lfs_files
    thread_map(
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_studio/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 403, in _wrapped_lfs_upload
    raise RuntimeError(f"Error while uploading '{operation.path_in_repo}' to the Hub.") from exc
RuntimeError: Error while uploading 'model-00013-of-00030.safetensors' to the Hub.
```
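As an aside, the "Enough space available for saving model weights" line in the log corresponds to a free-disk-space check before the merged weights are written. A minimal sketch of such a check using only the standard library (this is an illustration, not LLM Studio's actual implementation; the function name and paths are made up):

```python
import shutil

def enough_space(path: str, required_mb: float) -> bool:
    """Return True if the filesystem holding `path` has at least
    `required_mb` megabytes free (mirrors the log's space check)."""
    free_mb = shutil.disk_usage(path).free / (1024 * 1024)
    return free_mb >= required_mb

# The log reported: Required space: 138607.63MB, Available space: 17078424.52MB
enough_space(".", 138607.63)
```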

LLM Studio version

98854a148df19299802d931a7bc2040a1fa3cf98: Add new setting: prompt_column_separator

pascal-pfeiffer commented 2 months ago

Thank you for the issue. I personally never experienced it, so this is interesting. Is that with or without HF_HUB_ENABLE_HF_TRANSFER ? In either case, could you try using the other option and see if the issue persists? To make it easier to reproduce, could you share a cfg.yaml maybe? Like, was deepspeed used? I assume this is standard Causal LM modeling?

tmostak commented 2 months ago

Hi @pascal-pfeiffer, I didn't have HF_HUB_ENABLE_HF_TRANSFER enabled, but can try that.

Note that I tried again a bit later and was able to successfully upload my model, so perhaps it's intermittent or a network issue (I am using an 8xA100 machine on Lambda).
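If the failures really are transient, one workaround is to wrap the whole push in a retry loop with exponential backoff (huggingface_hub already retries individual chunks, as the `[Retry 1/5]` log lines show, but not the overall commit). A generic sketch; the helper name and backoff schedule are made up for illustration:

```python
import time

def retry_with_backoff(fn, attempts=5, base_delay=1.0):
    """Call `fn` until it succeeds, sleeping base_delay * 2**i
    seconds after the i-th failure; re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

# Hypothetical usage around the failing call:
# retry_with_backoff(lambda: model.backbone.push_to_hub(repo_id))
```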

Let me see if this pops up again in the coming days and if so I will add the pertinent details.

Today I tried to upload a new model trained on Llama 3.1 70B but hit a different problem; I will file a separate issue for it.

pascal-pfeiffer commented 2 months ago

I was able to reproduce this with Llama 3.1 70B, after also getting an HTTP Error 500 thrown while requesting PUT for one of the chunks. This was without HF_HUB_ENABLE_HF_TRANSFER. With the flag active (what I usually do), I have never seen the issue, so I assume this is a network issue.

> I didn't have HF_HUB_ENABLE_HF_TRANSFER enabled, but can try that.

For now, I'd suggest using that flag and seeing whether it already resolves the issue.
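For reference, `HF_HUB_ENABLE_HF_TRANSFER` must be set in the environment before huggingface_hub is imported, and it requires the `hf_transfer` package to be installed (`pip install hf_transfer`). A minimal sketch of setting it from Python rather than the shell:

```python
import os

# Must happen before `import huggingface_hub` anywhere in the process;
# otherwise the flag is read too late and the default uploader is used.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# ... then launch H2O LLM Studio / run the push as usual.
```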

pascal-pfeiffer commented 2 months ago

I hope #790 solves the issue for you, if not, please reopen.