Amerehei opened this issue 2 weeks ago
```sh
accelerate launch --config_file "configs/fsdp_config.yaml" train.py \
  --seed 100 \
  --model_name_or_path "meta-llama/Llama-2-7b-hf" \
  --dataset_name "smangrul/ultrachat-10k-chatml" \
  --chat_template_format "chatml" \
  --add_special_tokens False \
  --append_concat_token False \
  --splits "train,test" \
  --max_seq_len 2048 \
  --num_train_epochs 1 \
  --logging_steps 5 \
  --log_level "info" \
  --logging_strategy "steps" \
  --eval_strategy "epoch" \
  --save_strategy "epoch" \
  --push_to_hub \
  --hub_private_repo True \
  --hub_strategy "every_save" \
  --bf16 True \
  --packing True \
  --learning_rate 1e-4 \
  --lr_scheduler_type "cosine" \
  --weight_decay 1e-4 \
  --warmup_ratio 0.0 \
  --max_grad_norm 1.0 \
  --output_dir "mistral-sft-lora-fsdp" \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --gradient_accumulation_steps 4 \
  --gradient_checkpointing True \
  --use_reentrant False \
  --dataset_text_field "content" \
  --use_flash_attn True \
  --use_peft_lora True \
  --lora_r 8 \
  --lora_alpha 16 \
  --lora_dropout 0.1 \
  --lora_target_modules "all-linear" \
  --use_4bit_quantization False
```

config.json: 100% 609/609 [00:00<00:00, 1.96MB/s]
model.safetensors.index.json: 100% 26.8k/26.8k [00:00<00:00, 69.7MB/s]
model-00001-of-00002.safetensors: 100% 9.98G/9.98G [03:51<00:00, 43.0MB/s]
model-00002-of-00002.safetensors: 100% 3.50G/3.50G [00:17<00:00, 202MB/s]
Downloading shards: 100% 2/2 [04:09<00:00, 124.72s/it]   (one such line per rank, 8 ranks)

Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dtype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
(the same pair of warnings is emitted by each of the 8 ranks)
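The fix the first warning suggests is to load the base model directly in a half-precision dtype, so Flash Attention 2 sees torch.bfloat16 instead of the float32 default. A minimal sketch, assuming the standard transformers loading path (the actual loading code in train.py may differ):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the base model in bfloat16 (matching --bf16 True in the launch command)
# so Flash Attention 2 gets a supported dtype instead of the float32 default.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```

With FSDP, the "model not initialized on GPU" warning is usually benign, since accelerate moves the shards onto the GPUs after loading.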
Loading checkpoint shards: 100% 2/2 [00:00<00:00, 4.50it/s]   (one such line per rank, 4.29-4.50it/s; one slower rank finished at [00:10<00:00, 5.09s/it])
generation_config.json: 100% 188/188 [00:00<00:00, 1.53MB/s]
tokenizer_config.json: 100% 776/776 [00:00<00:00, 7.62MB/s]
tokenizer.model: 100% 500k/500k [00:00<00:00, 38.8MB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 16.9MB/s]
special_tokens_map.json: 100% 414/414 [00:00<00:00, 1.49MB/s]
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`   (emitted by each of the 8 ranks)
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`   (emitted by each of the 8 ranks)
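These two warnings come from resizing the token embeddings after the ChatML special tokens are added, which is why the embedding matrix in the model printout further down has 32008 rows instead of Llama-2's 32000 (presumably padded to a multiple of 8). A sketch of that resize step, reusing the `model` from the loading sketch above; the token list and padding are assumptions, not the literal code in train.py:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# ChatML formatting introduces new special tokens, so the vocabulary grows
# and the embedding matrix must be resized to match.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
# To skip the mean/covariance initialization the warnings describe:
# model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8,
#                               mean_resizing=False)
```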
README.md: 100% 524/524 [00:00<00:00, 4.88MB/s]
train-00000-of-00001.parquet: 100% 35.2M/35.2M [00:00<00:00, 42.5MB/s]
test-00000-of-00001.parquet: 100% 7.08M/7.08M [00:00<00:00, 42.3MB/s]
Generating train split: 100% 10000/10000 [00:00<00:00, 44326.80 examples/s]
Generating test split: 100% 2000/2000 [00:00<00:00, 49700.55 examples/s]
Map: 100% 10/10 [00:00<00:00, 761.49 examples/s]   (one Map pass per rank)
Size of the train set: 10. Size of the validation set: 10
A sample of train dataset: {'content': "<|im_start|>user\nThese instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?\nOn your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!\nYour Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.\nDoes this feature apply to all sections of the theme or just specific ones as listed in the text material?<|im_end|>\n<|im_start|>assistant\nThis feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.<|im_end|>\n<|im_start|>user\nCan you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?<|im_end|>\n<|im_start|>assistant\nSure, here are the steps to enable the secondary image hover feature on your Collection pages and Featured Collections sections:\n\n1. Log in to your Shopify account and go to your Online Store.\n2. Click on Customize theme for the section-based theme you are using.\n3. Navigate to the Collection pages or Featured Collections section where you want to enable the secondary image hover feature.\n4. With the section open, click on the section's setting (gear) icon in the top-left corner.\n5. In the settings panel that appears, look for an option labeled 'Image display' or 'Image hover'.\n6. If available, select 'Show secondary image on hover'.\n7. Save the changes and preview the Collection/Featured Collection page to see the effect.\n\nIf you're having trouble finding the setting, the best thing to do is refer to your theme's documentation, since the location and labeling of settings can vary between themes.<|im_end|>\n<|im_start|>user\nCan you provide me with a link to the documentation for my theme?<|im_end|>\n<|im_start|>assistant\nI don't have access to your store's theme information. However, you can usually find the documentation for your theme by going to the shopify theme store, finding your theme and clicking on the 'support' link located in the bottom right corner of the page. Alternatively, you can do a google search for the name of your theme followed by 'documentation' or 'user guide'.<|im_end|>\n<|im_start|>user\nCan you confirm if this feature also works for the Quick Shop section of my theme?<|im_end|>\n<|im_start|>assistant\nThe secondary image hover feature may or may not work for your Quick Shop section, depending on the configuration of your theme. Some themes include this feature in the Quick Shop section by default, while others may require additional customization. To check if this feature is available for the Quick Shop section of your theme, follow these steps:\n\n1. Go to the Quick Shop section where you would like to enable the feature. 2. Click on the Quick Shop settings icon (gear icon) and look for 'Image display' or 'Image hover'. 3. If available, select 'Show secondary image on hover'. 4. Save the changes. If this option is not available in your Quick Shop section settings, you may need to reach out to your theme developer for assistance with customizing your Quick Shop section to include this feature.<|im_end|>\n"}
(each of the 8 ranks prints the same dataset summary and sample)

[rank3]:[W1108 16:43:16.179385976 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
(ranks 0-7 all emit the same warning for their own GPU)
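The barrier warning names its own remedy: bind each rank to its GPU before or while initializing the process group. A standalone sketch of that remedy (illustrative only; under accelerate launch the process group is normally initialized for you):

```python
import os
import torch
import torch.distributed as dist

# Pin this process to its GPU so collectives like barrier() never have to
# guess the rank -> device mapping.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),  # silences the warning
)
```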
Generating train split: 8 examples [00:00, 291.34 examples/s]
Generating train split: 8 examples [00:00, 541.47 examples/s]
[2024-11-08 16:43:58,359] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)   (one such line per process)
df: /root/.triton/autotune: No such file or directory
Using auto half precision backend
trainable params: 19,988,480 || all params: 6,758,469,632 || trainable%: 0.2958   (printed by each of the 8 ranks)
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32008, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaFlashAttention2(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (v_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (o_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): LlamaMLP(
              (gate_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=11008, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=11008, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (up_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=11008, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=11008, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (down_proj): lora.Linear(
                (base_layer): Linear(in_features=11008, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=11008, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (act_fn): SiLU()
            )
            (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
          )
        )
        (norm): LlamaRMSNorm((4096,), eps=1e-05)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (lm_head): Linear(in_features=4096, out_features=32008, bias=False)
    )
  )
)
***** Running training *****
  Num examples = 8
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 4
  Total optimization steps = 1
  Number of trainable parameters = 2,498,560
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
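For reference, a PEFT configuration that reproduces the printed adapter layout from the flags in the launch command (a sketch, not the literal code in train.py). The two parameter counts are also consistent with each other: the banner's 2,498,560 trainable parameters is the PEFT count of 19,988,480 divided across the 8 FSDP shards.

```python
from peft import LoraConfig, get_peft_model

# Mirrors --lora_r 8 --lora_alpha 16 --lora_dropout 0.1
# --lora_target_modules "all-linear": LoRA adapters on every linear
# projection (q/k/v/o_proj, gate/up/down_proj), as in the module tree above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 19,988,480 || all params: 6,758,469,632 || trainable%: 0.2958
```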
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 2
wandb: You chose 'Use an existing W&B account'
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Tracking run with wandb version 0.18.6
wandb: Run data is saved locally in /workspace/wandb/run-20241108_164934-d2cvs1zs
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run mistral-sft-lora-fsdp
wandb: ⭐️ View project at https://wandb.ai/a-amerehi/huggingface
wandb: 🚀 View run at https://wandb.ai/a-amerehi/huggingface/runs/d2cvs1zs
100% 1/1 [00:15<00:00, 15.91s/it]
/usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html
  warnings.warn(
(this FutureWarning is raised by each of the 8 ranks)
API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . warnings.warn( /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . warnings.warn( /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . warnings.warn( /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . warnings.warn( /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:732: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. local_shape = tensor.shape /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:744: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.shape, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:746: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.dtype, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:747: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.device, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:732: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. local_shape = tensor.shape /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:744: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.shape, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:746: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.dtype, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:747: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.device, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:732: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. 
local_shape = tensor.shape /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:744: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.shape, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:746: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.dtype, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:747: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.device, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:732: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. local_shape = tensor.shape /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:744: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.shape, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:746: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.dtype, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:747: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.device, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:732: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. local_shape = tensor.shape /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:744: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.shape, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:746: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.dtype, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:747: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.device, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:732: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. local_shape = tensor.shape /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:732: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. local_shape = tensor.shape /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:732: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. local_shape = tensor.shape /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:744: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.shape, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:744: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.shape, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:746: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.dtype, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:746: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. 
tensor.dtype, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:747: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.device, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:747: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.device, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:744: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.shape, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:746: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.dtype, /usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:747: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor. tensor.device, /usr/local/lib/python3.11/dist-packages/peft/utils/save_and_load.py:260: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning. warnings.warn( loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/config.json /usr/local/lib/python3.11/dist-packages/peft/utils/save_and_load.py:260: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning. warnings.warn( /usr/local/lib/python3.11/dist-packages/peft/utils/save_and_load.py:260: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning. warnings.warn( Model config LlamaConfig { "_name_or_path": "meta-llama/Llama-2-7b-hf", "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 1, "eos_token_id": 2, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 10000.0, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.47.0.dev0", "use_cache": true, "vocab_size": 32000 } /usr/local/lib/python3.11/dist-packages/peft/utils/save_and_load.py:260: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning. warnings.warn( /usr/local/lib/python3.11/dist-packages/peft/utils/save_and_load.py:260: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning. warnings.warn( /usr/local/lib/python3.11/dist-packages/peft/utils/save_and_load.py:260: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning. warnings.warn( /usr/local/lib/python3.11/dist-packages/peft/utils/save_and_load.py:260: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning. warnings.warn( /usr/local/lib/python3.11/dist-packages/peft/utils/save_and_load.py:260: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning. 
/usr/local/lib/python3.11/dist-packages/accelerate/utils/fsdp_utils.py:108: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions. Please use `save` instead.
  dist_cp.save_state_dict(
[this FutureWarning is likewise emitted once per rank]
Traceback (most recent call last):
  File "/workspace/train.py", line 155, in <module>
    main(model_args, data_args, training_args)
  File "/workspace/train.py", line 139, in main
    trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2132, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2562, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3025, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial)
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3160, in _save_checkpoint
    self._save_optimizer_and_scheduler(output_dir)
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3276, in _save_optimizer_and_scheduler
    save_fsdp_optimizer(
  File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/fsdp_utils.py", line 186, in save_fsdp_optimizer
    optim_state = FSDP.optim_state_dict(model, optimizer)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1890, in optim_state_dict
    return FullyShardedDataParallel._optim_state_dict_impl(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1301, in _optim_state_dict_impl
    return _optim_state_dict(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_optim_utils.py", line 2015, in _optim_state_dict
    fsdp_osd["param_groups"] = _unflatten_param_groups(
                               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_optim_utils.py", line 1271, in _unflatten_param_groups
    nested_unflat_param_names = [
                                ^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/fsdp/_optim_utils.py", line 1272, in <listcomp>
    param_to_fqns[param] for param in param_group_params
    ~~~~~~~~~~~~~^^^^^^^
KeyError: Parameter containing:
tensor([[ 0.0007, -0.0035, -0.0132, ...,  0.0048,  0.0075, -0.0131],
        [-0.0077,  0.0071,  0.0069, ...,  0.0037,  0.0114, -0.0142],
        [-0.0058,  0.0103, -0.0030, ..., -0.0134,  0.0156,  0.0019],
        ...,
        [ 0.0084,  0.0016, -0.0019, ..., -0.0135, -0.0142, -0.0084],
        [-0.0133, -0.0083,  0.0022, ..., -0.0101,  0.0025, -0.0026],
        [ 0.0148, -0.0037,  0.0084, ..., -0.0073, -0.0091,  0.0124]],
       device='cuda:0', requires_grad=True)
[every rank (cuda:0 through cuda:7) raises this identical KeyError; the duplicated per-rank tracebacks are collapsed here]
wandb: 🚀 View run mistral-sft-lora-fsdp at: https://wandb.ai/a-amerehi/huggingface/runs/d2cvs1zs
wandb: Find logs at: wandb/run-20241108_164934-d2cvs1zs/logs
W1108 16:49:59.928000 2163 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2243 closing signal SIGTERM
[the same closing signal is sent to processes 2244, 2245, 2247, 2248, 2249 and 2250]
E1108 16:50:01.148000 2163 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 2246) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1155, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-08_16:49:59
  host      : e3997253d925
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2246)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I have one more question: why are there so many warnings in the log, especially deprecation warnings?
@BenjaminBossan @qgallouedec Any idea?
Sorry for the delay in replying, @Amerehei; we're currently at a company offsite. Hopefully at the start of next week I'll have the opportunity to try to reproduce this and will report back.
Thanks, Benjamin, for your response.
I finally got around to testing this. I tried to stick closely to your settings but used only 2 GPUs and reduced some numbers, like the batch size, for memory. Regarding the packages, I use trl 0.12.1 and torch 2.5.1. At first, training seemed to run fine. But when I changed the `save_strategy` to every 3 steps to trigger a checkpoint more quickly, I got the same error as you. So I assume that for you, the training itself also works; it's just that saving the model checkpoint fails.
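For anyone who wants to hit the failing save path quickly: the only change needed is the checkpointing schedule. A minimal sketch using standard `TrainingArguments` flags (the repo's train.py may expose these under different CLI names):

```python
from transformers import TrainingArguments

# Force a checkpoint after 3 optimizer steps so the FSDP optimizer-state save
# runs almost immediately instead of only at the end of the epoch.
training_args = TrainingArguments(
    output_dir="mistral-sft-lora-fsdp",
    save_strategy="steps",          # instead of "epoch"
    save_steps=3,                   # checkpoint every 3 steps
    per_device_train_batch_size=2,  # reduced for memory on 2 GPUs
    gradient_accumulation_steps=4,
    bf16=True,
)
```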
As a next step, I switched to a much smaller model (opt-125m) and tried full fine-tuning to check whether the error is PEFT-related. Interestingly, I got the same type of error (`KeyError: Parameter containing: ...`). This makes it likely that the issue is not directly PEFT-related. It could instead be an error in the `train.py` script, or an error with the `SFTTrainer` or accelerate. I tried an older trl version (0.10.1), but got the same error. Downgrading accelerate resulted in other errors. Searching for the error message, I didn't find much at all.
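For concreteness, here is roughly the kind of PEFT-free check described above. This is a sketch, not the exact script used; the dataset slice, step counts, and output directory are placeholders:

```python
# Minimal FSDP checkpointing check without any PEFT/LoRA involved.
# Launch with something like:
#   accelerate launch --config_file configs/fsdp_config.yaml repro.py
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Small slice of the same dataset; size is arbitrary for this test.
train_dataset = load_dataset("smangrul/ultrachat-10k-chatml", split="train[:64]")

trainer = SFTTrainer(
    model="facebook/opt-125m",           # small model, full fine-tuning
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="fsdp-checkpoint-repro",
        dataset_text_field="content",
        max_steps=6,
        save_strategy="steps",
        save_steps=3,                     # the KeyError fires as soon as the optimizer state is saved
        per_device_train_batch_size=2,
        bf16=True,
    ),
)
trainer.train()
```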
All this leaves me a bit puzzled. Tentatively pinging @muellerzr in case he has come across this error or knows someone else who might have.
PS: Also tried `fsdp_use_orig_params: true`, but no luck.
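For anyone reproducing: that flag lives in the accelerate FSDP config file. A sketch of the relevant section, assuming an otherwise standard configs/fsdp_config.yaml (the other keys shown here are typical values, not necessarily the exact ones used):

```yaml
# Excerpt of an accelerate FSDP config; only fsdp_use_orig_params was toggled for this test.
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_use_orig_params: true   # set to true here; it did not fix the KeyError
```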
I have the same issue while saving a fine-tuned model with QLoRA.
Thanks for the additional feedback. I did some more testing and I could get the checkpoint to work by downgrading to the following packages:
Note that those are most likely not the exact maximal versions, but it's very hard to figure those out, as I had to change all 4 of them together because of their mutual dependencies.
@vrancurel @Amerehei It would be great if you could test this out and report back whether those versions solve the issue for you too. If that's the case, it confirms my suspicion that the error is not PEFT-related.
@BenjaminBossan I'm not sure if I did it right, but I have a different problem.
I ran the following command to downgrade the libraries:
`pip install trl==0.11.0 "tokenizers>=0.19,<0.20" transformers==4.44.2 accelerate==0.33.0`
After running the training again, I get:
Running command: accelerate launch --config_file configs/fsdp_config.yaml train.py --seed 100 --model_name_or_path meta-llama/Llama-2-7b-hf --dataset_name smangrul/ultrachat-10k-chatml --chat_template_format chatml --add_special_tokens False --append_concat_token False --splits train,test --max_seq_len 2048 --num_train_epochs 1 --logging_steps 5 --log_level info --logging_strategy steps --eval_strategy epoch --save_strategy epoch --push_to_hub --hub_private_repo True --hub_strategy every_save --bf16 True --packing True --learning_rate 1e-4 --lr_scheduler_type cosine --weight_decay 1e-4 --warmup_ratio 0.0 --max_grad_norm 1.0 --output_dir mistral-sft-lora-fsdp --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --gradient_accumulation_steps 4 --gradient_checkpointing True --use_reentrant False --dataset_text_field content --use_flash_attn True --use_peft_lora True --lora_r 8 --lora_alpha 16 --lora_dropout 0.1 --lora_target_modules all-linear --use_4bit_quantization False config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 609/609 [00:00<00:00, 2.50MB/s] [rank3]: Traceback (most recent call last): [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1603, in _get_module [rank3]: return importlib.import_module("." + module_name, self.__name__) [rank3]: File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module [rank3]: return _bootstrap._gcd_import(name[level:], package, level) [rank3]: File "", line 1050, in _gcd_import [rank3]: File " ", line 1027, in _find_and_load [rank3]: File " ", line 1006, in _find_and_load_unlocked [rank3]: File " ", line 688, in _load_unlocked [rank3]: File " ", line 883, in exec_module [rank3]: File " ", line 241, in _call_with_frames_removed [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 32, in [rank3]: from ...modeling_flash_attention_utils import _flash_attention_forward [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_flash_attention_utils.py", line 27, in [rank3]: from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa [rank3]: File "/usr/local/lib/python3.10/dist-packages/flash_attn/__init__.py", line 3, in [rank3]: from flash_attn.flash_attn_interface import ( [rank3]: File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 10, in [rank3]: import flash_attn_2_cuda as flash_attn_cuda [rank3]: ImportError: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE [rank3]: The above exception was the direct cause of the following exception: [rank3]: Traceback (most recent call last): [rank3]: File "/workspace/train.py", line 155, in [rank3]: main(model_args, data_args, training_args) [rank3]: File "/workspace/train.py", line 101, in main [rank3]: model, peft_config, tokenizer = create_and_prepare_model(model_args, data_args, training_args) [rank3]: File "/workspace/utils.py", line 141, in create_and_prepare_model [rank3]: model = AutoModelForCausalLM.from_pretrained( [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained [rank3]: model_class = _get_model_class(config, 
cls._model_mapping) [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 384, in _get_model_class [rank3]: supported_models = model_mapping[type(config)] [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 735, in __getitem__ [rank3]: return self._load_attr_from_module(model_type, model_name) [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 749, in _load_attr_from_module [rank3]: return getattribute_from_module(self._modules[module_name], attr) [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 693, in getattribute_from_module [rank3]: if hasattr(module, attr): [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1593, in __getattr__ [rank3]: module = self._get_module(self._class_to_module[name]) [rank3]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1605, in _get_module [rank3]: raise RuntimeError( [rank3]: RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback): [rank3]: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE [rank0]: Traceback (most recent call last): [rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1603, in _get_module [rank0]: return importlib.import_module("." + module_name, self.__name__) [rank0]: File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module [rank0]: return _bootstrap._gcd_import(name[level:], package, level) [rank0]: File " ", line 1050, in _gcd_import [rank0]: File " ", line 1027, in _find_and_load [rank0]: File " ", line 1006, in _find_and_load_unlocked [rank0]: File " ", line 688, in _load_unlocked [rank0]: File " ", line 883, in exec_module [rank0]: File " ", line 241, in _call_with_frames_removed [rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 32, in [rank0]: from ...modeling_flash_attention_utils import _flash_attention_forward [rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_flash_attention_utils.py", line 27, in [rank0]: from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa [rank0]: File "/usr/local/lib/python3.10/dist-packages/flash_attn/__init__.py", line 3, in [rank0]: from flash_attn.flash_attn_interface import ( [rank0]: File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 10, in [rank0]: import flash_attn_2_cuda as flash_attn_cuda [rank0]: ImportError: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE [rank0]: The above exception was the direct cause of the following exception: [rank0]: Traceback (most recent call last): [rank0]: File "/workspace/train.py", line 155, in [rank0]: main(model_args, data_args, training_args) [rank0]: File "/workspace/train.py", line 101, in main [rank0]: model, peft_config, tokenizer = create_and_prepare_model(model_args, data_args, training_args) [rank0]: File 
"/workspace/utils.py", line 141, in create_and_prepare_model [rank0]: model = AutoModelForCausalLM.from_pretrained( [rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained [rank0]: model_class = _get_model_class(config, cls._model_mapping) [rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 384, in _get_model_class [rank0]: supported_models = model_mapping[type(config)] [rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 735, in __getitem__ [rank0]: return self._load_attr_from_module(model_type, model_name) [rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 749, in _load_attr_from_module [rank0]: return getattribute_from_module(self._modules[module_name], attr) [rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 693, in getattribute_from_module [rank0]: if hasattr(module, attr): [rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1593, in __getattr__ [rank0]: module = self._get_module(self._class_to_module[name]) [rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1605, in _get_module [rank0]: raise RuntimeError( [rank0]: RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback): [rank0]: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE [rank2]: Traceback (most recent call last): [rank2]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1603, in _get_module [rank2]: return importlib.import_module("." 
+ module_name, self.__name__) [rank2]: File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module [rank2]: return _bootstrap._gcd_import(name[level:], package, level) [rank2]: File " ", line 1050, in _gcd_import [rank2]: File " ", line 1027, in _find_and_load [rank2]: File " ", line 1006, in _find_and_load_unlocked [rank2]: File " ", line 688, in _load_unlocked [rank2]: File " ", line 883, in exec_module [rank2]: File " ", line 241, in _call_with_frames_removed [rank2]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 32, in [rank2]: from ...modeling_flash_attention_utils import _flash_attention_forward [rank2]: File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_flash_attention_utils.py", line 27, in [rank2]: from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa [rank2]: File "/usr/local/lib/python3.10/dist-packages/flash_attn/__init__.py", line 3, in [rank2]: from flash_attn.flash_attn_interface import ( [rank2]: File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 10, in [rank2]: import flash_attn_2_cuda as flash_attn_cuda [rank2]: ImportError: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE [rank2]: The above exception was the direct cause of the following exception: [rank2]: Traceback (most recent call last): [rank2]: File "/workspace/train.py", line 155, in [rank2]: main(model_args, data_args, training_args) [rank2]: File "/workspace/train.py", line 101, in main [rank2]: model, peft_config, tokenizer = create_and_prepare_model(model_args, data_args, training_args) [rank2]: File "/workspace/utils.py", line 141, in create_and_prepare_model [rank2]: model = AutoModelForCausalLM.from_pretrained( [rank2]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained [rank2]: model_class = _get_model_class(config, cls._model_mapping) [rank2]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 384, in _get_model_class [rank2]: supported_models = model_mapping[type(config)] [rank2]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 735, in __getitem__ [rank2]: return self._load_attr_from_module(model_type, model_name) [rank2]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 749, in _load_attr_from_module [rank2]: return getattribute_from_module(self._modules[module_name], attr) [rank2]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 693, in getattribute_from_module [rank2]: if hasattr(module, attr): [rank2]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1593, in __getattr__ [rank2]: module = self._get_module(self._class_to_module[name]) [rank2]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1605, in _get_module [rank2]: raise RuntimeError( [rank2]: RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback): [rank2]: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: 
_ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE [rank1]: Traceback (most recent call last): [rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1603, in _get_module [rank1]: return importlib.import_module("." + module_name, self.__name__) [rank1]: File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module [rank1]: return _bootstrap._gcd_import(name[level:], package, level) [rank1]: File " ", line 1050, in _gcd_import [rank1]: File " ", line 1027, in _find_and_load [rank1]: File " ", line 1006, in _find_and_load_unlocked [rank1]: File " ", line 688, in _load_unlocked [rank1]: File " ", line 883, in exec_module [rank1]: File " ", line 241, in _call_with_frames_removed [rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 32, in [rank1]: from ...modeling_flash_attention_utils import _flash_attention_forward [rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_flash_attention_utils.py", line 27, in [rank1]: from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa [rank1]: File "/usr/local/lib/python3.10/dist-packages/flash_attn/__init__.py", line 3, in [rank1]: from flash_attn.flash_attn_interface import ( [rank1]: File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 10, in [rank1]: import flash_attn_2_cuda as flash_attn_cuda [rank1]: ImportError: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE [rank1]: The above exception was the direct cause of the following exception: [rank1]: Traceback (most recent call last): [rank1]: File "/workspace/train.py", line 155, in [rank1]: main(model_args, data_args, training_args) [rank1]: File "/workspace/train.py", line 101, in main [rank1]: model, peft_config, tokenizer = create_and_prepare_model(model_args, data_args, training_args) [rank1]: File "/workspace/utils.py", line 141, in create_and_prepare_model [rank1]: model = AutoModelForCausalLM.from_pretrained( [rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained [rank1]: model_class = _get_model_class(config, cls._model_mapping) [rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 384, in _get_model_class [rank1]: supported_models = model_mapping[type(config)] [rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 735, in __getitem__ [rank1]: return self._load_attr_from_module(model_type, model_name) [rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 749, in _load_attr_from_module [rank1]: return getattribute_from_module(self._modules[module_name], attr) [rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 693, in getattribute_from_module [rank1]: if hasattr(module, attr): [rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1593, in __getattr__ [rank1]: module = self._get_module(self._class_to_module[name]) [rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1605, in _get_module 
[rank1]: raise RuntimeError( [rank1]: RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback): [rank1]: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE W1121 13:30:42.651000 1098 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1167 closing signal SIGTERM E1121 13:30:42.815000 1098 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1164) of binary: /usr/bin/python Traceback (most recent call last): File "/usr/local/bin/accelerate", line 8, in sys.exit(main()) File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main args.func(args) File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1093, in launch_command multi_gpu_launcher(args) File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher distrib_run.run(args) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-11-21_13:30:42 host : 9fd6b25bf7af rank : 1 (local_rank: 1) exitcode : 1 (pid: 1165) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-11-21_13:30:42 host : 9fd6b25bf7af rank : 2 (local_rank: 2) exitcode : 1 (pid: 1166) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-11-21_13:30:42 host : 9fd6b25bf7af rank : 0 (local_rank: 0) exitcode : 1 (pid: 1164) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================
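For what it's worth, the missing symbol demangles to a core libtorch operator (`at::_ops::zeros::call(...)`), which is the signature of a flash-attn wheel compiled against a different torch than the one installed. A quick way to see that (a sketch; it assumes binutils' `c++filt` is on PATH):

```python
# Demangle the unresolved symbol from the ImportError above; the result is a
# plain libtorch operator, i.e. the flash-attn extension expects a torch C++
# ABI that the installed torch does not provide.
import subprocess

sym = ("_ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalI"
       "NS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE")
print(subprocess.run(["c++filt", sym], capture_output=True, text=True).stdout)
# -> at::_ops::zeros::call(c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, ...)
```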
Package Version
--------------------------------- -------------
absl-py 2.1.0
accelerate 0.33.0
aiohappyeyeballs 2.4.3
aiohttp 3.11.6
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.0.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
asttokens 2.4.1
async-lru 2.0.4
async-timeout 5.0.1
attrs 23.1.0
Babel 2.13.1
beautifulsoup4 4.12.2
bitsandbytes 0.44.1
bleach 6.1.0
blinker 1.4
certifi 2022.12.7
cffi 1.16.0
charset-normalizer 2.1.1
click 8.1.7
comm 0.2.0
contourpy 1.3.1
cryptography 3.4.8
cut-cross-entropy 24.11.4
cycler 0.12.1
datasets 3.1.0
datatrove 0.3.0
dbus-python 1.2.18
debugpy 1.8.0
decorator 5.1.1
deepspeed 0.15.4
defusedxml 0.7.1
Deprecated 1.2.15
dill 0.3.8
distro 1.7.0
docker-pycreds 0.4.0
docstring_parser 0.16
einops 0.8.0
entrypoints 0.4
evaluate 0.4.3
exceptiongroup 1.1.3
executing 2.0.1
fastjsonschema 2.18.1
filelock 3.9.0
flash-attn 2.7.0.post2
fonttools 4.55.0
fqdn 1.5.1
frozenlist 1.5.0
fsspec 2024.9.0
gitdb 4.0.11
GitPython 3.1.43
grpcio 1.68.0
hf_transfer 0.1.8
hjson 3.1.0
httplib2 0.20.2
huggingface-hub 0.26.2
humanize 4.11.0
idna 3.4
importlib-metadata 4.6.4
ipykernel 6.26.0
ipython 8.17.2
ipython-genutils 0.2.0
ipywidgets 8.1.1
isoduration 20.11.0
jedi 0.19.1
jeepney 0.7.1
Jinja2 3.1.2
joblib 1.4.2
json5 0.9.14
jsonpointer 2.4
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
jupyter-archive 3.4.0
jupyter_client 7.4.9
jupyter-contrib-core 0.4.2
jupyter-contrib-nbextensions 0.7.0
jupyter_core 5.5.0
jupyter-events 0.9.0
jupyter-highlight-selected-word 0.2.0
jupyter-lsp 2.2.0
jupyter-nbextensions-configurator 0.6.3
jupyter_server 2.10.0
jupyter_server_terminals 0.4.4
jupyterlab 4.0.8
jupyterlab-pygments 0.2.2
jupyterlab_server 2.25.0
jupyterlab-widgets 3.0.9
keyring 23.5.0
kiwisolver 1.4.7
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
loguru 0.7.2
lxml 4.9.3
Markdown 3.7
markdown-it-py 3.0.0
MarkupSafe 2.1.2
matplotlib 3.9.2
matplotlib-inline 0.1.6
mdurl 0.1.2
mistune 3.0.2
more-itertools 8.10.0
mpmath 1.3.0
msgpack 1.1.0
multidict 6.1.0
multiprocess 0.70.16
nbclassic 1.0.0
nbclient 0.9.0
nbconvert 7.11.0
nbformat 5.9.2
nest-asyncio 1.5.8
networkx 3.0
ninja 1.11.1.1
nltk 3.9.1
notebook 6.5.5
notebook_shim 0.2.3
numpy 1.26.4
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
oauthlib 3.2.0
overrides 7.4.0
packaging 23.2
pandas 2.2.3
pandocfilters 1.5.0
parso 0.8.3
peft 0.13.3.dev0
pexpect 4.8.0
Pillow 9.3.0
pip 23.3.1
platformdirs 3.11.0
prometheus-client 0.18.0
prompt-toolkit 3.0.39
propcache 0.2.0
protobuf 3.20.3
psutil 5.9.6
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
pyarrow 18.0.0
pycparser 2.21
pydantic 2.10.0
pydantic_core 2.27.0
PyGithub 2.5.0
Pygments 2.16.1
PyGObject 3.42.1
PyJWT 2.10.0
PyNaCl 1.5.0
pyparsing 2.4.7
python-apt 2.4.0+ubuntu2
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2024.2
PyYAML 6.0.1
pyzmq 24.0.1
referencing 0.30.2
regex 2024.11.6
requests 2.32.3
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.9.4
rpds-py 0.12.0
safetensors 0.4.5
scikit-learn 1.5.2
scipy 1.14.1
SecretStorage 3.3.1
Send2Trash 1.8.2
sentencepiece 0.2.0
sentry-sdk 2.18.0
setproctitle 1.3.4
setuptools 68.2.2
shtab 1.7.1
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
soupsieve 2.5
stack-data 0.6.3
sympy 1.13.1
tensorboard 2.18.0
tensorboard-data-server 0.7.2
terminado 0.17.1
threadpoolctl 3.5.0
tiktoken 0.8.0
tinycss2 1.2.1
tokenizers 0.19.1
tomli 2.0.1
torch 2.5.1
torchaudio 2.1.0+cu118
torchvision 0.16.0+cu118
tornado 6.3.3
tqdm 4.67.0
traitlets 5.13.0
transformers 4.44.2
triton 3.1.0
trl 0.11.0
types-python-dateutil 2.8.19.14
typing_extensions 4.12.2
tyro 0.9.1
tzdata 2024.2
unsloth 2024.11.8
unsloth_zoo 2024.11.6
uri-template 1.3.0
urllib3 1.26.13
wadllib 1.3.6
wandb 0.18.7
wcwidth 0.2.9
webcolors 1.13
webencodings 0.5.1
websocket-client 1.6.4
Werkzeug 3.1.3
wheel 0.45.0
widgetsnbextension 4.0.9
wrapt 1.16.0
xformers 0.0.28.post3
xxhash 3.5.0
yarl 1.17.2
zipp 1.0.0
Thanks for trying it out @Amerehei. The error seems to be caused by flash attention 2 and is probably unrelated to the initial issue. Could you try rebuilding the package or not using flash attention?
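A quick way to confirm that diagnosis before rebuilding (a sketch, not part of the training script; note that the environment above also pairs torch 2.5.1 with `torchaudio`/`torchvision` +cu118 builds from the torch 2.1 era, which points the same way):

```python
# Check whether the installed flash-attn wheel can actually be imported by the
# torch that is currently installed; the ImportError below is the same one the
# training run hits inside transformers.
import torch

print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)

try:
    import flash_attn
    print("flash-attn", flash_attn.__version__, "imports cleanly")
except ImportError as err:
    print("flash-attn is broken for this torch build:", err)
    # Rebuilding against the current torch usually fixes it, e.g.:
    #   pip uninstall -y flash-attn
    #   pip install flash-attn --no-build-isolation --no-cache-dir
```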
@BenjaminBossan I've set `--use_flash_attn False`, and in another run entirely removed it from the command. In both cases I got the same error.

I also commented out the `attn_implementation` param passed to `AutoModelForCausalLM.from_pretrained`. Same problem with `attn_implementation="sdpa"`.
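(On why disabling the flag can't help here: per the traceback, `transformers/modeling_flash_attention_utils.py` runs `from flash_attn.bert_padding import ...` at module-import time whenever it believes flash-attn is installed, so a broken wheel takes down the Llama import regardless of `attn_implementation`. A sketch of the metadata-level check transformers uses:)

```python
# transformers decides flash-attn availability from package metadata (name and
# version), not by importing it; if this returns True, the Llama modeling code
# will `import flash_attn` on import, and a wheel with an unresolved symbol
# fails there even when attn_implementation="sdpa" or is unset.
from transformers.utils import is_flash_attn_2_available

print("flash-attn considered available:", is_flash_attn_2_available())
```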
@BenjaminBossan I found my mistake: the PyTorch 2.1 image was selected by default in today's run, so forget the flash attention problem. Here is the new experiment result:
Running command: accelerate launch --config_file configs/fsdp_config.yaml train.py --seed 100 --model_name_or_path meta-llama/Llama-2-7b-hf --dataset_name smangrul/ultrachat-10k-chatml --chat_template_format chatml --add_special_tokens False --append_concat_token False --splits train,test --max_seq_len 2048 --num_train_epochs 1 --logging_steps 5 --log_level info --logging_strategy steps --eval_strategy epoch --save_strategy epoch --push_to_hub --hub_private_repo True --hub_strategy every_save --bf16 True --packing True --learning_rate 1e-4 --lr_scheduler_type cosine --weight_decay 1e-4 --warmup_ratio 0.0 --max_grad_norm 1.0 --output_dir mistral-sft-lora-fsdp --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --gradient_accumulation_steps 4 --gradient_checkpointing True --use_reentrant False --dataset_text_field content --use_flash_attn True --use_peft_lora True --lora_r 8 --lora_alpha 16 --lora_dropout 0.1 --lora_target_modules all-linear --use_4bit_quantization False

Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
(The same pair of warnings is printed again for LlamaModel and repeated by each of the four ranks.)
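These warnings come from the model being materialized in float32 on CPU before FSDP shards it, and they can be avoided by loading the model the way the warning itself suggests. A minimal sketch with the dtype implied by `--bf16 True` (the rest of the training setup is unchanged; whether the fp32 CPU load is intentional for FSDP mixed precision is a separate question):

```python
# Load the base model directly in bf16 so Flash Attention 2 sees a supported
# dtype from the start, instead of warning about float32 CPU weights.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,               # matches --bf16 True
    attn_implementation="flash_attention_2",  # matches --use_flash_attn True
)
```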
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 9.16it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 7.49it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 4.98it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00, 5.39s/it]  (rank 0, a few seconds later)
Size of the train set: 10. Size of the validation set: 10
A sample of train dataset: {'content': "<|im_start|>user\nThese instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?\nOn your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!\nYour Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.\nDoes this feature apply to all sections of the theme or just specific ones as listed in the text material?<|im_end|>\n<|im_start|>assistant\nThis feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.<|im_end|>\n<|im_start|>user\nCan you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?<|im_end|>\n<|im_start|>assistant\nSure, here are the steps to enable the secondary image hover feature on your Collection pages and Featured Collections sections:\n\n1. Log in to your Shopify account and go to your Online Store.\n2. Click on Customize theme for the section-based theme you are using.\n3. Navigate to the Collection pages or Featured Collections section where you want to enable the secondary image hover feature.\n4. With the section open, click on the section's setting (gear) icon in the top-left corner.\n5. In the settings panel that appears, look for an option labeled 'Image display' or 'Image hover'.\n6. If available, select 'Show secondary image on hover'.\n7. Save the changes and preview the Collection/Featured Collection page to see the effect.\n\nIf you're having trouble finding the setting, the best thing to do is refer to your theme's documentation, since the location and labeling of settings can vary between themes.<|im_end|>\n<|im_start|>user\nCan you provide me with a link to the documentation for my theme?<|im_end|>\n<|im_start|>assistant\nI don't have access to your store's theme information. However, you can usually find the documentation for your theme by going to the shopify theme store, finding your theme and clicking on the 'support' link located in the bottom right corner of the page. Alternatively, you can do a google search for the name of your theme followed by 'documentation' or 'user guide'.<|im_end|>\n<|im_start|>user\nCan you confirm if this feature also works for the Quick Shop section of my theme?<|im_end|>\n<|im_start|>assistant\nThe secondary image hover feature may or may not work for your Quick Shop section, depending on the configuration of your theme. Some themes include this feature in the Quick Shop section by default, while others may require additional customization. To check if this feature is available for the Quick Shop section of your theme, follow these steps:\n\n1. Go to the Quick Shop section where you would like to enable the feature. 2. Click on the Quick Shop settings icon (gear icon) and look for 'Image display' or 'Image hover'. 3. If available, select 'Show secondary image on hover'. 4. Save the changes. If this option is not available in your Quick Shop section settings, you may need to reach out to your theme developer for assistance with customizing your Quick Shop section to include this feature.<|im_end|>\n"}
(Each of the four ranks prints the same dataset sizes and the same sample.)
[rank3]:[W1121 14:40:51.201069492 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
(Ranks 0, 1 and 2 print the same warning for their respective GPUs.)
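The NCCL warning just means the first barrier runs before each process has declared which GPU it owns; pinning each rank to its device before any collective removes the guesswork. A sketch assuming the launcher's usual `LOCAL_RANK` environment variable (accelerate normally handles this itself, so treat it as illustration rather than a required fix):

```python
# Bind each process to its GPU before the first collective so NCCL never has
# to guess which device a rank owns.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group("nccl", device_id=torch.device(f"cuda:{local_rank}"))

# For an existing setup, the device can also be passed to the barrier directly:
dist.barrier(device_ids=[local_rank])
```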
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Using auto half precision backend
trainable params: 19,988,480 || all params: 6,758,469,632 || trainable%: 0.2958
(the same summary is printed by each of the four ranks)
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32008, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaFlashAttention2(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict( (default): Dropout(p=0.1, inplace=False) )
                (lora_A): ModuleDict( (default): Linear(in_features=4096, out_features=8, bias=False) )
                (lora_B): ModuleDict( (default): Linear(in_features=8, out_features=4096, bias=False) )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj), (v_proj), (o_proj): lora.Linear with the same structure as q_proj
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): LlamaMLP(
              (gate_proj), (up_proj): lora.Linear over Linear(in_features=4096, out_features=11008, bias=False), r=8
              (down_proj): lora.Linear over Linear(in_features=11008, out_features=4096, bias=False), r=8
              (act_fn): SiLU()
            )
            (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
          )
        )
        (norm): LlamaRMSNorm((4096,), eps=1e-05)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (lm_head): Linear(in_features=4096, out_features=32008, bias=False)
    )
  )
)
[2024-11-21 14:41:03,977] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(repeated by the other three ranks)
***** Running training *****
  Num examples = 8
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 4
  Total optimization steps = 1
  Number of trainable parameters = 4,997,120
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
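As a sanity check on those numbers: with r=8 adapters on all seven linear projections of each of the 32 decoder layers, the 19,988,480 adds up exactly, and the Trainer's "Number of trainable parameters = 4,997,120" is consistent with each of the 4 FSDP ranks holding a quarter of the flattened adapter weights. The arithmetic, with dimensions read off the model dump above:

```python
# LoRA adds r*(in + out) weights per wrapped Linear (lora_A: in x r, lora_B: r x out).
r, layers = 8, 32
attn = 4 * r * (4096 + 4096)                        # q_proj, k_proj, v_proj, o_proj
mlp = 2 * r * (4096 + 11008) + r * (11008 + 4096)   # gate/up, then down projection
print(layers * (attn + mlp))        # 19988480 -> "trainable params" above
print(layers * (attn + mlp) // 4)   # 4997120  -> per-rank count in the Trainer banner
```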
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.18.7
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
  0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/train.py", line 155, in <module>
    main(model_args, data_args, training_args)
  File "/workspace/train.py", line 139, in main
    trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py", line 434, in train
    output = super().train(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 1929, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2341, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/lib/python3.11/dist-packages/accelerate/optimizer.py", line 170, in step
    self.optimizer.step(closure)
  File "/usr/local/lib/python3.11/dist-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
    return func.__get__(opt, opt.__class__)(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/optim/optimizer.py", line 487, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/optim/optimizer.py", line 91, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/optim/adamw.py", line 220, in step
    adamw(
  File "/usr/local/lib/python3.11/dist-packages/torch/optim/optimizer.py", line 154, in maybe_fallback
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/optim/adamw.py", line 782, in adamw
    func(
  File "/usr/local/lib/python3.11/dist-packages/torch/optim/adamw.py", line 480, in _multi_tensor_adamw
    grouped_tensors = Optimizer._group_tensors_by_device_and_dtype(
  File "/usr/local/lib/python3.11/dist-packages/torch/optim/optimizer.py", line 516, in _group_tensors_by_device_and_dtype
    return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)  # type: ignore[return-value, arg-type]
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_foreach_utils.py", line 37, in _group_tensors_by_device_and_dtype
    return torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32/64 notwithstanding
([rank0] through [rank3] all abort with the identical traceback and RuntimeError.)
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /workspace/wandb/offline-run-20241121_144135-q8ddduki
wandb: Find logs at: wandb/offline-run-20241121_144135-q8ddduki/logs
W1121 14:41:50.206000 1838 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1904 closing signal SIGTERM
W1121 14:41:50.210000 1838 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1906 closing signal SIGTERM
W1121 14:41:50.210000 1838 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1907 closing signal SIGTERM
E1121 14:41:50.590000 1838 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 1905) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1093, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-21_14:41:50
  host      : c00dfa0c2ca1
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1905)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
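The `_multi_tensor_adamw` failure means the foreach AdamW path found a parameter whose gradient or optimizer state lives on a different device or dtype than the parameter itself, which with FSDP usually points at a module left behind on CPU or loaded in a second dtype. A small sketch for localizing it (`report_mismatches` is a hypothetical helper, not from the repo; the `foreach=False` fallback is an assumption to try while debugging, not a verified fix for this issue):

```python
# Walk the optimizer's param groups and print any param whose grad disagrees
# with it on device or dtype; these are the tensors the multi-tensor AdamW
# grouping step refuses to mix.
import torch

def report_mismatches(optimizer: torch.optim.Optimizer) -> None:
    for group in optimizer.param_groups:
        for p in group["params"]:
            if p.grad is None:
                continue
            if p.grad.device != p.device or p.grad.dtype != p.dtype:
                print(f"mismatch: shape={tuple(p.shape)} param on {p.device}/{p.dtype}, "
                      f"grad on {p.grad.device}/{p.grad.dtype}")

# Fallback while debugging: the single-tensor path skips the grouping entirely.
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, foreach=False)
```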
Running command: accelerate launch --config_file configs/fsdp_config.yaml train.py --seed 100 --model_name_or_path meta-llama/Llama-2-7b-hf --dataset_name smangrul/ultrachat-10k-chatml --chat_template_format chatml --add_special_tokens False --append_concat_token False --splits train,test --max_seq_len 2048 --num_train_epochs 1 --logging_steps 5 --log_level info --logging_strategy steps --eval_strategy epoch --save_strategy epoch --push_to_hub --hub_private_repo True --hub_strategy every_save --bf16 True --packing True --learning_rate 1e-4 --lr_scheduler_type cosine --weight_decay 1e-4 --warmup_ratio 0.0 --max_grad_norm 1.0 --output_dir mistral-sft-lora-fsdp --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --gradient_accumulation_steps 4 --gradient_checkpointing True --use_reentrant False --dataset_text_field content --use_flash_attn False --use_peft_lora True --lora_r 8 --lora_alpha 16 --lora_dropout 0.1 --lora_target_modules all-linear --use_4bit_quantization False config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 609/609 [00:00<00:00, 1.79MB/s] model.safetensors.index.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.8k/26.8k [00:00<00:00, 48.8MB/s] model-00001-of-00002.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.98G/9.98G [00:47<00:00, 212MB/s] model-00002-of-00002.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.50G/3.50G [00:17<00:00, 195MB/s] Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:05<00:00, 32.64s/it] Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:05<00:00, 32.62s/it] Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:05<00:00, 32.62s/it] Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:05<00:00, 32.64s/it] Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 6.88it/s] Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 7.20it/s] Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 6.14it/s] generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 188/188 [00:00<00:00, 572kB/s] tokenizer_config.json: 
tokenizer_config.json: 100%|████| 776/776 [00:00<00:00, 3.61MB/s]
tokenizer.model: 100%|████| 500k/500k [00:00<00:00, 112MB/s]
tokenizer.json: 100%|████| 1.84M/1.84M [00:00<00:00, 22.3MB/s]
special_tokens_map.json: 100%|████| 414/414 [00:00<00:00, 1.26MB/s]
README.md: 100%|████| 524/524 [00:00<00:00, 1.72MB/s]
train-00000-of-00001.parquet: 100%|████| 35.2M/35.2M [00:00<00:00, 42.4MB/s]
test-00000-of-00001.parquet: 100%|████| 7.08M/7.08M [00:00<00:00, 40.0MB/s]
Generating train split: 100%|████| 10000/10000 [00:00<00:00, 35997.11 examples/s]
Generating test split: 100%|████| 2000/2000 [00:00<00:00, 35983.77 examples/s]
Map: 100%|████| 10/10 [00:00<00:00, 230.91 examples/s]
Map: 100%|████| 10/10 [00:00<00:00, 295.64 examples/s]
Map: 100%|████| 10/10 [00:00<00:00, 2201.27 examples/s]
Size of the train set: 10. Size of the validation set: 10
A sample of train dataset: {'content': "<|im_start|>user\nThese instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?\nOn your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!\nYour Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.\nDoes this feature apply to all sections of the theme or just specific ones as listed in the text material?<|im_end|>\n<|im_start|>assistant\nThis feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.<|im_end|>\n<|im_start|>user\nCan you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?<|im_end|>\n<|im_start|>assistant\nSure, here are the steps to enable the secondary image hover feature on your Collection pages and Featured Collections sections:\n\n1. Log in to your Shopify account and go to your Online Store.\n2. Click on Customize theme for the section-based theme you are using.\n3. Navigate to the Collection pages or Featured Collections section where you want to enable the secondary image hover feature.\n4. With the section open, click on the section's setting (gear) icon in the top-left corner.\n5. In the settings panel that appears, look for an option labeled 'Image display' or 'Image hover'.\n6. If available, select 'Show secondary image on hover'.\n7. Save the changes and preview the Collection/Featured Collection page to see the effect.\n\nIf you're having trouble finding the setting, the best thing to do is refer to your theme's documentation, since the location and labeling of settings can vary between themes.<|im_end|>\n<|im_start|>user\nCan you provide me with a link to the documentation for my theme?<|im_end|>\n<|im_start|>assistant\nI don't have access to your store's theme information. However, you can usually find the documentation for your theme by going to the shopify theme store, finding your theme and clicking on the 'support' link located in the bottom right corner of the page. Alternatively, you can do a google search for the name of your theme followed by 'documentation' or 'user guide'.<|im_end|>\n<|im_start|>user\nCan you confirm if this feature also works for the Quick Shop section of my theme?<|im_end|>\n<|im_start|>assistant\nThe secondary image hover feature may or may not work for your Quick Shop section, depending on the configuration of your theme. Some themes include this feature in the Quick Shop section by default, while others may require additional customization. To check if this feature is available for the Quick Shop section of your theme, follow these steps:\n\n1. Go to the Quick Shop section where you would like to enable the feature. 2. Click on the Quick Shop settings icon (gear icon) and look for 'Image display' or 'Image hover'. 3. If available, select 'Show secondary image on hover'. 4. Save the changes. If this option is not available in your Quick Shop section settings, you may need to reach out to your theme developer for assistance with customizing your Quick Shop section to include this feature.<|im_end|>\n"}
Map: 100%|████| 10/10 [00:00<00:00, 2222.97 examples/s]
[The same "Size of the train set / Size of the validation set / A sample of train dataset" block is printed by each of the other three ranks; the three identical copies are omitted here.]
Map: 100%|████| 10/10 [00:00<00:00, 311.74 examples/s]
Loading checkpoint shards:  50%|██        | 1/2 [00:08<00:08, 8.60s/it]
[rank3]:[W1121 14:38:20.995918785 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank-to-GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank2]:[W1121 14:38:20.000367175 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] (same warning for rank 2, GPU 2)
[rank1]:[W1121 14:38:20.005100558 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] (same warning for rank 1, GPU 1)
Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00, 5.53s/it]
Generating train split: 8 examples [00:00, 278.11 examples/s]
Generating train split: 8 examples [00:00, 309.70 examples/s]
[rank0]:[W1121 14:38:29.329199115 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] (same warning for rank 0, GPU 0)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
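The ProcessGroupNCCL warnings above ask for an explicit rank-to-GPU mapping. A minimal sketch of the fix the warning itself suggests (hypothetical placement in the training script's setup code; this is not part of the original train.py):

```python
import os

import torch
import torch.distributed as dist

# Bind each process to its GPU before the first collective so that
# barrier() does not have to guess the device.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# device_id (available in recent PyTorch releases) makes the mapping
# explicit and silences the "using GPU N to perform barrier" warning.
dist.init_process_group(backend="nccl", device_id=torch.device(f"cuda:{local_rank}"))
```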
trainable params: 19,988,480 || all params: 6,758,469,632 || trainable%: 0.2958
[the same trainable-params line is printed once per rank; three duplicates omitted]
Using auto half precision backend

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32008, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear( [identical structure to q_proj] )
              (v_proj): lora.Linear( [identical structure to q_proj] )
              (o_proj): lora.Linear( [identical structure to q_proj] )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): LlamaMLP(
              (gate_proj): lora.Linear( [same LoRA structure; base_layer 4096 -> 11008, lora_A 4096 -> 8, lora_B 8 -> 11008] )
              (up_proj): lora.Linear( [same LoRA structure; base_layer 4096 -> 11008, lora_A 4096 -> 8, lora_B 8 -> 11008] )
              (down_proj): lora.Linear( [same LoRA structure; base_layer 11008 -> 4096, lora_A 11008 -> 8, lora_B 8 -> 4096] )
              (act_fn): SiLU()
            )
            (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
          )
        )
        (norm): LlamaRMSNorm((4096,), eps=1e-05)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (lm_head): Linear(in_features=4096, out_features=32008, bias=False)
    )
  )
)
trainable params: 19,988,480 || all params: 6,758,469,632 || trainable%: 0.2958

[2024-11-21 14:38:31,860] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[the same ds_accelerator line is logged by each of the four processes; duplicates omitted]
df: /root/.triton/autotune: No such file or directory
***** Running training *****
  Num examples = 8
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 4
  Total optimization steps = 1
  Number of trainable parameters = 4,997,120
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.18.7
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
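As the log itself hints, the interactive W&B prompt can be avoided by configuring the environment up front. Both variables below come from the hints in the log above:

```python
import os

# Turn W&B logging off entirely, per the "WANDB_DISABLED" hint above:
os.environ["WANDB_DISABLED"] = "true"
# ...or keep logging locally without the interactive prompt:
os.environ["WANDB_MODE"] = "offline"
```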
  0%|          | 0/1 [00:00<?, ?it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/train.py", line 155, in <module>
[rank1]:     main(model_args, data_args, training_args)
[rank1]:   File "/workspace/train.py", line 139, in main
[rank1]:     trainer.train(resume_from_checkpoint=checkpoint)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py", line 434, in train
[rank1]:     output = super().train(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 1929, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2341, in _inner_training_loop
[rank1]:     self.optimizer.step()
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/accelerate/optimizer.py", line 170, in step
[rank1]:     self.optimizer.step(closure)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
[rank1]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/optimizer.py", line 487, in wrapper
[rank1]:     out = func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/optimizer.py", line 91, in _use_grad
[rank1]:     ret = func(self, *args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/adamw.py", line 220, in step
[rank1]:     adamw(
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/optimizer.py", line 154, in maybe_fallback
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/adamw.py", line 782, in adamw
[rank1]:     func(
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/adamw.py", line 480, in _multi_tensor_adamw
[rank1]:     grouped_tensors = Optimizer._group_tensors_by_device_and_dtype(
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/optimizer.py", line 516, in _group_tensors_by_device_and_dtype
[rank1]:     return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)  # type: ignore[return-value, arg-type]
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/utils/_foreach_utils.py", line 37, in _group_tensors_by_device_and_dtype
[rank1]:     return torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
[rank1]: RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32/64 notwithstanding

[Ranks 0, 2, and 3 abort with identical tracebacks, and one un-prefixed copy of the same traceback is also printed; the four duplicate copies are omitted here.]

wandb:
wandb: You can sync this run to the cloud by running:
wandb:   wandb sync /workspace/wandb/offline-run-20241121_143903-enf1o3qd
wandb: Find logs at: wandb/offline-run-20241121_143903-enf1o3qd/logs
W1121 14:39:25.634000 1364 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1430 closing signal SIGTERM
W1121 14:39:25.635000 1364 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1431 closing signal SIGTERM
W1121 14:39:25.635000 1364 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1433 closing signal SIGTERM
E1121 14:39:25.977000 1364 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 1432) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1093, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-21_14:39:25
  host      : c00dfa0c2ca1
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1432)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Package  Version
-------  -------
absl-py  2.1.0
accelerate  0.33.0
aiohappyeyeballs  2.4.3
aiohttp  3.11.6
aiosignal  1.3.1
annotated-types  0.7.0
anyio  4.6.0
argon2-cffi  23.1.0
argon2-cffi-bindings  21.2.0
arrow  1.3.0
asttokens  2.4.1
async-lru  2.0.4
attrs  24.2.0
babel  2.16.0
beautifulsoup4  4.12.3
bitsandbytes  0.44.1
bleach  6.1.0
blinker  1.4
certifi  2024.8.30
cffi  1.17.1
charset-normalizer  3.3.2
click  8.1.7
comm  0.2.2
contourpy  1.3.1
cryptography  3.4.8
cut-cross-entropy  24.11.4
cycler  0.12.1
datasets  3.1.0
datatrove  0.3.0
dbus-python  1.2.18
debugpy  1.8.5
decorator  5.1.1
deepspeed  0.15.4
defusedxml  0.7.1
Deprecated  1.2.15
dill  0.3.8
distro  1.7.0
docker-pycreds  0.4.0
docstring_parser  0.16
einops  0.8.0
entrypoints  0.4
evaluate  0.4.3
executing  2.1.0
fastjsonschema  2.20.0
filelock  3.13.1
flash-attn  2.7.0.post2
fonttools  4.55.0
fqdn  1.5.1
frozenlist  1.5.0
fsspec  2024.2.0
gitdb  4.0.11
GitPython  3.1.43
grpcio  1.68.0
h11  0.14.0
hf_transfer  0.1.8
hjson  3.1.0
httpcore  1.0.5
httplib2  0.20.2
httpx  0.27.2
huggingface-hub  0.26.2
humanize  4.11.0
idna  3.10
importlib-metadata  4.6.4
ipykernel  6.29.5
ipython  8.27.0
ipython-genutils  0.2.0
ipywidgets  8.1.5
isoduration  20.11.0
jedi  0.19.1
jeepney  0.7.1
Jinja2  3.1.3
joblib  1.4.2
json5  0.9.25
jsonpointer  3.0.0
jsonschema  4.23.0
jsonschema-specifications  2023.12.1
jupyter-archive  3.4.0
jupyter_client  7.4.9
jupyter_contrib_core  0.4.2
jupyter_contrib_nbextensions  0.7.0
jupyter_core  5.7.2
jupyter-events  0.10.0
jupyter-highlight-selected-word  0.2.0
jupyter-lsp  2.2.5
jupyter_nbextensions_configurator  0.6.4
jupyter_server  2.14.2
jupyter_server_terminals  0.5.3
jupyterlab  4.2.5
jupyterlab_pygments  0.3.0
jupyterlab_server  2.27.3
jupyterlab_widgets  3.0.13
keyring  23.5.0
kiwisolver  1.4.7
launchpadlib  1.10.16
lazr.restfulclient  0.14.4
lazr.uri  1.0.6
loguru  0.7.2
lxml  5.3.0
Markdown  3.7
markdown-it-py  3.0.0
MarkupSafe  2.1.5
matplotlib  3.9.2
matplotlib-inline  0.1.7
mdurl  0.1.2
mistune  3.0.2
more-itertools  8.10.0
mpmath  1.3.0
msgpack  1.1.0
multidict  6.1.0
multiprocess  0.70.16
nbclassic  1.1.0
nbclient  0.10.0
nbconvert  7.16.4
nbformat  5.10.4
nest-asyncio  1.6.0
networkx  3.2.1
ninja  1.11.1.1
nltk  3.9.1
notebook  6.5.5
notebook_shim  0.2.4
numpy  1.26.3
nvidia-cublas-cu12  12.4.5.8
nvidia-cuda-cupti-cu12  12.4.127
nvidia-cuda-nvrtc-cu12  12.4.127
nvidia-cuda-runtime-cu12  12.4.127
nvidia-cudnn-cu12  9.1.0.70
nvidia-cufft-cu12  11.2.1.3
nvidia-curand-cu12  10.3.5.147
nvidia-cusolver-cu12  11.6.1.9
nvidia-cusparse-cu12  12.3.1.170
nvidia-ml-py  12.560.30
nvidia-nccl-cu12  2.21.5
nvidia-nvjitlink-cu12  12.4.127
nvidia-nvtx-cu12  12.4.127
oauthlib  3.2.0
overrides  7.7.0
packaging  24.1
pandas  2.2.3
pandocfilters  1.5.1
parso  0.8.4
peft  0.13.3.dev0
pexpect  4.9.0
pillow  10.2.0
pip  24.2
platformdirs  4.3.6
prometheus_client  0.21.0
prompt_toolkit  3.0.47
propcache  0.2.0
protobuf  3.20.3
psutil  6.0.0
ptyprocess  0.7.0
pure_eval  0.2.3
py-cpuinfo  9.0.0
pyarrow  18.0.0
pycparser  2.22
pydantic  2.10.0
pydantic_core  2.27.0
PyGithub  2.5.0
Pygments  2.18.0
PyGObject  3.42.1
PyJWT  2.10.0
PyNaCl  1.5.0
pyparsing  2.4.7
python-apt  2.4.0+ubuntu4
python-dateutil  2.9.0.post0
python-json-logger  2.0.7
pytz  2024.2
PyYAML  6.0.2
pyzmq  24.0.1
referencing  0.35.1
regex  2024.11.6
requests  2.32.3
rfc3339-validator  0.1.4
rfc3986-validator  0.1.1
rich  13.9.4
rpds-py  0.20.0
safetensors  0.4.5
scikit-learn  1.5.2
scipy  1.14.1
SecretStorage  3.3.1
Send2Trash  1.8.3
sentencepiece  0.2.0
sentry-sdk  2.18.0
setproctitle  1.3.4
setuptools  75.1.0
shtab  1.7.1
six  1.16.0
smmap  5.0.1
sniffio  1.3.1
soupsieve  2.6
stack-data  0.6.3
sympy  1.13.1
tensorboard  2.18.0
tensorboard-data-server  0.7.2
terminado  0.18.1
threadpoolctl  3.5.0
tiktoken  0.8.0
tinycss2  1.3.0
tokenizers  0.19.1
torch  2.5.1
torchaudio  2.4.1+cu124
torchvision  0.19.1+cu124
tornado  6.4.1
tqdm  4.67.0
traitlets  5.14.3
transformers  4.44.2
triton  3.1.0
trl  0.11.0
types-python-dateutil  2.9.0.20240906
typing_extensions  4.12.2
tyro  0.9.1
tzdata  2024.2
unsloth  2024.11.8
unsloth_zoo  2024.11.6
uri-template  1.3.0
urllib3  2.2.3
wadllib  1.3.6
wandb  0.18.7
wcwidth  0.2.13
webcolors  24.8.0
webencodings  0.5.1
websocket-client  1.8.0
Werkzeug  3.1.3
wheel  0.44.0
widgetsnbextension  4.0.13
wrapt  1.16.0
xformers  0.0.28.post3
xxhash  3.5.0
yarl  1.17.2
zipp  1.0.0
@BenjaminBossan I run it on Runpod, using the runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image.
After installing the pip requirements, PyTorch gets upgraded to 2.5.1.
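Since the base image ships torch 2.4.0 and the requirements install upgrades it to 2.5.1, one quick experiment is to pin torch back to the image's version and retry. This is an assumption to test, not a confirmed fix; the cu124 index is chosen to match the image's CUDA 12.4:

```sh
pip install "torch==2.4.0" --index-url https://download.pytorch.org/whl/cu124
```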
It worked for me! Thank you @BenjaminBossan!
@vrancurel Can you share your environment details, including the Python version, CUDA version, and pip list?
Python 3.10, CUDA 12.4, and exactly the same package versions as you suggested.
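For anyone else comparing environments in this thread, a small sketch that prints the library versions most relevant to this issue:

```python
import accelerate
import peft
import torch
import transformers
import trl

# Print the libraries most relevant to this issue, one per line.
for mod in (torch, transformers, accelerate, peft, trl):
    print(f"{mod.__name__}=={mod.__version__}")
```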
I want to run the SFT example and I get some errors. Can you help me find the problem?
I run run_peft_fsdp.sh with
--model_name_or_path "meta-llama/Llama-2-7b-hf"
(I used a smaller model, just for testing purposes). I use the
pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
image. Here are my environment details and errors:
Packages