huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support.

Docs: https://huggingface.co/docs/accelerate
License: Apache License 2.0 · 7.97k stars · 970 forks
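A minimal sketch of the pattern the description refers to, following the documented Accelerator API; the model, optimizer, and dataloader here are placeholders built elsewhere:

```python
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the device/distributed config from the environment

# Placeholders: build your own model, optimizer, and train_dataloader first.
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

for batch in train_dataloader:
    loss = model(**batch).loss  # assumes a Hugging Face-style model output
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```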
Issues · sorted by newest
Accelerate + FSDP plugin hangs after saving an intermediate checkpoint · #3250 · opened 6 hours ago by leeruibin · 1 comment
examples/inference/pippy/llama.py: assertion error about graphs · #3249 · opened 10 hours ago by 685Degrees · 0 comments
Fix: Resolve #3060 · #3248 · opened 10 hours ago by wejoncy · 0 comments
Use `numpy._core` instead of `numpy.core` · #3247 · closed 21 hours ago by qgallouedec · 4 comments
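#3247 addresses the NumPy 2.0 rename of `numpy.core` to `numpy._core`. A hedged compatibility sketch of the general technique, not necessarily the exact fix that was merged:

```python
# Import shim for the NumPy 2.0 rename of numpy.core to numpy._core.
# Sketch only; the symbols a given codebase actually needs may differ.
try:
    from numpy import _core as np_core  # NumPy >= 2.0
except ImportError:
    from numpy import core as np_core   # NumPy < 2.0
```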
[`data_loader`] Optionally also propagate set_epoch to batch sampler · #3246 · closed 1 day ago by tomaarsen · 3 comments
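#3246 extends `set_epoch` propagation to the batch sampler. For context, accelerate's prepared dataloaders expose a `set_epoch` hook so shuffling is reseeded each epoch; a minimal usage sketch with illustrative names:

```python
# `train_dataloader` is assumed to come from accelerator.prepare(...);
# `num_epochs` is illustrative.
for epoch in range(num_epochs):
    if hasattr(train_dataloader, "set_epoch"):
        train_dataloader.set_epoch(epoch)  # reseed shuffling for this epoch
    for batch in train_dataloader:
        ...
```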
RuntimeError: The server socket has failed to listen on any local network address · #3245 · closed 3 days ago by liujf69 · 1 comment
Fix: get_balanced_memory when using multiple GPUs with small models or quantized models with a large vocabulary · #3244 · opened 4 days ago by MekkCyber · 1 comment
🚀 Feature Request: Improve `stateful_dataloader` by passing `snapshot_every_n_steps` · #3243 · opened 4 days ago by yzhangcs · 0 comments
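#3243 asks accelerate to forward `snapshot_every_n_steps` to torchdata's `StatefulDataLoader`, which controls how often resumable state is captured. A sketch of the underlying torchdata API; the dataset and value here are illustrative:

```python
from torchdata.stateful_dataloader import StatefulDataLoader

dataset = list(range(128))  # any map-style dataset works here

# Larger snapshot_every_n_steps means cheaper iteration but coarser resume points.
loader = StatefulDataLoader(dataset, batch_size=8, snapshot_every_n_steps=16)
state = loader.state_dict()    # capture mid-epoch progress
loader.load_state_dict(state)  # resume from the captured point
```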
Wrong epoch when resuming from checkpoint · #3242 · opened 4 days ago by xiechun-tsukuba · 0 comments
DeepSpeed inference · #3241 · opened 5 days ago by Reginald-L · 0 comments
Communication problems with DeepSpeed ZeRO-3 · #3240 · closed 4 days ago by Reginald-L · 0 comments
OOM error when training a Llama 7B model with the Accelerate FSDP setting · #3239 · opened 1 week ago by JlPang863 · 1 comment
DeepSpeed ZeRO-3 model saving · #3238 · closed 6 days ago by Reginald-L · 2 comments
slurmstepd: error: execve(): accelerate: No such file or directory · #3237 · closed 4 days ago by huiyang865 · 3 comments
Enable `find_executable_batch_size` on XPU · #3236 · closed 2 days ago by faaany · 2 comments
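#3236 enables `find_executable_batch_size` on XPU. The utility retries a training function, halving the batch size after each out-of-memory failure; a minimal sketch of the documented decorator pattern:

```python
from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=128)
def train(batch_size):
    # Called again with a halved batch_size after every OOM failure.
    print(f"trying batch size {batch_size}")

train()
```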
[docs] update code in tracking documentation · #3235 · closed 1 day ago by faaany · 1 comment
[docs] add XPU to profiler documentation and fix minor bugs · #3234 · closed 1 day ago by faaany · 2 comments
Logic bug: init handler kwargs used for the grad scaler in FP8 training (accelerate/accelerator.py) · #3233 · opened 1 week ago by immortalCO · 0 comments
FSDP checkpoint saving leads to NCCL WARN Cuda failure 2 'out of memory' · #3232 · opened 1 week ago by edchengg · 0 comments
[RFC] Support FSDP2 · #3231 · opened 1 week ago by kmehant · 1 comment
Error while fine-tuning with PEFT, LoRA, accelerate, SFTConfig, and SFTTrainer · #3230 · opened 1 week ago by Isdriai · 4 comments
Fix Slurm multinode example · #3229 · opened 1 week ago by ffrancesco94 · 0 comments
[docs] update set_seed · #3228 · opened 2 weeks ago by faaany · 3 comments
[docs] add instructions for installing bnb on non-CUDA devices · #3227 · closed 1 day ago by faaany · 1 comment
Handle the case when `_tied_weights_keys` is not an attribute · #3226 · closed 1 day ago by fabianlim · 2 comments
torch.cuda.is_available() is false when running multi-GPU inference with accelerate launch · #3225 · closed 3 days ago by paulgekeler · 1 comment
"mat2 must be a matrix" error when fine-tuning DreamBooth Flux with FSDP · #3224 · opened 2 weeks ago by weixiong-ur · 2 comments
Remove hook for bnb 4-bit · #3223 · closed 6 days ago by SunMarc · 3 comments
Add case-insensitive parsing of bool environment variables · #3222 · opened 2 weeks ago by wizeng23 · 0 comments
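#3222 proposes case-insensitive parsing of boolean environment variables, so `TRUE`, `True`, and `true` all behave the same. A generic sketch of the technique, not accelerate's actual parser:

```python
import os

def str_to_bool(value: str) -> bool:
    # Case-insensitive parse accepting common truthy spellings.
    return value.strip().lower() in {"1", "true", "yes", "on"}

debug = str_to_bool(os.environ.get("ACCELERATE_DEBUG_MODE", "false"))
```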
[docs] fix typo · #3221 · opened 2 weeks ago by faaany · 2 comments
[docs] use real path for `checkpoint` · #3220 · opened 2 weeks ago by faaany · 2 comments
Ensure explicit output `dtype` for `pad_across_processes` · #3219 · opened 2 weeks ago by mariusarvinte · 0 comments
Incorrect type in output of `utils.pad_across_processes` when input is `torch.bool` · #3218 · opened 2 weeks ago by mariusarvinte · 1 comment
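#3218 reports that `utils.pad_across_processes` can return the wrong dtype when the input is `torch.bool`, and #3219 proposes the fix. Until it lands, one hedged workaround sketch is to round-trip through an integer dtype:

```python
import torch
from accelerate.utils import pad_across_processes

mask = torch.tensor([True, False, True])
# Workaround sketch (see #3218): pad as uint8, then cast back to bool.
padded = pad_across_processes(mask.to(torch.uint8), dim=0, pad_index=0).to(torch.bool)
```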
Fix `align_module_device`, ensure only CPU tensors for `get_state_dict_offloaded_model` · #3217 · closed 2 weeks ago by kylesayrs · 1 comment
PyPI-published Accelerate==1.1.0 is missing a source distribution · #3216 · opened 2 weeks ago by helloworld1 · 3 comments
Milad · #3215 · closed 2 weeks ago by Milad335t · 0 comments
ConnectionError: Tried to launch distributed communication on port `29401`, but another process is utilizing it. Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`. · #3214 · opened 2 weeks ago by qinchangchang · 0 comments
Create `_preprare_fsdp` to pre-prepare FSDP model training · #3213 · opened 2 weeks ago by eljandoubi · 2 comments
Timeout at validation step · #3212 · closed 2 weeks ago by qmin2 · 1 comment
Fix load_state_dict for NPU · #3211 · opened 2 weeks ago by statelesshz · 1 comment
How can I convert ZeRO-0 DeepSpeed weights into an fp32 model checkpoint? · #3210 · opened 2 weeks ago by liming-ai · 0 comments
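On #3210: DeepSpeed ships a consolidation utility (and writes a `zero_to_fp32.py` script into each checkpoint directory) for turning ZeRO checkpoints into a single fp32 state dict. A hedged sketch assuming a standard DeepSpeed checkpoint layout; whether it covers stage 0, where weights are not partitioned, is exactly what the issue asks:

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Consolidates ZeRO shards into a plain fp32 state dict on CPU.
# "checkpoints/step_1000" is an illustrative path.
state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoints/step_1000")
```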
The optimizer is not receiving the FSDP model parameters · #3209 · opened 3 weeks ago by eljandoubi · 6 comments
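#3209 touches a known FSDP pitfall: wrapping flattens parameters, so an optimizer built from the unwrapped model can end up holding stale tensors. The usual guidance is to prepare the model first and build the optimizer from the prepared model's parameters (or pass both to `prepare` together). A sketch with a stand-in model:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # assumes FSDP is enabled in the accelerate config

model = torch.nn.Linear(8, 8)        # stand-in for a real model
model = accelerator.prepare(model)   # FSDP wraps and flattens parameters here
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # built from wrapped params
optimizer = accelerator.prepare(optimizer)
```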
Multi-node inference · #3208 · opened 3 weeks ago by DLCM-wrz · 0 comments
Typo fix in big_modeling.py · #3207 · closed 3 weeks ago by a-r-r-o-w · 1 comment
Multi-node, multi-GPU example fails · #3206 · opened 3 weeks ago by ffrancesco94 · 9 comments
Type of Accelerator.distributed_type might be wrong · #3205 · closed 3 weeks ago by ffrancesco94 · 4 comments
[Utils] `align_module_device` · #3204 · closed 3 weeks ago by kylesayrs · 2 comments
Command-line arguments related to DeepSpeed for `accelerate launch` do not override those in `default_config.yaml` · #3203 · opened 3 weeks ago by JdbermeoUZH · 0 comments
Problem with metrics calculation and the dataloader · #3202 · opened 3 weeks ago by gssriram · 0 comments
What should I pass to the fsdp_config.fsdp_transformer_layer_cls_to_wrap argument in the YAML file? · #3201 · closed 2 weeks ago by ShengYun-Peng · 3 comments
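On #3201: the key takes the class name of the model's repeating transformer block, which FSDP uses as its auto-wrap boundary. A hedged config sketch for a Llama-style model; the class name depends on your architecture:

```yaml
fsdp_config:
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```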