huggingface / blog

Public repo for HF blog posts
https://hf.co/blog

DeepSpeed giving Assertion Error #1324

Open harik68 opened 1 year ago

harik68 commented 1 year ago

I am facing an issue when using DeepSpeed for fine-tuning the StarCoder model. I am following exactly the steps in the article Creating a Coding Assistant with StarCoder (section "Fine-tuning StarCoder with DeepSpeed ZeRO-3"), but I get the error “AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 256 != 4 * 8 * 1”. Some searching turned up an issue explaining the cause: [BUG] batch_size check failed with zero 2 (deepspeed v0.9.0) · Issue #3228 · microsoft/DeepSpeed · GitHub. However, even with the DeepSpeed version reported there as working (v0.9.0), I get the same error. I have tried different versions of deepspeed and accelerate but could not fix the issue. Does anyone have any suggestions? Thanks in advance.
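
For reference, the assertion that fails is DeepSpeed's batch-size consistency check: train_batch_size must equal micro_batch_per_gpu * gradient_accumulation_steps * world_size. A minimal sketch of the check with my numbers (variable names are mine, mirroring the error message):

```python
# DeepSpeed's consistency check on batch-related parameters:
# train_batch_size == micro_batch_per_gpu * gradient_acc_steps * world_size
train_batch_size = 256    # "train_batch_size" in the DeepSpeed config
micro_batch_per_gpu = 4   # "train_micro_batch_size_per_gpu"
gradient_acc_steps = 8    # "gradient_accumulation_steps"
world_size = 1            # number of processes DeepSpeed actually sees

expected = micro_batch_per_gpu * gradient_acc_steps * world_size
assert train_batch_size == expected, (
    "Check batch related parameters. train_batch_size is not equal to "
    "micro_batch_per_gpu * gradient_acc_step * world_size "
    f"{train_batch_size} != {micro_batch_per_gpu} * {gradient_acc_steps} * {world_size}"
)
```

Note that 4 * 8 * 1 = 32, not 256, while 256 = 4 * 8 * 8, so the config seems to assume 8 GPUs while DeepSpeed only sees a world size of 1.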

jaywongs commented 1 year ago

Same problem here, did you ever solve it?

pcuenca commented 1 year ago

cc @lewtun @philschmid

xbinglzh commented 1 year ago

Same problem here. Did you manage to solve it in the end? deepspeed 0.10.0, error message: AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 64 != 8 * 1 * 1

jaywongs commented 1 year ago

@xbinglzh Solved it by pinning transformers==4.29.2, but I don't think this fix applies to all situations where this problem occurs.
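
If pinning transformers does not work, another mitigation that is often suggested when launching through the HF Trainer is to set the batch-related fields in the DeepSpeed config to "auto", so the Trainer fills them in from its own arguments and they can never disagree. A sketch of such a config as a Python dict (the "auto" resolution is part of the transformers DeepSpeed integration; adapt the rest of the config to your setup):

```python
# ZeRO-3 DeepSpeed config sketch that defers batch bookkeeping to the
# HF Trainer: every "auto" is resolved from TrainingArguments at launch,
# keeping train_batch_size, the per-GPU micro batch size, gradient
# accumulation steps, and world size mutually consistent.
ds_config = {
    "zero_optimization": {"stage": 3},
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```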

BramVanroy commented 1 year ago

@pcuenca So, is this a transformers issue? Downgrading to transformers==4.29.2 did not help in my case.
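
Since the mismatches above imply a world size smaller than expected, a quick sanity check (a sketch, assuming a deepspeed or torchrun launch) is to confirm each process actually sees the expected distributed setup:

```python
import os
import torch
import torch.distributed as dist

# Diagnostic sketch: compare what the launcher exported with what the
# process actually sees. WORLD_SIZE resolving to 1 (or unset) on a
# multi-GPU node would explain "256 != 4 * 8 * 1" style failures.
print("WORLD_SIZE env:", os.environ.get("WORLD_SIZE"))
print("LOCAL_RANK env:", os.environ.get("LOCAL_RANK"))
print("CUDA devices visible:", torch.cuda.device_count())
if dist.is_available() and dist.is_initialized():
    print("torch.distributed world size:", dist.get_world_size())
```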

ethanyanjiali commented 9 months ago

What was the problematic version of transformers?