intel / torch-xpu-ops

Apache License 2.0

CTCLoss: Fix the hang issue caused by barrier divergence #1087

Closed xytintel closed 6 days ago

xytintel commented 1 week ago

Resolves https://github.com/pytorch/pytorch/issues/140781

dvrogozh commented 1 week ago

With this PR, the issue reported in https://github.com/pytorch/pytorch/issues/140781 is gone. The HF tests for the hubert model pass for me: 156 passed, 257 skipped, 3 warnings in 52.34s.

xytintel commented 6 days ago

> I trust your decision that the barrier really is not needed. Other than that, the change works to fix the issue I noticed. Consider extending test coverage to cover the missed case.

We still need barriers. The hang occurs because some threads exit prematurely, before reaching the barrier, which prevents the barrier counter from resetting to zero. We are now planning to use a named barrier to solve this problem.
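To illustrate the failure mode described above, here is a minimal CPU-side sketch in Python. It is an analogy only, not the SYCL kernel from this PR: `run_kernel`, `early_exit`, and the timeout are hypothetical names introduced for the example, and `threading.Barrier` stands in for a work-group barrier. When some threads return before the barrier, the remaining threads never see the full arrival count. On a GPU that is a hang; here a timeout turns it into a reportable "stuck" state.

```python
import threading


def run_kernel(num_threads, num_items, early_exit, timeout=0.5):
    """Simulate one work-group: `num_threads` threads, `num_items` valid items.

    Hypothetical helper for illustration; not part of torch-xpu-ops.
    """
    barrier = threading.Barrier(num_threads)  # expects every thread to arrive
    results = [None] * num_threads

    def worker(tid):
        active = tid < num_items  # threads past the end have no work to do
        if early_exit and not active:
            # Divergent path: this thread leaves before the barrier,
            # so the barrier never sees all `num_threads` arrivals.
            results[tid] = "exited"
            return
        # Convergent path: inactive threads would skip the work here,
        # but every thread still reaches the barrier.
        try:
            barrier.wait(timeout=timeout)
            results[tid] = "passed"
        except threading.BrokenBarrierError:
            # A real GPU barrier has no timeout; this would be a hang.
            results[tid] = "stuck"

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_kernel(4, 2, early_exit=True))
    print(run_kernel(4, 2, early_exit=False))
```

In this sketch the convergent variant keeps the barrier and guards only the work, which is one common way to avoid divergence. Whether the final fix in this PR takes that form or relies on a named barrier is not shown in this thread.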

dvrogozh commented 6 days ago

> We are now planning to use a named barrier to solve this problem.

I was told by the SYCL folks that named barriers might have performance drawbacks on current hardware generations. Be careful to verify performance.

dvrogozh commented 6 days ago

I verified the updated version. It addresses the reported issue.