I have yet to get our training component test to run on Sunspot. A few things that could be the cause or a fix:
[x] We are using PyTorch 2 on Sunspot, but Difflinker was built against PyTorch 1.13. Not the answer: training works on CPU on Sunspot with PyTorch 2, and on CUDA on my desktop with PyTorch 2.
[x] My (feeble) attempt at an XPU wrapper is insufficient. I got the same error with my version and with Corey's (better) implementation.
[x] Intel's or Corey's fork of Lightning could work
[ ] There is a missing .to(device) somewhere in the workflow
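
One way to chase the last item might be a small checker that walks a batch (or a dict of parameters) and reports anything whose `.device` is not on the expected device family. This is only a rough sketch: `device_type` and `find_device_mismatches` are hypothetical helper names, not part of Difflinker, Lightning, or PyTorch, and it only assumes the objects expose a `.device` attribute the way tensors do.

```python
def device_type(dev):
    """Return the device family ('cpu', 'cuda', 'xpu') for a device object or a string."""
    # Device objects such as torch.device expose .type; plain strings look like 'xpu:0'.
    return getattr(dev, "type", None) or str(dev).split(":")[0]


def find_device_mismatches(obj, expected, path="root"):
    """Collect (path, device) pairs for every item that is not on the `expected` device family."""
    mismatches = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            mismatches += find_device_mismatches(value, expected, f"{path}[{key!r}]")
    elif isinstance(obj, (list, tuple)):
        for i, value in enumerate(obj):
            mismatches += find_device_mismatches(value, expected, f"{path}[{i}]")
    elif hasattr(obj, "device"):
        # Anything tensor-like that carries a .device attribute is checked here.
        if device_type(obj.device) != device_type(expected):
            mismatches.append((path, str(obj.device)))
    return mismatches
```

In a training step this could be called on the incoming batch, and on `dict(model.named_parameters())` and `dict(model.named_buffers())`, with `expected="xpu"`. An empty list would suggest the missing `.to(device)` is in something created inside the forward pass rather than in the inputs or the model state.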