state-spaces / s4

Structured state space sequence models
Apache License 2.0

Memory Corruption Error in Kernel _setup_linear #56

Open ethanbar11 opened 2 years ago

ethanbar11 commented 2 years ago

Hey, I'm trying to use the forward_state function. From time to time, I get this error:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1

It is raised from:

File "/media/data2/ethan_baron/state-spaces-improv/src/models/sequence/ss/kernel.py", line 434, in _setup_linear
    R = torch.linalg.solve(R.to(Q_D), Q_D)  # (H r N)

That is, from these lines (433-436) in the NPLR kernel:

        try:
            R = torch.linalg.solve(R.to(Q_D), Q_D)  # (H r N)
        except torch._C._LinAlgError:
            R = torch.tensor(np.linalg.solve(R.to(Q_D).cpu(), Q_D.cpu())).to(Q_D)

For debugging, I changed these lines slightly, to:

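try:
    R = torch.linalg.solve(R.to(Q_D), Q_D)  # (H r N)
except Exception:  # catch any error while debugging, not only LinAlgError
    x1 = R.to(Q_D).cpu()  # coefficient matrix, moved to the CPU
    x2 = Q_D.cpu()  # right-hand side, moved to the CPU
    R = torch.tensor(np.linalg.solve(x1, x2)).to(Q_D)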

EDIT: Removed stacktrace (was quite unhelpful and long) and edited the code to be in code snippets.
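
The error message above suggests passing CUDA_LAUNCH_BLOCKING=1 so that the failing kernel is reported at the call that launched it. A minimal sketch of one way to do that from inside a Python script follows; it assumes the variable must be in the environment before CUDA is initialized, so it is set before torch would be imported. The helper name launch_blocking_enabled is hypothetical and exists only for illustration.

```python
import os

# Assumption: the CUDA runtime reads this variable when it initializes,
# so set it before importing torch or touching any GPU tensor.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def launch_blocking_enabled() -> bool:
    """Hypothetical helper: report whether synchronous launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


# The training or inference script would import torch only after this point,
# e.g. `import torch`, so that kernel launches run synchronously.
print("CUDA_LAUNCH_BLOCKING enabled:", launch_blocking_enabled())
```

Setting the variable on the command line when starting the script should have the same effect. Synchronous launches slow execution down, so this is probably best kept to debugging runs.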

albertfgu commented 2 years ago

I looked into this recently and also found the same issue, which wasn't present before. I wasn't able to figure out why. It's weird that it happens randomly.

Regardless, the implementation of "state forwarding" (README) is currently unoptimized for S4 so it is not recommended to use this. If you want this functionality, it should work with S4D. Feel free to file another issue if something comes up.

Finally, could you please edit the original issue here to be shorter, and in particular remove at least the last part of the stack trace? It might also help to put the whole thing in a code block. The last few lines are all parsed in a way that references other issues, which is confusing.

ethanbar11 commented 2 years ago

Yeah, I tried to look into it for a couple of days and couldn't work out what was happening. I'm now using the S4D forward_state version, and so far it works quite well. I edited the issue, hopefully making it more readable. Thanks!