Open wyhsleep opened 8 months ago
Did you follow the suggestion in the error message?
Yes, but the randomness still remains
I'm not sure where the randomness is from. Can you comment out lines in the Mamba implementation to isolate?
Hi, I tried to analyze where the randomness comes from. I found that during training, when running the same model under the same settings, the last few digits of the loss start to differ after the first iteration. However, if I remove the Mamba module from our model, the loss returns to normal. We are wondering if this is related to computational precision. We use 32-bit floating-point precision for calculations, the built-in cross-entropy loss from torch, and optim.Adam as the optimizer.
The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.
Oh, noted with thanks. So the randomness here is normal, right?
Normal if you're training the model, not normal if you're only doing inference (forward pass only).
Oh, great thanks!!!
There's a specific call in torch to enable CUDA-deterministic processing, and setting a seed for randomness is also often important. This should make backpropagation more repeatable in training, which is useful when comparing hyperparameter or training changes. Oddly, I also find that using deterministic processing for backpropagation changes the convergence behaviour of a model, increasing the rate of convergence, but that may just be my specific model.
Using RAdam instead of Adam also provides better repeatability due to its controlled warmup (given that Adam internally adapts per-parameter learning rates, the learning rate we set for Adam is essentially a global one).
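For reference, the PyTorch calls mentioned above can be combined into one small seeding helper (a minimal sketch; the function name and seed value are illustrative, not from the repo):

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    """Seed every RNG that PyTorch training typically touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    # Required by some CuBLAS ops once deterministic algorithms are enabled.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False


seed_everything(0)
```

Note that ops without a deterministic implementation (such as the atomic adds discussed here) will raise an error under `torch.use_deterministic_algorithms(True)` rather than silently stay nondeterministic.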
The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.
Hi, I am wondering if there is a way to fix the behaviour of atomic adds? Looking forward to your reply!
One would have to change the backward pass implementation to not use atomic adds. I personally don't have bandwidth for this but we welcome contributions.
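One concrete way to check for this is to run the backward pass twice on identical inputs and compare gradients bit-for-bit (a sketch with a stand-in `nn.Linear` model; on CPU the gradients should match, while on CUDA the atomic adds in the Mamba backward can make them differ):

```python
import torch
import torch.nn as nn


def grad_fingerprint(model: nn.Module, x: torch.Tensor) -> list:
    """Return a copy of every parameter gradient after one backward pass."""
    model.zero_grad()
    model(x).sum().backward()
    return [p.grad.clone() for p in model.parameters()]


torch.manual_seed(0)
model = nn.Linear(8, 8)
x = torch.randn(4, 8)

g1 = grad_fingerprint(model, x)
g2 = grad_fingerprint(model, x)
# True on CPU; may be False on CUDA when atomic adds are involved.
identical = all(torch.equal(a, b) for a, b in zip(g1, g2))
```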
Thanks for your reply :)
Even for the forward pass, I noticed that results are somewhat unstable in my experiments. Given two inputs `x1` and `x2`, the result of `model(torch.stack([x1, x2]))` (i.e. batching) differs from `torch.stack([model(x1), model(x2)])`, especially if I use `fp16` or `bf16` (the gap is very small if I use `fp32`). Is this also expected behavior?
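A quick way to quantify this gap is to compare the batched and per-sample outputs directly (a sketch with a stand-in linear layer; substitute your Mamba model, device, and dtype):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 16)  # stand-in for the real model
x1 = torch.randn(64, 16)
x2 = torch.randn(64, 16)

batched = model(torch.stack([x1, x2]))        # one call on a batch of 2
single = torch.stack([model(x1), model(x2)])  # two separate calls
# In fp32 this gap is tiny; in fp16/bf16 it can grow noticeably.
gap = (batched - single).abs().max()
```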
Can you isolate which layer or function first produces different outputs?
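One way to isolate it is to capture every submodule's output with forward hooks across two runs and report the first mismatch (a sketch; `capture_outputs` is a hypothetical helper and the `nn.Sequential` model is a stand-in for the real network):

```python
import torch
import torch.nn as nn


def capture_outputs(model: nn.Module, x: torch.Tensor) -> dict:
    """Run the model once and record each submodule's output by name."""
    outs = {}
    hooks = []
    for name, mod in model.named_modules():
        if name:  # skip the root module itself
            hooks.append(mod.register_forward_hook(
                lambda m, i, o, name=name: outs.__setitem__(name, o.detach().clone())))
    model(x)
    for h in hooks:
        h.remove()
    return outs


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Tanh(), nn.Linear(4, 4))
x = torch.randn(2, 4)

a = capture_outputs(model, x)
b = capture_outputs(model, x)
# None when the forward pass is fully deterministic; otherwise the name of
# the first submodule whose two outputs disagree.
first_diff = next((k for k in a if not torch.equal(a[k], b[k])), None)
```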
I also found that mamba will bring randomness during forward propagation and greatly affect model convergence.
Can you isolate which layer or function first produces different outputs?
When I use `torch.use_deterministic_algorithms(True)`, I get the error below, and adding `CUBLAS_WORKSPACE_CONFIG=:4096:8` doesn't help. I hope this helps with the issue.
```
  File "/export/scratch/ra63nev/lab/zigma/dis_mamba/mamba_ssm/modules/mamba_simple.py", line 295, in _mamba_inner_forward
    self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
    ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
```
Did you also try the 16:8 environment variable it mentioned? Sorry if it seems I'm pointing out the obvious; it's just that you may have accidentally missed it.
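Also note that the variable has to be set before CuBLAS is initialized, so exporting it in the shell or setting it at the very top of the script, before any CUDA work happens, is the safe pattern (a minimal sketch):

```python
import os

# Must be set before CuBLAS initializes, i.e. before any CUDA op runs.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # or ":16:8"

import torch

torch.use_deterministic_algorithms(True)
```

Setting the variable after the first CUDA matmul has already run has no effect, which can make it look like the variable "doesn't help".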
Just curious if anyone found out the root cause of the randomness in the inference run? I am generating an ONNX model and trying to compare the outputs from PyTorch vs ONNX. With this randomness, it is difficult to compare them.
I've also noticed this issue: my forward propagation is the same, but after one iteration of backward propagation, inconsistencies appear. In addition, I've found that when setting `num_workers > 0` in DataLoader, I encounter the error "DataLoader worker (pid(s) 15804) exited unexpectedly," so I can only set `num_workers` to 0. Now I'm very troubled, as I can't use `num_workers > 0` to speed things up, and I also can't tune parameters due to the inherent randomness of Mamba.
I found that setting a larger value for `num_workers` eliminates the "DataLoader worker (pid(s) 15804) exited unexpectedly" issue, similar to setting `num_workers` to 25 in https://github.com/hustvl/Vim/blob/main/vim/scripts/pt-vim-t.sh.
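If you do raise `num_workers`, it is also worth seeding the workers and the sampler so the loading order stays reproducible (a sketch following the standard PyTorch recipe; the dataset and sizes are illustrative):

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    """Derive per-worker numpy/random seeds from the base torch seed."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


g = torch.Generator()
g.manual_seed(0)

ds = TensorDataset(torch.arange(100, dtype=torch.float32).unsqueeze(1))
loader = DataLoader(ds, batch_size=10, shuffle=True,
                    num_workers=0,  # raise above 0 once worker exits are resolved
                    worker_init_fn=seed_worker, generator=g)
first_batch = next(iter(loader))[0]
```

With the same generator seed, the shuffle order (and hence `first_batch`) is reproducible across runs regardless of the worker count.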
Hello,
Thanks for your interesting work, but I have a question about the code that I'd like to discuss with you. Despite fixing all the random seeds, I'm still observing randomness in the results of my runs. I use `torch.use_deterministic_algorithms(True)` and run the example code as follows:

```python
import torch
from mamba_ssm import Mamba

torch.use_deterministic_algorithms(True)

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim,  # Model dimension d_model
    d_state=16,   # SSM state expansion factor
    d_conv=4,     # Local convolution width
    expand=2,     # Block expansion factor
).to("cuda")
y = model(x)
assert y.shape == x.shape
```
I got the error message:
```
Traceback (most recent call last):
  File "test.py", line 14, in <module>
    y = model(x)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "//lib/python3.10/site-packages/mamba_ssm/modules/mamba_simple.py", line 136, in forward
    self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
```
It seems that there is randomness in the Mamba module. Did you encounter this before? Thank you for your help, and thank you again for your great work!