Open wyhsleep opened 8 months ago
Did you follow the suggestion in the error message?
Yes, but the randomness still remains
I'm not sure where the randomness is from. Can you comment out lines in the Mamba implementation to isolate?
Hi, I tried to analyze where the randomness comes from. I found that during training, when running the same model under the same settings, the last few digits of the loss start to differ after the first iteration. However, if I remove the Mamba module from our model, the loss returns to normal. We are wondering if this is related to computational precision. We use 32-bit floating-point precision for calculations, the built-in cross-entropy loss from torch, and optim.Adam as the optimizer.
The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.
Oh, noted with thanks. So the randomness here is normal, right?
Normal if you're training the model, not normal if you're only doing inference (forward pass only).
Oh, great thanks!!!
There's a specific call in torch to enable CUDA-deterministic processing, and setting a seed for randomness is also often important. This should make backpropagation more repeatable in training, which is useful when comparing hyperparameter or training changes. Oddly, I also find that using deterministic processing for backpropagation changes the convergence behaviour of a model, increasing the rate of convergence, but that may just be my specific model.
Using RAdam instead of Adam also provides better repeatability due to its controlled warmup (given that Adam internally adapts per-parameter learning rates, the learning rate we set for Adam is essentially a global one).
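For reference, the PyTorch calls mentioned above can be combined into one small seeding helper (a minimal sketch; the function name and seed value are illustrative, not from the repo):

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    """Seed every RNG that PyTorch training typically touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    # Required by some CuBLAS ops once deterministic algorithms are enabled.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False


seed_everything(0)
```

Note that ops without a deterministic implementation (such as the atomic adds discussed here) will raise an error under `torch.use_deterministic_algorithms(True)` rather than silently stay nondeterministic.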
The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.
Hi, I am wondering if there is a way to fix the behaviour of atomic adds? Looking forward to your reply!
One would have to change the backward pass implementation to not use atomic adds. I personally don't have bandwidth for this but we welcome contributions.
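One concrete way to check for this is to run the backward pass twice on identical inputs and compare gradients bit-for-bit (a sketch with a stand-in `nn.Linear` model; on CPU the gradients should match, while on CUDA the atomic adds in the Mamba backward can make them differ):

```python
import torch
import torch.nn as nn


def grad_fingerprint(model: nn.Module, x: torch.Tensor) -> list:
    """Return a copy of every parameter gradient after one backward pass."""
    model.zero_grad()
    model(x).sum().backward()
    return [p.grad.clone() for p in model.parameters()]


torch.manual_seed(0)
model = nn.Linear(8, 8)
x = torch.randn(4, 8)

g1 = grad_fingerprint(model, x)
g2 = grad_fingerprint(model, x)
# True on CPU; may be False on CUDA when atomic adds are involved.
identical = all(torch.equal(a, b) for a, b in zip(g1, g2))
```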
Thanks for your reply :)
Even for the forward pass, I noticed that results are somewhat unstable in my experiments. Given two inputs `x1` and `x2`, the result of `model(torch.stack([x1, x2]))` (i.e. batching) differs from `torch.stack([model(x1), model(x2)])`, especially if I use `fp16` or `bf16` (the gap is very small if I use `fp32`). Is this also expected behavior?
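A quick way to quantify this gap is to compare the batched and per-sample outputs directly (a sketch with a stand-in linear layer; substitute your Mamba model, device, and dtype):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 16)  # stand-in for the real model
x1 = torch.randn(64, 16)
x2 = torch.randn(64, 16)

batched = model(torch.stack([x1, x2]))        # one call on a batch of 2
single = torch.stack([model(x1), model(x2)])  # two separate calls
# In fp32 this gap is tiny; in fp16/bf16 it can grow noticeably.
gap = (batched - single).abs().max()
```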
Can you isolate which layer or function first produces different outputs?
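One way to isolate it is to capture every submodule's output with forward hooks across two runs and report the first mismatch (a sketch; `capture_outputs` is a hypothetical helper and the `nn.Sequential` model is a stand-in for the real network):

```python
import torch
import torch.nn as nn


def capture_outputs(model: nn.Module, x: torch.Tensor) -> dict:
    """Run the model once and record each submodule's output by name."""
    outs = {}
    hooks = []
    for name, mod in model.named_modules():
        if name:  # skip the root module itself
            hooks.append(mod.register_forward_hook(
                lambda m, i, o, name=name: outs.__setitem__(name, o.detach().clone())))
    model(x)
    for h in hooks:
        h.remove()
    return outs


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Tanh(), nn.Linear(4, 4))
x = torch.randn(2, 4)

a = capture_outputs(model, x)
b = capture_outputs(model, x)
# None when the forward pass is fully deterministic; otherwise the name of
# the first submodule whose two outputs disagree.
first_diff = next((k for k in a if not torch.equal(a[k], b[k])), None)
```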
I also found that mamba will bring randomness during forward propagation and greatly affect model convergence.
Can you isolate which layer or function first produces different outputs?
When I use `torch.use_deterministic_algorithms(True)`, I get the error below, and adding `CUBLAS_WORKSPACE_CONFIG=:4096:8` doesn't help. I hope this helps with the issue.
```
  File "/export/scratch/ra63nev/lab/zigma/dis_mamba/mamba_ssm/modules/mamba_simple.py", line 295, in _mamba_inner_forward
    self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
    ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
```
Did you also try the 16:8 environment variable it mentioned? Sorry if it seems I'm pointing out the obvious; it's just that you may have accidentally missed it.
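Also note that the variable has to be set before CuBLAS is initialized, so exporting it in the shell or setting it at the very top of the script, before any CUDA work happens, is the safe pattern (a minimal sketch):

```python
import os

# Must be set before CuBLAS initializes, i.e. before any CUDA op runs.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # or ":16:8"

import torch

torch.use_deterministic_algorithms(True)
```

Setting the variable after the first CUDA matmul has already run has no effect, which can make it look like the variable "doesn't help".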
Just curious if anyone found out the root cause of the randomness in the inference run? I am generating an ONNX model and trying to compare the outputs from PyTorch vs ONNX. With this randomness, it is difficult to compare them.
I've also noticed this issue: my forward propagation is the same, but after one iteration of backward propagation, inconsistencies appear. In addition, I've found that when setting `num_workers > 0` in DataLoader, I encounter the error "DataLoader worker (pid(s) 15804) exited unexpectedly," so I can only set `num_workers` to 0. Now I'm very troubled, as I can't use `num_workers > 0` to speed things up, and I also can't tune parameters due to the inherent randomness of Mamba.
I found that setting a larger value for `num_workers` eliminates the "DataLoader worker (pid(s) 15804) exited unexpectedly" issue, similar to setting `num_workers` to 25 in https://github.com/hustvl/Vim/blob/main/vim/scripts/pt-vim-t.sh.
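If you do raise `num_workers`, it is also worth seeding the workers and the sampler so the loading order stays reproducible (a sketch following the standard PyTorch recipe; the dataset and sizes are illustrative):

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    """Derive per-worker numpy/random seeds from the base torch seed."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


g = torch.Generator()
g.manual_seed(0)

ds = TensorDataset(torch.arange(100, dtype=torch.float32).unsqueeze(1))
loader = DataLoader(ds, batch_size=10, shuffle=True,
                    num_workers=0,  # raise above 0 once worker exits are resolved
                    worker_init_fn=seed_worker, generator=g)
first_batch = next(iter(loader))[0]
```

With the same generator seed, the shuffle order (and hence `first_batch`) is reproducible across runs regardless of the worker count.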
Hello,
Thanks for your interesting work, but I have a question about the code that I'd like to discuss with you. Despite fixing all the random seeds, I'm still observing randomness in the results of my runs. I use `torch.use_deterministic_algorithms(True)` and run the example code as follows:

```python
import torch
from mamba_ssm import Mamba

torch.use_deterministic_algorithms(True)

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim,  # Model dimension d_model
    d_state=16,   # SSM state expansion factor
    d_conv=4,     # Local convolution width
    expand=2,     # Block expansion factor
).to("cuda")
y = model(x)
assert y.shape == x.shape
```
I got the error message:
```
Traceback (most recent call last):
  File "test.py", line 14, in <module>
    y = model(x)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "//lib/python3.10/site-packages/mamba_ssm/modules/mamba_simple.py", line 136, in forward
    self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
```
It seems that there is randomness in the Mamba module. Did you encounter this before? Thank you for your help, and thank you again for your great work!