chenwuchen opened this issue 10 months ago (Open)
@chenwuchen Bumped into the same error.
Solution: Use DDP or torchrun.
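In case it helps, here is a minimal sketch of that workaround; the Linear layer is only a stand-in for the actual Mamba model, and the script name train_ddp.py is just an example:

# train_ddp.py -- minimal DDP skeleton; torchrun sets the env vars used below
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(16, 16).cuda(local_rank)   # stand-in for the Mamba model
model = DDP(model, device_ids=[local_rank])

x = torch.randn(8, 16, device=f"cuda:{local_rank}")
model(x).sum().backward()                          # one dummy training step

dist.destroy_process_group()

Launch with e.g. torchrun --nproc_per_node=2 train_ddp.py. Unlike DataParallel, this runs one process per GPU, so Triton is never driven from multiple threads inside the same process.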
I was facing the same issue. Upon investigating further, I realized that line 75 of the autotuner.py file in the triton package receives self.nargs=None, which causes problems when generating a dictionary from it (inside the _bench method):
full_nargs = {**self.nargs, **current}
My take is that this may leave self.nargs=None unintentionally. But I was able to run DataParallel by replacing line 75 of autotuner.py with these modifications:
full_nargs = {}
if self.nargs:
full_nargs.update(self.nargs)
if current:
full_nargs.update(current)
Though I'm not sure whether the training would behave as expected overall.
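For context, here is a minimal sketch (plain Python, no Triton needed, with made-up values) of why the original line 75 raises the reported TypeError and why the guarded version does not:

self_nargs = None                   # what the autotuner ends up with under DataParallel
current = {"BLOCK_SIZE": 64}        # hypothetical tuning config

try:
    full_nargs = {**self_nargs, **current}   # original line 75
except TypeError as err:
    print(err)                               # 'NoneType' object is not a mapping

# guarded version from the patch above
full_nargs = {}
if self_nargs:
    full_nargs.update(self_nargs)
if current:
    full_nargs.update(current)
print(full_nargs)                            # {'BLOCK_SIZE': 64}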
I needed to do the same. I am executing the trainer script from mamba_chat and got the same triton error as above. Patching the package and installing it from source:
Get the package:
git clone https://github.com/openai/triton.git
git -C triton checkout release/2.1.x
pip install cmake
Patch it by editing python/triton/runtime/autotuner.py at line 75, replacing
full_nargs = {**self.nargs, **current}
with
full_nargs = {}
if self.nargs:
full_nargs.update(self.nargs)
if current:
full_nargs.update(current)
Proceed to install the patched version:
cd triton/python
pip install -e .
Install the rest of the mamba dependencies as usual.
(I am using Python 3.11 in a conda environment, and currently have training running on a pair of RTX 3090s.)
Hacking it that way could cause silent errors (especially if different args are passed concurrently to the JIT).
Looks like DataParallel is multi-threaded and Triton doesn't appear to be thread-safe.
If you're confident the rest of the forward pass is thread-safe and you must run in threaded mode, you could try running a single pass first to bootstrap the JIT before running it in parallel. I don't think mamba has a config.pre_hook, which is the only thing that would be written per run after benching is complete.
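A hedged sketch of that bootstrap idea (the Linear layer is only a placeholder for the real model):

import torch
import torch.nn as nn

model = nn.Linear(16, 16).cuda()          # placeholder; substitute the Mamba model here
sample_batch = torch.randn(8, 16).cuda()

# single-threaded warm-up pass: any JIT compilation / autotuning happens once here
with torch.no_grad():
    _ = model(sample_batch)

# only then wrap in DataParallel, so the threaded replicas reuse the cached kernels
dp_model = nn.DataParallel(model)
out = dp_model(sample_batch)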
@PheelaV Can you provide more details about how to build triton from source? Thanks! I can't install triton with Python 3.8.5 in a conda environment.
.triton/llvm/llvm+mlir-17.0.0-x86_64-linux-gnu-ubuntu-18.04-release/include/mlir/IR/Value.h:95:56: error: ‘void* __builtin_memset(void*, int, long unsigned int)’ specified size between 18446744039349813224 and 18446744073709551608 exceeds maximum object size 9223372036854775807 [-Werror=stringop-overflow=]
95 | constexpr Value(detail::ValueImpl *impl = nullptr) : impl(impl) {}
...
270 | iterator begin() { return (iterator)this->BeginX; }
| ~~~~~~^~~~~~
At global scope:
cc1plus: note: unrecognized command-line option ‘-Wno-covered-switch-default’ may have been intended to silence earlier diagnostics
cc1plus: all warnings being treated as errors
gmake[2]: *** [lib/Conversion/TritonGPUToLLVM/CMakeFiles/obj.TritonGPUToLLVM.dir/build.make:163: lib/Conversion/TritonGPUToLLVM/CMakeFiles/obj.TritonGPUToLLVM.dir/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp.o] Error 1
gmake[2]: Leaving directory 'triton/python/build/cmake.linux-x86_64-cpython-3.8'
gmake[1]: *** [CMakeFiles/Makefile2:2299: lib/Conversion/TritonGPUToLLVM/CMakeFiles/obj.TritonGPUToLLVM.dir/all] Error 2
gmake[1]: Leaving directory 'triton/python/build/cmake.linux-x86_64-cpython-3.8'
gmake: *** [Makefile:149: all] Error 2
...
ERROR: Could not build wheels for triton, which is required to install pyproject.toml-based projects
@AndssY Sorry, I can't really; I was using Python 3.10 or 3.11 on Ubuntu 22.04 LTS, and following their readme instructions everything worked out. I think the only thing I had to do out of the standard was to follow the specific release branch as requested by the mamba dependencies (important).
But this whole thing became redundant. I think Mamba was patched and everything suddenly started to work with just the mamba-ssm and causal-conv1d installs. I still kept the environment with a triton built from source to be sure, but it was no longer necessary for me.
Hope that helps at least a little bit. Good luck with getting it up and running. Feel free to DM me if you still struggle, I think I went through all the jumps and hoops I could have met.
...and following their readme instructions everything worked out. I think the only thing I had to do out of the standard was to follow the specific release branch as requested by the mamba dependencies (important).
@PheelaV Did you install according to the readme of mamba-chat? So mamba-ssm==1.0.1 and triton==release/2.1.x?
I will try python==3.11 and install it again following the readme of mamba-chat. Thanks very much!
Has it been resolved?
Title: Error when running multi-GPU training with Mamba
Description: I am experiencing an issue when running multi-GPU training with Mamba. Specifically, I am getting a TypeError: 'NoneType' object is not a mapping error when running the forward pass of the model. The error occurs when I try to run the model on multiple GPUs using the DataParallel module. However, when I run the model on a single GPU, everything works fine.
I have tried to reproduce the issue with a minimal example, but I was unable to do so. I have also checked the documentation and searched online for similar issues, but I couldn't find anything useful.
Here is the full traceback of the error:
I am using Python 3.10, PyTorch 1.12.1, causal_conv1d-1.1.1, mamba-ssm-1.1.1, and triton-2.1.0.