Closed hmhyau closed 5 years ago
After a few days of digging, I've located the origin of these problems. It turns out that the memory limit set for ODrM* is too low. Once I removed these two lines from cython_od_mstar.pyx and recompiled, it works fine as long as there is enough RAM.
# Comment or remove these lines
# import resource
# resource.setrlimit(resource.RLIMIT_AS, (2**33,2**33)) # 8Gb
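If you would rather keep a cap than delete it outright, something along these lines might work. Note that RLIMIT_AS limits the whole virtual address space of the process, not just resident memory, and 2**33 bytes is 8 GiB. This is only a rough sketch, not what the PRIMAL code does: the helper name set_address_space_limit and the environment variable ODMSTAR_MEM_LIMIT_GB are made up for illustration, and the .pyx still has to be recompiled after any change.

```python
import os
import resource


def set_address_space_limit(limit_bytes=None):
    """Set the soft RLIMIT_AS for this process (hypothetical helper).

    limit_bytes=None leaves the limit untouched. The hard limit is never
    changed, so the soft limit can be raised again later.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if limit_bytes is None:
        return soft, hard
    if hard != resource.RLIM_INFINITY:
        # The soft limit may not exceed the hard limit.
        limit_bytes = min(limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    return resource.getrlimit(resource.RLIMIT_AS)


# ODMSTAR_MEM_LIMIT_GB is an assumed, illustrative variable name.
# Unset means "no cap", which matches simply removing the two lines.
_limit_gb = os.environ.get("ODMSTAR_MEM_LIMIT_GB")
set_address_space_limit(int(float(_limit_gb) * 2**30) if _limit_gb else None)
```

With this, leaving the variable unset behaves like the fix above, while setting it to e.g. 64 restores a cap without touching the source again.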
Closing the issue as it is solved as of now.
Hi Guillaume,
First of all, thanks for the enlightening work on PRIMAL.
I cloned the code and attempted to train a new model using the Jupyter notebook on an NVIDIA DGX Station. Model inference works fine, so I proceeded to train my own model. As per NVIDIA's instructions, I created a new Docker image and ran the code inside a Docker container, then installed all dependencies and compiled cpp_mstar. This doesn't work, and I get a different error almost every time I run the notebook:
CUBLAS_STATUS_NOT_INITIALIZED
CUDA OOM Error
On rare occasions when it can be run - std::system_error: Resource temporarily not available
On rare occasions when it can be run - IndexError: too many indices for array
Over the past week I've made several changes to try to make it run properly, including decreasing the number of threads and/or the number of meta-agents, but none of them help. Converting the notebook to a .py script also fails.
CPU training fails too with ResourceExhaustedError:
which doesn't make much sense given the 256 GB of RAM in the server.
It would be great if you could provide some assistance in tackling these issues.
Best, Herman