Software versions
Python : 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Platform : Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.28
Legion : v24.01.00.dev-38-g90944d7
Legate : 24.01.00.dev+38.g90944d7
WARNING: Disabling control replication for interactive run
Disable Control Replication
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: c315-012
Local device: mlx5_0
--------------------------------------------------------------------------
Cunumeric : 24.01.00.dev+29.g503affb8
Numpy : 2.0.0
Scipy : 1.14.0
Numba : 0.60.0
/work/08435/srvenkat/ls6/miniconda3/lib/python3.12/site-packages/conda_package_streaming/package_streaming.py:25: UserWarning: zstandard could not be imported. Running without .conda support.
warnings.warn("zstandard could not be imported. Running without .conda support.")
/work/08435/srvenkat/ls6/miniconda3/lib/python3.12/site-packages/conda_package_handling/api.py:29: UserWarning: Install zstandard Python bindings for .conda support
_warnings.warn("Install zstandard Python bindings for .conda support")
CTK package : cuda-version-12.4-hbda6634_3 (pkgs/main)
GPU driver : 535.104.12
GPU devices :
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
Jupyter notebook / Jupyter Lab version
No response
Expected behavior
I ran the cholesky.py example with -n 257 and expected to see the timing/flops output.
Observed behavior
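The run prints the timing line and then fails with an error saying the matrix is not positive definite. This is surprising because, as far as I can tell, the example factors an identity matrix, which is trivially positive definite. The error does not occur for -n 256 or smaller.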
Example code or instructions
legate --gpus 1 ./cholesky.py -n 257
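For comparison, here is a minimal sketch of the same check in plain NumPy. It assumes the input is an n x n identity matrix, which is my reading of the example and may not match what cholesky.py actually builds. The helper name `cholesky_identity_ok` is made up for illustration and is not part of cuNumeric or the example script.

```python
import numpy as np


def cholesky_identity_ok(n: int) -> bool:
    """Return True if NumPy factors the n x n identity without error
    and the resulting factor is again the identity."""
    a = np.eye(n, dtype=np.float64)
    try:
        lower = np.linalg.cholesky(a)
    except np.linalg.LinAlgError:
        # NumPy raises LinAlgError when the input is not positive definite.
        return False
    return bool(np.allclose(lower, a))


if __name__ == "__main__":
    # Sizes on either side of the 256/257 boundary reported above.
    for n in (256, 257):
        print(n, cholesky_identity_ok(n))
```

If this succeeds for both sizes under stock NumPy, the input itself is not the issue, and the failure would be specific to the cuNumeric code path at sizes above 256.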
Stack traceback or browser console output
(legate-ucx) c315-012.ls6(1033)$ legate --gpus 1 ./cholesky.py -n 257
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: c315-012
Local device: mlx5_0
--------------------------------------------------------------------------
Elapsed Time: 52.263 ms
108267.2062453361 GOP/s
[0 - 14f511066000] 1.320818 {6}{python}: python exception occurred within task:
numpy.linalg.LinAlgError: Matrix is not positive definite
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legion_top.py", line 481, in legion_python_main
cleanup()
File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/runtime.py", line 2164, in _cleanup_legate_runtime
runtime.destroy()
File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/runtime.py", line 1322, in destroy
self.raise_exceptions()
File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/runtime.py", line 2075, in raise_exceptions
pending.raise_exception()
File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/exception.py", line 50, in raise_exception
raise exn_reraised from exn_original
numpy.linalg.LinAlgError: Matrix is not positive definite
legion_python: /work/08435/srvenkat/ls6/legate.core/_skbuild/linux-x86_64-3.11/cmake-build/_deps/legion-src/runtime/realm/python/python_module.cc:1054: virtual void Realm::LocalPythonProcessor::execute_task(Realm::Processor::TaskFuncID, const Realm::ByteArrayRef&): Assertion `0' failed.
Signal 6 received by node 0, process 2983422 (thread 14f511066000) - obtaining backtrace
Signal 6 received by process 2983422 (thread 14f511066000) at: stack trace: 14 frames
[0] = raise at unknown file:0 [000014f78aaeba9f]
[1] = abort at unknown file:0 [000014f78aabee04]
[2] = __assert_fail_base.cold.0 at unknown file:0 [000014f78aabecd8]
[3] = __assert_fail at unknown file:0 [000014f78aae43f5]
[4] = Realm::LocalPythonProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) at unknown file:0 [000014f78b41463a]
[5] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [000014f78b3aaf41]
[6] = Realm::KernelThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [000014f78b3aafd5]
[7] = Realm::PythonThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [000014f78b41740c]
[8] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [000014f78b3a9325]
[9] = Realm::PythonThreadTaskScheduler::python_scheduler_loop() at unknown file:0 [000014f78b415f1e]
[10] = Realm::KernelThread::pthread_entry(void*) at unknown file:0 [000014f78b3aed73]
[11] = start_thread at unknown file:0 [000014f7889581ce]
[12] = __clone at unknown file:0 [000014f78aad6dd2]
[13] = unknown symbol at unknown file:0 [ffffffffffffffff]