nv-legate / cunumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cunumeric/24.06/
Apache License 2.0

[BUG] Cholesky example shows "matrix is not positive definite" error #1148

Open s769 opened 1 month ago

s769 commented 1 month ago

Software versions

Python      :  3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Platform    :  Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.28
Legion      :  v24.01.00.dev-38-g90944d7
Legate      :  24.01.00.dev+38.g90944d7
WARNING: Disabling control replication for interactive run
Disable Control Replication
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   c315-012
  Local device: mlx5_0
--------------------------------------------------------------------------
Cunumeric   :  24.01.00.dev+29.g503affb8
Numpy       :  2.0.0
Scipy       :  1.14.0
Numba       :  0.60.0
/work/08435/srvenkat/ls6/miniconda3/lib/python3.12/site-packages/conda_package_streaming/package_streaming.py:25: UserWarning: zstandard could not be imported. Running without .conda support.
  warnings.warn("zstandard could not be imported. Running without .conda support.")
/work/08435/srvenkat/ls6/miniconda3/lib/python3.12/site-packages/conda_package_handling/api.py:29: UserWarning: Install zstandard Python bindings for .conda support
  _warnings.warn("Install zstandard Python bindings for .conda support")
CTK package :  cuda-version-12.4-hbda6634_3 (pkgs/main)
GPU driver  :  535.104.12
GPU devices :
  GPU 0: NVIDIA A100-PCIE-40GB
  GPU 1: NVIDIA A100-PCIE-40GB
  GPU 2: NVIDIA A100-PCIE-40GB

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

I ran the cholesky.py example with -n 257 and expected to see the timing/flops output.

Observed behavior

I got an error saying the matrix is not positive definite. This was strange since I believe the example uses an identity matrix. I do not get the error for -n 256 or less.

Example code or instructions

legate --gpus 1 ./cholesky.py -n 257
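
For comparison, a sanity check at the same size in plain NumPy could look like the sketch below. It assumes the input is an identity matrix, as guessed above; the actual cholesky.py may construct its input differently. `is_positive_definite` is a hypothetical helper written for this check and is not part of cuNumeric or NumPy.

```python
import numpy as np


def is_positive_definite(a):
    """Return True if NumPy's Cholesky factorization of `a` succeeds."""
    try:
        np.linalg.cholesky(a)
    except np.linalg.LinAlgError:
        # NumPy raises LinAlgError when the factorization fails,
        # i.e. when the matrix is not positive definite.
        return False
    return True


# Assumption: the example factorizes an n x n identity matrix.
n = 257
print(is_positive_definite(np.eye(n)))
```

If this succeeds for n = 257 under plain NumPy, the failure is more likely in the cuNumeric code path than in the input matrix itself.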

Stack traceback or browser console output

(legate-ucx) c315-012.ls6(1033)$ legate --gpus 1 ./cholesky.py -n 257
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   c315-012
  Local device: mlx5_0
--------------------------------------------------------------------------
Elapsed Time: 52.263 ms
108267.2062453361 GOP/s
[0 - 14f511066000]    1.320818 {6}{python}: python exception occurred within task:
numpy.linalg.LinAlgError: Matrix is not positive definite

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legion_top.py", line 481, in legion_python_main
    cleanup()
  File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/runtime.py", line 2164, in _cleanup_legate_runtime
    runtime.destroy()
  File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/runtime.py", line 1322, in destroy
    self.raise_exceptions()
  File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/runtime.py", line 2075, in raise_exceptions
    pending.raise_exception()
  File "/work/08435/srvenkat/ls6/miniconda3/envs/legate-ucx/lib/python3.1/site-packages/legate/core/exception.py", line 50, in raise_exception
    raise exn_reraised from exn_original
numpy.linalg.LinAlgError: Matrix is not positive definite
legion_python: /work/08435/srvenkat/ls6/legate.core/_skbuild/linux-x86_64-3.11/cmake-build/_deps/legion-src/runtime/realm/python/python_module.cc:1054: virtual void Realm::LocalPythonProcessor::execute_task(Realm::Processor::TaskFuncID, const Realm::ByteArrayRef&): Assertion `0' failed.
Signal 6 received by node 0, process 2983422 (thread 14f511066000) - obtaining backtrace
Signal 6 received by process 2983422 (thread 14f511066000) at: stack trace: 14 frames
  [0] = raise at unknown file:0 [000014f78aaeba9f]
  [1] = abort at unknown file:0 [000014f78aabee04]
  [2] = __assert_fail_base.cold.0 at unknown file:0 [000014f78aabecd8]
  [3] = __assert_fail at unknown file:0 [000014f78aae43f5]
  [4] = Realm::LocalPythonProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) at unknown file:0 [000014f78b41463a]
  [5] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [000014f78b3aaf41]
  [6] = Realm::KernelThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [000014f78b3aafd5]
  [7] = Realm::PythonThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [000014f78b41740c]
  [8] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [000014f78b3a9325]
  [9] = Realm::PythonThreadTaskScheduler::python_scheduler_loop() at unknown file:0 [000014f78b415f1e]
  [10] = Realm::KernelThread::pthread_entry(void*) at unknown file:0 [000014f78b3aed73]
  [11] = start_thread at unknown file:0 [000014f7889581ce]
  [12] = __clone at unknown file:0 [000014f78aad6dd2]
  [13] = unknown symbol at unknown file:0 [ffffffffffffffff]

manopapad commented 1 month ago

I am not seeing the issue on my machine with the 24.06 packages (the latest available on conda). Could you please check whether those resolve the problem for you?