nv-legate / cunumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cunumeric/24.06/
Apache License 2.0

Cunumeric Dot/Multiply Returns Zero Matrix But Numpy Dot/Multiply Returns Nonzero Matrix #1149

Open wlai0611 opened 1 month ago

wlai0611 commented 1 month ago

I was using cunumeric to compute the singular vectors and singular values of a 4900-row by 100-column matrix in which each column is a flattened 70 by 70 image of 3 particles approaching each other (attached here as lj.csv).

After obtaining the singular vectors and values, I attempt to reconstruct the 4900 x 100 matrix using cunumeric's dot and multiply functions, but the resulting matrix product is a zero matrix, whereas NumPy's dot and multiply applied to the same factors produce a nonzero matrix.

The process I describe above is coded below:

import cunumeric as cu
import numpy as np
import pathlib
X = cu.array(np.genfromtxt('lj.csv',delimiter=','))
print('Dataset Head')
print(X[:5,:5])
print(X.shape)
u,s,v  = cu.linalg.svd(X)
rank   = 5
us     = cu.multiply(u[:,:rank],s[:rank])
usv    = cu.dot(us,v[:rank])
print('Cunumeric reconstruction',usv[:5,:5])
print('Cunumeric Reconstruction Sum',usv.sum())
u = np.array(u)
s = np.array(s)
v = np.array(v)
us  = np.multiply(u[:,:rank],s[:rank])
usv = np.dot(us,v[:rank])
print('Numpy Reconstruction', usv[:5,:5])
print('Numpy reconstruct sum', usv.sum())

I run the script with:

legate --gpus 1 --sysmem 3000 --fbmem 3000 dot_product.py

My outputs:

[0 - 7fa730c91000]    0.000058 {4}{threads}: reservation ('Python-1 proc 1d00000000000006') cannot be satisfied
Dataset Head
[[73. 73. 73. 73. 72.]
 [73. 73. 73. 73. 72.]
 [74. 74. 74. 74. 74.]
 [73. 74. 73. 73. 73.]
 [73. 73. 73. 73. 73.]]
(4900, 100)
Cunumeric reconstruction [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
Cunumeric Reconstruction Sum 0.0
Numpy Reconstruction [[72.81182832 72.52941609 72.60097683 72.74904052 72.37250973]
 [73.14858434 72.85557594 72.94322608 73.07393697 72.72770279]
 [74.37506928 74.05718073 74.17241835 74.26597076 74.00117492]
 [73.5444478  73.23563627 73.3039867  73.40736981 73.18955462]
 [73.32200982 73.01808164 73.07275197 73.18442356 72.96220854]]
Numpy reconstruct sum 35497911.706290424
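
For anyone trying to narrow this down, the comparison above can be turned into a self-checking sketch. This is only an illustration under a few assumptions: the helper name `rank_k_reconstruct` and the `xp` parameter are hypothetical, the random matrix is a synthetic stand-in for lj.csv, and `full_matrices=False` is used to keep the factors small (the original script uses the default). It relies on the fact that the Frobenius error of a rank-k truncated SVD equals the square root of the sum of the discarded squared singular values, so an all-zero product fails the check immediately.

```python
import numpy as np

def rank_k_reconstruct(u, s, vh, k, xp=np):
    """Rank-k approximation U_k * diag(s_k) * Vh_k using array module xp.

    xp is a hypothetical switch: pass numpy here, or another module that
    exposes the same multiply/dot functions, to compare backends.
    """
    us = xp.multiply(u[:, :k], s[:k])  # broadcasting scales column j by s[j]
    return xp.dot(us, vh[:k])

# Synthetic stand-in with the same shape as the dataset in this report.
rng = np.random.default_rng(0)
X = rng.random((4900, 100)) + 70.0

u, s, vh = np.linalg.svd(X, full_matrices=False)
rank = 5
approx = rank_k_reconstruct(u, s, vh, rank)

# Relative Frobenius error of the reconstruction, and the value it should
# have according to the discarded singular values.
rel_err = np.linalg.norm(X - approx) / np.linalg.norm(X)
expected_err = np.sqrt(np.sum(s[rank:] ** 2)) / np.linalg.norm(X)

print("relative error:", rel_err)
print("expected error:", expected_err)
print("reconstruction is all zeros:", not approx.any())
```

If the two error values disagree, or the last line prints True, the multiply/dot step is the likely culprit rather than the SVD itself.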

My software and hardware specs are below:

Python      :  3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
Platform    :  Linux-5.19.0-38-generic-x86_64-with-glibc2.35
Legion      :  legion-24.03.0-456-g75074815f
Legate      :  24.06.00
[0 - 7f13ea1df740]    0.000051 {4}{threads}: reservation ('utility proc 1d00000000000000') cannot be satisfied
Cunumeric   :  24.06.00
Numpy       :  1.26.4
Scipy       :  1.14.0
Numba       :  (failed to detect)
CTK package :  cuda-version-12.5-hd4f0392_3 (conda-forge)
GPU driver  :  525.105.17
GPU devices :
  GPU 0: NVIDIA GeForce GTX 980

Thank you!

manopapad commented 1 month ago

We have failed to reproduce this on an Ampere and a Volta card. I managed to get my hands on a GeForce GTX 980 (which is Maxwell) and was just able to reproduce this, so this issue appears to be architecture-specific. More investigation is necessary.

Technically speaking we only support Volta+ (because that's what we test in CI), but we don't outright refuse to run on older architectures, because for the most part everything is expected to work. We know that some of our kernels require independent thread scheduling (which was introduced with Volta), but that shouldn't cause silent data corruption...
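
As a rough illustration of the Volta+ cutoff, a user-side guard could look like the sketch below. Everything here is hypothetical and not part of cuNumeric or Legate: the function names are made up, and the compute capability has to be supplied by the caller. It assumes the usual CUDA numbering in which Volta starts at compute capability 7.0 and the Maxwell-based GTX 980 reports 5.2.

```python
# Hypothetical helper, not a cuNumeric or Legate API.
VOLTA = (7, 0)  # first compute capability with independent thread scheduling

def is_volta_or_newer(major: int, minor: int) -> bool:
    """True if the compute capability is at or above Volta (7.0)."""
    return (major, minor) >= VOLTA

def describe_gpu(name: str, major: int, minor: int) -> str:
    """Return a short note on whether the GPU is inside the CI-tested range."""
    if is_volta_or_newer(major, minor):
        return f"{name}: compute capability {major}.{minor}, within the tested range"
    return (f"{name}: compute capability {major}.{minor} predates Volta; "
            "cross-check results against NumPy")

# The compute capability is passed in by hand here (5.2 for the GTX 980).
print(describe_gpu("NVIDIA GeForce GTX 980", 5, 2))
```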

manopapad commented 1 month ago

We confirmed that the bug does not reproduce on the same hardware when using the latest top-of-tree, so the underlying issue was fixed at some point between 24.06 and today. We plan to push a new top-of-tree build within the next two weeks (we are currently finalizing another patch release, so it will come after that). We will notify you at that point so you can try out the fix.