nv-legate / cupynumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cupynumeric
Apache License 2.0
623 stars, 71 forks

Cunumeric Dot/Multiply Returns Zero Matrix But Numpy Dot/Multiply Returns Nonzero Matrix #1149

Closed wlai0611 closed 2 weeks ago

wlai0611 commented 3 months ago

I was using cunumeric to get the singular vectors and values for a 4900 row by 100 column matrix in which each column is a flattened 70 by 70 image of 3 particles approaching each other (attached here lj.csv ).

After I obtain the singular vectors and values, I attempt to reconstruct the 4900 x 100 matrix using cunumeric's dot and multiply functions, but the resulting matrix product is a zero matrix, whereas using NumPy's dot and multiply results in a nonzero matrix.

The process described above is coded below:

import cunumeric as cu
import numpy as np

# Load the 4900 x 100 dataset (each column is a flattened 70 x 70 image)
X = cu.array(np.genfromtxt('lj.csv', delimiter=','))
print('Dataset Head')
print(X[:5,:5])
print(X.shape)

# Rank-5 reconstruction from the cunumeric SVD
u,s,v  = cu.linalg.svd(X)
rank   = 5
us     = cu.multiply(u[:,:rank],s[:rank])
usv    = cu.dot(us,v[:rank])
print('Cunumeric reconstruction',usv[:5,:5])
print('Cunumeric Reconstruction Sum',usv.sum())

# Repeat the same reconstruction in NumPy on copies of the same factors
u = np.array(u)
s = np.array(s)
v = np.array(v)
us  = np.multiply(u[:,:rank],s[:rank])
usv = np.dot(us,v[:rank])
print('Numpy Reconstruction', usv[:5,:5])
print('Numpy reconstruct sum', usv.sum())
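As a side note, the same rank-k reconstruction can be sanity-checked in plain NumPy with np.allclose instead of eyeballing the printouts. This is only a sketch with synthetic data standing in for lj.csv (the random matrix and seed are my own choices, not part of the original report):

```python
import numpy as np

# Synthetic stand-in for lj.csv: a random 4900 x 100 matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((4900, 100))

u, s, v = np.linalg.svd(X, full_matrices=False)

# The full-rank reconstruction should match X to floating-point tolerance.
full = np.dot(np.multiply(u, s), v)
assert np.allclose(full, X)

# A rank-5 reconstruction is only an approximation, but its sum should be
# nonzero for any nontrivial input -- the symptom reported above is an
# exactly-zero product, which a check like this would flag immediately.
rank = 5
usv = np.dot(np.multiply(u[:, :rank], s[:rank]), v[:rank])
assert usv.sum() != 0.0
print("sanity checks passed")
```

Swapping `np` for `cu` in the reconstruction lines would exercise the cunumeric path the same way.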

I run the py file with

legate --gpus 1 --sysmem 3000 --fbmem 3000 dot_product.py

My outputs:

[0 - 7fa730c91000]    0.000058 {4}{threads}: reservation ('Python-1 proc 1d00000000000006') cannot be satisfied
Dataset Head
[[73. 73. 73. 73. 72.]
 [73. 73. 73. 73. 72.]
 [74. 74. 74. 74. 74.]
 [73. 74. 73. 73. 73.]
 [73. 73. 73. 73. 73.]]
(4900, 100)
Cunumeric reconstruction [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
Cunumeric Reconstruction Sum 0.0
Numpy Reconstruction [[72.81182832 72.52941609 72.60097683 72.74904052 72.37250973]
 [73.14858434 72.85557594 72.94322608 73.07393697 72.72770279]
 [74.37506928 74.05718073 74.17241835 74.26597076 74.00117492]
 [73.5444478  73.23563627 73.3039867  73.40736981 73.18955462]
 [73.32200982 73.01808164 73.07275197 73.18442356 72.96220854]]
Numpy reconstruct sum 35497911.706290424

My hardware specs are below:

Python      :  3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
Platform    :  Linux-5.19.0-38-generic-x86_64-with-glibc2.35
Legion      :  legion-24.03.0-456-g75074815f
Legate      :  24.06.00
[0 - 7f13ea1df740]    0.000051 {4}{threads}: reservation ('utility proc 1d00000000000000') cannot be satisfied
Cunumeric   :  24.06.00
Numpy       :  1.26.4
Scipy       :  1.14.0
Numba       :  (failed to detect)
CTK package :  cuda-version-12.5-hd4f0392_3 (conda-forge)
GPU driver  :  525.105.17
GPU devices :
  GPU 0: NVIDIA GeForce GTX 980

Thank you!

manopapad commented 3 months ago

We have failed to reproduce this on an Ampere and a Volta card. I managed to get my hands on a GeForce GTX 980 (which is Maxwell) and was just able to reproduce this, so this issue appears to be architecture-specific. More investigation is necessary.

Technically speaking we only support Volta+ (because that's what we test in CI), but we don't outright refuse to run on older architectures, because for the most part everything is expected to work. We know that some of our kernels require independent thread scheduling (which was introduced with Volta), but that shouldn't cause silent data corruption...

manopapad commented 3 months ago

We confirmed that the bug does not reproduce on the same hardware when using the latest top-of-tree build, so at some point between 24.06 and today the underlying issue was fixed. We plan to push a new top-of-tree build within the next two weeks (we are currently finalizing another patch release, so it will come after that). We will notify you at that point to try out the fix.

manopapad commented 1 month ago

Here's how to get the latest nightly build, which should have fixed the above issue and supports both full_matrices=True and full_matrices=False:

conda create -n myenv -c legate/label/experimental -c conda-forge cunumeric

Do note that we're still debugging a crash with this build, which may or may not affect you.
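For anyone following along: the full_matrices switch follows the NumPy convention (cunumeric aims to be a drop-in replacement), so the shape difference can be illustrated with NumPy itself. The zero matrix below is just a placeholder with the same 4900 x 100 shape as the dataset in this issue:

```python
import numpy as np

X = np.zeros((4900, 100))  # placeholder for the real data

# full_matrices=True: u is square (4900 x 4900), v is 100 x 100.
u, s, v = np.linalg.svd(X, full_matrices=True)
print(u.shape, s.shape, v.shape)   # (4900, 4900) (100,) (100, 100)

# full_matrices=False: the "thin" SVD, u is only 4900 x 100.
u, s, v = np.linalg.svd(X, full_matrices=False)
print(u.shape, s.shape, v.shape)   # (4900, 100) (100,) (100, 100)
```

The rank-k reconstruction in the original script only uses the first k columns of u, so either mode works for it; the thin SVD is just much cheaper for tall matrices.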

wlai0611 commented 1 month ago

Thanks! The full_matrices=True case worked. I think I am still getting the dot/multiply error above, though. I was using a GTX 1080 Ti and a GTX Titan X, which were the nodes available at the moment.

manopapad commented 1 month ago

Asking @amberhassaan to try and reproduce the failure under Pascal.

manopapad commented 1 month ago

@wlai0611 which legate and cunumeric conda package versions did you use to reproduce the dot/multiply failure, just to make sure we're using the same ones?

wlai0611 commented 1 month ago

So the latest legate version was 24.09.00.dev230 and my cunumeric version was 24.09.00.dev97. Thanks!

manopapad commented 2 weeks ago

I believe the latest nightly packages have solved this, @wlai0611 could you please confirm?

wlai0611 commented 2 weeks ago

Thanks! I ran the following command to update:

conda update -n legate_experimental -c legate/label/experimental cunumeric

which updated cunumeric as follows:

cunumeric 24.09.00.dev97-cuda12_py312_g2217c6c8~ --> 24.09.00.dev116-cuda12_py312_g1c2b85e3_116_gpu

And now the output when running on an NVIDIA GeForce GTX 1080 Ti is correctly reconstructed:

Loading conda
Dataset Head
[[73. 73. 73. 73. 72.]
 [73. 73. 73. 73. 72.]
 [74. 74. 74. 74. 74.]
 [73. 74. 73. 73. 73.]
 [73. 73. 73. 73. 73.]]
(4900, 100)
Cunumeric reconstruction [[72.81182832 72.52941609 72.60097683 72.74904052 72.37250973]
 [73.14858434 72.85557594 72.94322608 73.07393697 72.72770279]
 [74.37506928 74.05718073 74.17241835 74.26597076 74.00117492]
 [73.5444478  73.23563627 73.3039867  73.40736981 73.18955462]
 [73.32200982 73.01808164 73.07275197 73.18442356 72.96220854]]
Cunumeric Reconstruction Sum 35497911.706290394
Numpy Reconstruction [[72.81182832 72.52941609 72.60097683 72.74904052 72.37250973]
 [73.14858434 72.85557594 72.94322608 73.07393697 72.72770279]
 [74.37506928 74.05718073 74.17241835 74.26597076 74.00117492]
 [73.5444478  73.23563627 73.3039867  73.40736981 73.18955462]
 [73.32200982 73.01808164 73.07275197 73.18442356 72.96220854]]

I noticed an additional output (that was not present before the update of my experimental venv) below. Should I be concerned?

[0 - 7b79cba19740]    0.000000 {4}{numa}: insufficient memory in NUMA node 0 (323452141568 > 63959773184 bytes) - skipping allocation
[0 - 7b79cba19740]    0.000000 {4}{numa}: insufficient memory in NUMA node 1 (323452141568 > 37612838912 bytes) - skipping allocation
[0 - 7b79cba19740]    0.000000 {4}{openmp}: not enough cores in NUMA domain 0 (4 < 20)
[0 - 7b79cba19740]    0.001883 {4}{threads}: reservation ('OMP-1 proc 1d00000000000003 (worker 3)') cannot be satisfied

manopapad commented 2 weeks ago

I noticed an additional output (that was not present before the update of my experimental venv) below. Should I be concerned?

This is not concerning. We added automatic configuration, and apparently we're not parsing the NUMA configuration correctly and are setting values too high. This should be fixed soon.

wlai0611 commented 2 weeks ago

Thanks so much! I will close this issue.