Closed: wlai0611 closed this issue 2 weeks ago.
We have failed to reproduce this on an Ampere and a Volta card. I managed to get my hands on a GeForce GTX 980 (which is Maxwell) and was just able to reproduce this, so this issue appears to be architecture-specific. More investigation is necessary.
Technically speaking we only support Volta+ (because that's what we test in CI), but we don't outright refuse to run on older architectures, because for the most part everything is expected to work. We know that some of our kernels require independent thread scheduling (introduced with Volta), but that shouldn't cause silent data corruption...
We confirmed that the bug does not reproduce on the same hardware when using latest top-of-tree, so at some point between 24.06 and today the underlying issue was fixed. We plan to push a new top-of-tree build within the next two weeks (currently finalizing another patch release, so it will come after that). We will notify you at that point to try out the fix.
Here's how to get the latest nightly build, which should have fixed the above issue and supports both `full_matrices=True` and `full_matrices=False`:

```shell
conda create -n myenv -c legate/label/experimental -c conda-forge cunumeric
```
Do note that we're still debugging a crash with this build, which may or may not affect you.
Thanks! `full_matrices=True` worked. I think I am still getting the dot product/multiply error above, though. I was using a GTX 1080 Ti and a GTX Titan X, which were the nodes available at the moment.
Asking @amberhassaan to try and reproduce the failure under Pascal.
@wlai0611 which legate and cunumeric conda package versions did you use to reproduce the dot/multiply failure, just to make sure we're testing with the same ones?
So the latest legate version was 24.09.00.dev230 and my cunumeric version was 24.09.00.dev97. Thanks!
I believe the latest nightly packages have solved this, @wlai0611 could you please confirm?
Thanks! So I ran the following command to update:
```shell
conda update -n legate_experimental -c legate/label/experimental cunumeric
```
which updated the cunumeric version as follows:

```
cunumeric 24.09.00.dev97-cuda12_py312_g2217c6c8 --> 24.09.00.dev116-cuda12_py312_g1c2b85e3_116_gpu
```
And now, running on an NVIDIA GeForce GTX 1080 Ti, the matrix is correctly reconstructed:
```
Loading conda
Dataset Head
[[73. 73. 73. 73. 72.]
 [73. 73. 73. 73. 72.]
 [74. 74. 74. 74. 74.]
 [73. 74. 73. 73. 73.]
 [73. 73. 73. 73. 73.]]
(4900, 100)
Cunumeric reconstruction
[[72.81182832 72.52941609 72.60097683 72.74904052 72.37250973]
 [73.14858434 72.85557594 72.94322608 73.07393697 72.72770279]
 [74.37506928 74.05718073 74.17241835 74.26597076 74.00117492]
 [73.5444478  73.23563627 73.3039867  73.40736981 73.18955462]
 [73.32200982 73.01808164 73.07275197 73.18442356 72.96220854]]
Cunumeric Reconstruction Sum 35497911.706290394
Numpy Reconstruction
[[72.81182832 72.52941609 72.60097683 72.74904052 72.37250973]
 [73.14858434 72.85557594 72.94322608 73.07393697 72.72770279]
 [74.37506928 74.05718073 74.17241835 74.26597076 74.00117492]
 [73.5444478  73.23563627 73.3039867  73.40736981 73.18955462]
 [73.32200982 73.01808164 73.07275197 73.18442356 72.96220854]]
```
I noticed an additional output (that was not present before the update of my experimental venv) below. Should I be concerned?
```
[0 - 7b79cba19740]    0.000000 {4}{numa}: insufficient memory in NUMA node 0 (323452141568 > 63959773184 bytes) - skipping allocation
[0 - 7b79cba19740]    0.000000 {4}{numa}: insufficient memory in NUMA node 1 (323452141568 > 37612838912 bytes) - skipping allocation
[0 - 7b79cba19740]    0.000000 {4}{openmp}: not enough cores in NUMA domain 0 (4 < 20)
[0 - 7b79cba19740]    0.001883 {4}{threads}: reservation ('OMP-1 proc 1d00000000000003 (worker 3)') cannot be satisfied
```
> I noticed an additional output (that was not present before the update of my experimental venv) below. Should I be concerned?
This is not concerning. We recently added automatic configuration, and apparently we're not parsing the NUMA topology correctly and are setting the memory and core counts too high. This should be fixed soon.
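Until that fix lands, one possible workaround (my assumption, not official guidance; the flag names come from the legate launcher and may vary by version, so check `legate --help`) is to pin the resource sizes on the command line so the automatic NUMA-based configuration is not relied upon:

```shell
# Hypothetical invocation: explicitly cap the OpenMP processor count,
# threads per processor, and system/NUMA memory (in MiB) instead of
# letting legate derive them from the (misparsed) NUMA topology.
# "your_script.py" stands in for your actual reconstruction script.
legate --omps 1 --ompthreads 4 --sysmem 4000 --numamem 0 your_script.py
```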
Thanks so much! I will close this issue.
I was using cunumeric to get the singular vectors and values for a 4900 row by 100 column matrix in which each column is a flattened 70 by 70 image of 3 particles approaching each other (attached here lj.csv ).
After I obtain the singular vectors and values, I attempt to reconstruct the 4900 x 100 matrix using cunumeric's dot and multiply functions, but the resulting matrix product is a zero matrix, whereas using NumPy's dot and multiply results in a nonzero matrix.
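The reconstruction path being described can be sketched roughly as follows (a minimal sketch using a random stand-in matrix, since the attached lj.csv and the original script are not reproduced here; the cunumeric path would use the same calls with `import cunumeric as np` instead):

```python
import numpy as np

# Stand-in for the attached lj.csv data: a random 4900 x 100 matrix,
# matching the shape described above (each column a flattened 70x70 image).
rng = np.random.default_rng(0)
A = rng.random((4900, 100))

# Thin SVD: U is (4900, 100), S is (100,), Vt is (100, 100).
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruct via multiply + dot: scale each column of U by the
# corresponding singular value, then apply Vt.
A_rec = np.dot(np.multiply(U, S), Vt)

# With NumPy this reconstructs A; the reported bug was that the
# cunumeric result came back as an all-zero matrix instead.
print(np.allclose(A, A_rec))  # → True
```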
The process I describe above is coded below:
I run the py file with
My outputs:
My hardware specs are below:
Thank you!