Open ralberd opened 3 years ago
@ralberd Just to confirm, you see this error only with the CUDA build, right?
@seheracer is currently working on a better CUDA implementation. We will make sure these tests work with CUDA as well.
@brian-kelley It might be better for you to be in the loop for the longer term.
Yes, I have only seen these issues when I build with CUDA.
@ralberd: Thanks! We will reproduce and fix this as part of the better CUDA implementation.
I am looking into the issue.
Great, thank you! Do you have an estimate on the timeframe?
I will try to fix it as soon as I can, but I could not reproduce the problem (see below). Which CUDA and GPU are you using?
B1 * A1:
0 4 8
0 1 2 3 0 1 2 3
2 1 1 3 2 1 1 3
Matches gold row map: 1
Matches gold col map: 1
Matches gold entries: 1
A1 * B2:
0 2 4
0 1 0 1
4 4 3 3
Matches gold row map: 1
Matches gold col map: 1
Matches gold entries: 1
A2 * B2:
0 2 4 6 8
0 1 0 1 0 1 0 1
2 2 1 1 3 3 4 4
Matches gold row map: 1
Matches gold col map: 1
Matches gold entries: 1
done.
It would be great if you can also provide the SHAs of your kokkos and kokkos-kernels builds.
I am using CUDA nvcc 10.2.89 on an Nvidia Quadro GV100 GPU.
As for the kokkos and kokkos-kernels SHAs, I am not sure how to find them since I am using a specific trilinos commit that maps to release 12.18.1. The spack version statement is:
version('12.18.1', commit='55a75997332636a28afc9db1aee4ae46fe8d93e7') # tag trilinos-release-12-8-1
Can that be mapped to specific kokkos and kokkos-kernels SHAs?
@ndellingwood Is there a way to map the above-mentioned Trilinos SHA to kokkos and kokkos-kernels SHAs?
I think you can figure this out by going back to that Trilinos version and looking at what the Kokkos release was at the time (I believe there are release notes in the Kokkos package in Trilinos).
Never mind, I will try to use Trilinos release 12.18.1 instead.
@ralberd I am following the steps you provided in process.txt. The third step is failing for me (spack install trilinos+cuda ^nvcc-wrapper compute_capability=70)
==> 337871: Installing perl
==> Fetching http://www.cpan.org/src/5.0/perl-5.30.1.tar.gz
######################################################################## 100.0%
==> Fetching http://search.cpan.org/CPAN/authors/id/M/MI/MIYAGAWA/App-cpanminus-1.7042.tar.gz
######################################################################## 100.0%
curl: (60) Peer's certificate issuer has been marked as not trusted by the user.
More details here: http://curl.haxx.se/docs/sslcerts.html
curl performs SSL certificate verification by default, using a "bundle"
of Certificate Authority (CA) public keys (CA certs). If the default
bundle file isn't adequate, you can specify an alternate file
using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
the bundle, the certificate verification probably failed due to a
problem with the certificate (it might be expired, or the name might
not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
the -k (or --insecure) option.
==> Failed to fetch file from URL: http://search.cpan.org/CPAN/authors/id/M/MI/MIYAGAWA/App-cpanminus-1.7042.tar.gz
Curl was unable to fetch due to invalid certificate. This is either an attack, or your cluster's SSL configuration is bad. If you believe your SSL configuration is bad, you can try running spack -k, which will not check SSL certificates.Use this at your own risk.
==> Fetching from http://search.cpan.org/CPAN/authors/id/M/MI/MIYAGAWA/App-cpanminus-1.7042.tar.gz failed.
==> Error: FetchError: All fetchers failed for resource-cpanm-fdbvote7xvfyajeffxc6bwvuuesns5b3
/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/package.py:1127, in do_fetch:
1124 raise FetchError("Will not fetch %s" %
1125 self.spec.format('{name}{@version}'), ck_msg)
1126
>> 1127 self.stage.create()
1128 self.stage.fetch(mirror_only)
1129 self._fetch_time = time.time() - start_time
1130
==> Error: Failed to install perl due to ChildError: FetchError: All fetchers failed for resource-cpanm-fdbvote7xvfyajeffxc6bwvuuesns5b3
/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/package.py:1127, in do_fetch:
1124 raise FetchError("Will not fetch %s" %
1125 self.spec.format('{name}{@version}'), ck_msg)
1126
>> 1127 self.stage.create()
1128 self.stage.fetch(mirror_only)
1129 self._fetch_time = time.time() - start_time
1130
Traceback (most recent call last):
File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/build_environment.py", line 801, in child_process
return_value = function()
File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/installer.py", line 1046, in build_process
pkg.do_patch()
File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/package.py", line 1167, in do_patch
self.do_stage()
File "/home/sacer/nonsquare_spgemm_test/spack/var/spack/repos/builtin/packages/perl/package.py", line 97, in do_stage
# Add write permissions on file to be patched
File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/package.py", line 1152, in do_stage
self.do_fetch(mirror_only)
File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/package.py", line 1128, in do_fetch
self.stage.fetch(mirror_only)
File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/util/pattern.py", line 68, in getter
getattr(item, self.name)(*args, **kwargs)
File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/stage.py", line 476, in fetch
raise fs.FetchError(err_msg, None)
FetchError: All fetchers failed for resource-cpanm-fdbvote7xvfyajeffxc6bwvuuesns5b3
==> Warning: Skipping build of openssl since perl failed
==> Warning: Skipping build of cmake since openssl failed
==> Warning: Skipping build of glm since cmake failed
==> Error: Installation of trilinos failed. Review log for details
This is probably because this is the first time I have used spack to install Trilinos.
Any suggestions?
Where are you building? I have seen a similar error when building on the ascicgpu machines and have a workaround for it.
It is a machine called kokkos-dev2 with x86+Volta.
Try building it by following the instructions in the attached process.txt.
@ralberd I ran the following command:
spack install trilinos+cuda %gcc@7.2.0 ^nvcc-wrapper compute_capability=70 ^openmpi@1.10.1 ^netlib-lapack >& output
My loaded modules are:
Currently Loaded Modulefiles:
1) sems-env
2) sems-git/2.10.1
3) sems-cmake/3.12.2
4) sems-ninja_fortran/1.8.2
5) sems-gcc/7.2.0
6) sems-cuda/9.2
7) sems-openmpi/1.10.1
8) sems-hdf5/1.8.12/parallel
9) sems-netcdf/4.4.1/exo_parallel
10) sems-zlib/1.2.8/base
11) sems-boost/1.59.0/base
12) sems-superlu/4.3/base
Output is here: output.txt
I have seen similar errors when building on the CEE-LAN if my proxies aren't correctly set.
What is the output when you run env | grep proxy? It should return:
http_proxy=https://wwwproxy.sandia.gov:80
https_proxy=https://wwwproxy.sandia.gov:80
Also, try
curl https://www.x.org/archive/individual/util/util-macros-1.19.1.tar.bz2 -o test.tar.bz2
to see if it works.
The output of env | grep proxy:
http_proxy=http://sonproxy.sandia.gov:80
https_proxy=http://sonproxy.sandia.gov:80
HTTPS_PROXY=https_proxy
no_proxy=127.0.0.1,localhost,.sandia.gov
HTTP_PROXY=http://sonproxy.sandia.gov:80
The curl command did not work either.
Meanwhile I tried following your instructions on another machine, but couldn't get the build working there either.
I gave up on spack. I will build Trilinos 12.18.1 with my usual build scripts and try to reproduce the issue.
Meanwhile, can you try running your test with a Trilinos build with cusparse disabled?
cusparse should already be disabled in the Trilinos build I'm using. See the following console output:
Final set of non-enabled TPLs: MKL yaml-cpp Peano CUSPARSE Thrust Cusp TBB Pthread HWLOC QTHREAD BinUtils ARPREC QD Scotch OVIS gpcd METIS MTMETIS ParMETIS PuLP TopoManager LibTopoMap PaToH CppUnit ADOLC ADIC TVMET MF ExodusII Nemesis XDMF HDF5 CGNS ADIOS2 y12m SuperLUDist SuperLUMT SuperLU Cholmod UMFPACK MA28 AMD CSparse HYPRE PETSC BLACS SCALAPACK MUMPS PARDISO_MKL PARDISO Oski TAUCS ForUQTK Dakota HIPS MATLAB CASK SPARSKIT QT gtest BoostAlbLib OpenNURBS Portals CrayPortals Gemini InfiniBand BGPDCMF BGQPAMI Pablo HPCToolkit Clp GLPK qpOASES PAPI MATLABLib Eigen X11 Lemon GLM quadmath CAMAL RTlib AmgX CGAL CGALCore VTune TASMANIAN ArrayFireCPU SimMesh SimModel SimParasolid SimAcis SimField Valgrind QUO ViennaCL Avatar mlpack pebbl MAGMASparse Check 101
Just checking in to see if there has been any progress.
Hi Ryan,
Thanks for checking in. I built Trilinos 12.18.1 and ran your driver with the kokkos and kokkos-kernels libraries in that build. Unfortunately, I can't reproduce the issue; the results are always correct for me. @srajama1 any suggestions?
That's interesting...
What version of CUDA are you using? Also, what was the Trilinos build configuration you used?
The only things I can think of are the CUDA version and the configuration options, which @ralberd is already checking. Let us make sure both of you have the same configuration. @ralberd Do you have access to our testbed machines (weaver, white)?
I do not currently have access to those machines, but can request it through WebCARS.
On white:
I downloaded trilinos-release-12-18-1 from https://github.com/trilinos/Trilinos/releases.
In Trilinos, I did: source cmake/std/atdm/load-env.sh cuda-10.1.105-opt-Pascal60. This loads the following modules:
1) cuda/10.1.105
2) ucx/1.5.1
3) binutils/2.30.0
4) gcc/7.2.0
5) papi/5.6.0
6) openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105
7) pnetcdf/1.9.0/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105
8) zlib/1.2.8
9) hdf5/1.10.5/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105
10) netcdf/4.6.1/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105
11) cgns/20190329/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105
12) superlu/4.3.0/gcc/7.2.0
13) openblas/0.3.4/gcc/7.4.0
14) boost/1.65.1/gcc/7.2.0
15) metis/5.0.1/gcc/7.2.0
16) parmetis/4.0.3/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105
17) cmake/3.12.3
18) git/2.10.1
19) valgrind/3.12.0
20) devpack/20190404/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105
I built and installed Trilinos with the following:
cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DCMAKE_BUILD_TYPE:STRING=RELEASE \
-DCMAKE_INSTALL_PREFIX:FILEPATH=/ascldap/users/sacer/Trilinos-trilinos-release-12-18-1/install \
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
-DTrilinos_ENABLE_TESTS:BOOL=OFF \
-DTrilinos_ENABLE_EXAMPLES:BOOL=OFF \
-DTrilinos_ENABLE_KokkosKernels:BOOL=ON \
-DTPL_ENABLE_MPI:BOOL=ON \
-DTPL_ENABLE_CUDA:BOOL=ON \
-DTPL_ENABLE_CUSPARSE:BOOL=OFF \
-DKokkos_ENABLE_Cuda:BOOL=ON \
-DKokkos_ENABLE_Cuda_Lambda:BOOL=ON \
-DKokkos_ENABLE_Cuda_UVM:BOOL=ON \
-DKokkosKernels_ENABLE_TESTS:BOOL=ON \
-DKokkosKernels_ENABLE_EXAMPLES:BOOL=ON \
../
I compiled the driver as follows:
../Trilinos-trilinos-release-12-18-1/packages/kokkos/bin/nvcc_wrapper main.cpp -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -L/ascldap/users/sacer/Trilinos-trilinos-release-12-18-1/install/lib/ -lkokkoskernels -lkokkoscore -lkokkoscontainers -I/ascldap/users/sacer/Trilinos-trilinos-release-12-18-1/install/include
The SpGEMM results are correct:
B1 * A1:
0 4 8
0 1 2 3 0 1 2 3
2 1 1 3 2 1 1 3
Matches gold row map: 1
Matches gold col map: 1
Matches gold entries: 1
A1 * B2:
0 2 4
0 1 0 1
4 4 3 3
Matches gold row map: 1
Matches gold col map: 1
Matches gold entries: 1
A2 * B2:
0 2 4 6 8
0 1 0 1 0 1 0 1
2 2 1 1 3 3 4 4
Matches gold row map: 1
Matches gold col map: 1
Matches gold entries: 1
done.
Okay, thanks. I am working on building with that configuration. One difference I noticed is that your build has UVM on. Could that play a role?
I am not sure whether Trilinos can be built with UVM off yet. I know there is some progress on it, so I gave it a try by providing:
-DKokkos_ENABLE_Cuda_UVM:BOOL=OFF \
but the CMake output still contains the following:
=======================
KokkosKernels ETI Types
Devices: <Cuda,CudaSpace>;<Cuda,CudaUVMSpace>;<Serial,HostSpace>
The SpGEMM results are still correct.
UVM off is not supported in Trilinos yet, and specifying it in Kokkos is not doing much. I would strongly advise against it, and if a build is doing it and reporting erroneous behavior, I would suggest removing it as a first step toward debugging.
I was able to build on white following the steps you detailed above and I got the test to pass, same as you.
I am seeing the issue when I build on a machine called lumbergh running Ubuntu 18, and have also seen it on a personal machine running Ubuntu 18. I wonder if the issue could be related to the OS.
I can get you access to lumbergh through the SRN, I would just need a username. That way you could build there to see the issue.
(This is already in the email thread, but) I could not replicate this on Vortex with either CudaSpace or CudaUVMSpace, in either Release or RelWithDebInfo builds. I used matching CUDA (10.2.89) and Trilinos (12.18.1) versions.
Update: this ended up not being a code bug, but a bug in the Trilinos spack configuration. It didn't enable the CMake flag Kokkos_ARCH_VOLTA70, but nvcc_wrapper was still using compute capability 7.0. So the KOKKOS_ARCH_VOLTA macro wasn't defined, and the Volta-specific codepath in SpGEMM wasn't getting taken.
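To illustrate the mechanism described in this update, here is a minimal C++ sketch of how a compile-time architecture macro selects a codepath, and how a mismatch between that macro and the compute capability actually passed to the compiler could go unnoticed. DEMO_ARCH_VOLTA, selected_spgemm_path, and arch_config_consistent are hypothetical names made up for this sketch; this is not the Kokkos Kernels source, only an assumed shape of the problem.

```cpp
#include <cassert>
#include <string>

// DEMO_ARCH_VOLTA is a hypothetical stand-in for an architecture macro that
// the build system is expected to define (e.g. via a CMake arch option).
// If the build system forgets to define it, the generic branch is compiled,
// even when the device code itself is built for compute capability 7.0.
std::string selected_spgemm_path() {
#if defined(DEMO_ARCH_VOLTA)
  return "volta-specific";
#else
  return "generic";
#endif
}

// Hypothetical sanity check: the macro state and the compute capability the
// code was compiled for should agree with each other.
bool arch_config_consistent(bool volta_macro_defined,
                            int compiled_compute_capability) {
  assert(compiled_compute_capability > 0);
  const bool targets_volta = (compiled_compute_capability == 70);
  return volta_macro_defined == targets_volta;
}
```

In the situation reported here, the second function would correspond to arch_config_consistent(false, 70) returning false: the compiler targeted 7.0 while the macro was missing.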
Good afternoon,
I am doing development work with Sandia's Plato code and have run into an issue using the SpGEMM algorithm in Kokkos Kernels when multiplying two non-square matrices. For certain matrices, I am getting incorrect results. I have put together and attached a small test file to demonstrate the issue. This test considers three cases of non-square matrix-matrix multiplication and shows that the correct result is obtained for one case but not for the other two. Build instructions are contained in the process.txt file in the attached tarball and have been successfully used to build and run the test on a machine running Ubuntu 18.04.5. Interestingly, I have also built this test on the cee-lan without CUDA, and it returned the correct answer for all three cases.
Thanks, Ryan
nonsquare_spgemm_test.tar.gz
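For readers without the tarball, the kind of check reported in the thread (row map, column map, and entries each compared against a gold result) can be sketched with a serial reference SpGEMM in plain C++. This is not the attached test and does not use Kokkos Kernels; the Csr struct, reference_spgemm, and matches_gold are hypothetical names, and the sketch assumes the matrices are stored in CSR form with sorted column indices in the gold result.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Minimal CSR matrix (illustrative only, not a Kokkos Kernels type).
struct Csr {
  std::size_t num_rows;
  std::size_t num_cols;
  std::vector<std::size_t> row_map;  // size num_rows + 1
  std::vector<std::size_t> col_idx;  // one entry per nonzero
  std::vector<double> values;        // one entry per nonzero
};

// Serial reference product C = A * B, where A is m x k and B is k x n.
// Row and column counts are tracked separately, so non-square inputs work.
Csr reference_spgemm(const Csr& a, const Csr& b) {
  assert(a.num_cols == b.num_rows);
  Csr c;
  c.num_rows = a.num_rows;
  c.num_cols = b.num_cols;
  c.row_map.push_back(0);
  for (std::size_t i = 0; i < a.num_rows; ++i) {
    std::map<std::size_t, double> acc;  // accumulates row i, sorted by column
    for (std::size_t p = a.row_map[i]; p < a.row_map[i + 1]; ++p) {
      const std::size_t k = a.col_idx[p];
      for (std::size_t q = b.row_map[k]; q < b.row_map[k + 1]; ++q) {
        acc[b.col_idx[q]] += a.values[p] * b.values[q];
      }
    }
    for (const auto& entry : acc) {
      c.col_idx.push_back(entry.first);
      c.values.push_back(entry.second);
    }
    c.row_map.push_back(c.col_idx.size());
  }
  return c;
}

// Compares the three CSR arrays. Exact comparison of doubles is only
// reasonable here because the sketch assumes small integer-valued entries.
bool matches_gold(const Csr& result, const Csr& gold) {
  return result.row_map == gold.row_map &&
         result.col_idx == gold.col_idx &&
         result.values == gold.values;
}
```

A device result copied back to the host could be passed to matches_gold alongside the output of reference_spgemm to see which of the three arrays differs.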