kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels

SpGEMM errors with non-square matrix multiplication #889

Open ralberd opened 3 years ago

ralberd commented 3 years ago

Good afternoon,

I am doing development work with Sandia's Plato code and have run into an issue using the SpGEMM algorithm in Kokkos Kernels when multiplying two non-square matrices. For certain matrices, I am getting incorrect results. I have put together and attached a small test file to demonstrate the issue. The test considers three cases of non-square matrix-matrix multiplication and shows that the correct result is obtained for one case but not for the other two. Build instructions are in the process.txt file in the attached tarball and have been used successfully to build and run the test on a machine running Ubuntu 18.04.5. Interestingly, I have also built this test on the CEE-LAN without CUDA, and it returned the correct answer for all three cases.

Thanks, Ryan

nonsquare_spgemm_test.tar.gz
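
For readers without the attachment, the kind of check the test reports (comparing the product's row map, column indices, and entries against a gold result) can be sketched on the host in plain C++. This is a minimal serial sketch under stated assumptions, not the attached test and not Kokkos Kernels code: the CrsMatrix struct and the functions referenceSpgemm and sameCrs are hypothetical names introduced here, and the matrices in the thread are not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical minimal CRS (compressed row storage) container for this sketch.
struct CrsMatrix {
  std::size_t numRows = 0;
  std::size_t numCols = 0;
  std::vector<std::size_t> rowMap;  // numRows + 1 offsets into colIdx/values
  std::vector<std::size_t> colIdx;  // column index of each stored entry
  std::vector<double> values;       // value of each stored entry
};

// Serial reference product C = A * B for possibly non-square operands.
// Each row of C is accumulated in an ordered map, so its column indices
// come out sorted.
inline CrsMatrix referenceSpgemm(const CrsMatrix& A, const CrsMatrix& B) {
  assert(A.numCols == B.numRows);  // inner dimensions must agree
  CrsMatrix C;
  C.numRows = A.numRows;
  C.numCols = B.numCols;
  C.rowMap.push_back(0);
  for (std::size_t i = 0; i < A.numRows; ++i) {
    std::map<std::size_t, double> acc;  // column -> accumulated value for row i
    for (std::size_t ka = A.rowMap[i]; ka < A.rowMap[i + 1]; ++ka) {
      const std::size_t k = A.colIdx[ka];
      for (std::size_t kb = B.rowMap[k]; kb < B.rowMap[k + 1]; ++kb) {
        acc[B.colIdx[kb]] += A.values[ka] * B.values[kb];
      }
    }
    for (const auto& entry : acc) {
      C.colIdx.push_back(entry.first);
      C.values.push_back(entry.second);
    }
    C.rowMap.push_back(C.colIdx.size());
  }
  return C;
}

// Exact comparison of all three CRS arrays; adequate for small integer-valued
// test data, too strict for general floating-point results.
inline bool sameCrs(const CrsMatrix& X, const CrsMatrix& Y) {
  return X.numRows == Y.numRows && X.numCols == Y.numCols &&
         X.rowMap == Y.rowMap && X.colIdx == Y.colIdx && X.values == Y.values;
}
```

A device result copied back to the host could be compared against referenceSpgemm with sameCrs. One caveat, stated as an assumption: a GPU SpGEMM may return the entries of a row in a different column order than this reference, in which case each row would need to be sorted by column before comparing.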

srajama1 commented 3 years ago

@ralberd Just to confirm, you see this error only with the CUDA build, right?

@seheracer is currently working on a better CUDA implementation. We will make sure these tests work with CUDA as well.

@brian-kelley It might be better to have you in the loop for the longer term.

ralberd commented 3 years ago

Yes, I have only seen these issues when I build with CUDA.

srajama1 commented 3 years ago

@ralberd : Thanks ! We will reproduce and fix as part of the better CUDA implementation.

seheracer commented 3 years ago

I am looking into the issue.

ralberd commented 3 years ago

Great, thank you! Do you have an estimate on the timeframe?

seheracer commented 3 years ago

I will try to fix it as soon as I can, but I could not reproduce the problem (see below). Which CUDA and GPU are you using?

B1 * A1:

0
4
8

0
1
2
3
0
1
2
3

2
1
1
3
2
1
1
3
Matches gold row map: 1
Matches gold col map: 1
Matches gold entries: 1
A1 * B2:

0
2
4

0
1
0
1

4
4
3
3
Matches gold row map: 1
Matches gold col map: 1
Matches gold entries: 1
A2 * B2:

0
2
4
6
8

0
1
0
1
0
1
0
1

2
2
1
1
3
3
4
4
Matches gold row map: 1
Matches gold col map: 1
Matches gold entries: 1
done.
seheracer commented 3 years ago

It would be great if you can also provide the SHAs of your kokkos and kokkos-kernels builds.

ralberd commented 3 years ago

I am using CUDA nvcc 10.2.89 on an Nvidia Quadro GV100 GPU.

As for the kokkos and kokkos-kernels SHAs, I am not sure how to find them, since I am using a specific Trilinos commit that maps to release 12.18.1. The spack version statement is: version('12.18.1', commit='55a75997332636a28afc9db1aee4ae46fe8d93e7') # tag trilinos-release-12-8-1

Can that be mapped to specific kokkos and kokkos-kernels SHAs?

seheracer commented 3 years ago

@ndellingwood Is there a way to map the above-mentioned Trilinos SHA to kokkos and kokkos-kernels SHAs?

lucbv commented 3 years ago

I think you can figure this out by going back to that Trilinos version and looking at what the Kokkos release was at the time (there should be release notes in the Kokkos package in Trilinos, I believe).

seheracer commented 3 years ago

Never mind, I will try to use Trilinos release 12.18.1 instead.

seheracer commented 3 years ago

@ralberd I am following the steps you provided in process.txt. The third step is failing for me (spack install trilinos+cuda ^nvcc-wrapper compute_capability=70)


==> 337871: Installing perl
==> Fetching http://www.cpan.org/src/5.0/perl-5.30.1.tar.gz
######################################################################## 100.0%
==> Fetching http://search.cpan.org/CPAN/authors/id/M/MI/MIYAGAWA/App-cpanminus-1.7042.tar.gz
######################################################################## 100.0%
curl: (60) Peer's certificate issuer has been marked as not trusted by the user.
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
 of Certificate Authority (CA) public keys (CA certs). If the default
 bundle file isn't adequate, you can specify an alternate file
 using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
 the bundle, the certificate verification probably failed due to a
 problem with the certificate (it might be expired, or the name might
 not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
 the -k (or --insecure) option.
==> Failed to fetch file from URL: http://search.cpan.org/CPAN/authors/id/M/MI/MIYAGAWA/App-cpanminus-1.7042.tar.gz
    Curl was unable to fetch due to invalid certificate. This is either an attack, or your cluster's SSL configuration is bad.  If you believe your SSL configuration is bad, you can try running spack -k, which will not check SSL certificates.Use this at your own risk.
==> Fetching from http://search.cpan.org/CPAN/authors/id/M/MI/MIYAGAWA/App-cpanminus-1.7042.tar.gz failed.
==> Error: FetchError: All fetchers failed for resource-cpanm-fdbvote7xvfyajeffxc6bwvuuesns5b3

/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/package.py:1127, in do_fetch:
       1124                raise FetchError("Will not fetch %s" %
       1125                                 self.spec.format('{name}{@version}'), ck_msg)
       1126
  >>   1127        self.stage.create()
       1128        self.stage.fetch(mirror_only)
       1129        self._fetch_time = time.time() - start_time
       1130

==> Error: Failed to install perl due to ChildError: FetchError: All fetchers failed for resource-cpanm-fdbvote7xvfyajeffxc6bwvuuesns5b3
/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/package.py:1127, in do_fetch:
       1124                raise FetchError("Will not fetch %s" %
       1125                                 self.spec.format('{name}{@version}'), ck_msg)
       1126
  >>   1127        self.stage.create()
       1128        self.stage.fetch(mirror_only)
       1129        self._fetch_time = time.time() - start_time
       1130

Traceback (most recent call last):
  File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/build_environment.py", line 801, in child_process
    return_value = function()
  File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/installer.py", line 1046, in build_process
    pkg.do_patch()
  File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/package.py", line 1167, in do_patch
    self.do_stage()
  File "/home/sacer/nonsquare_spgemm_test/spack/var/spack/repos/builtin/packages/perl/package.py", line 97, in do_stage
    # Add write permissions on file to be patched
  File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/package.py", line 1152, in do_stage
    self.do_fetch(mirror_only)
  File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/package.py", line 1128, in do_fetch
    self.stage.fetch(mirror_only)
  File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/util/pattern.py", line 68, in getter
    getattr(item, self.name)(*args, **kwargs)
  File "/home/sacer/nonsquare_spgemm_test/spack/lib/spack/spack/stage.py", line 476, in fetch
    raise fs.FetchError(err_msg, None)
FetchError: All fetchers failed for resource-cpanm-fdbvote7xvfyajeffxc6bwvuuesns5b3

==> Warning: Skipping build of openssl since perl failed
==> Warning: Skipping build of cmake since openssl failed
==> Warning: Skipping build of glm since cmake failed
==> Error: Installation of trilinos failed.  Review log for details

This is probably because this is the first time I have used spack to install Trilinos.

Any suggestions?

ralberd commented 3 years ago

Where are you building? I have seen a similar error when building on the ascicgpu machines and have a workaround for it.

seheracer commented 3 years ago

It is a machine called kokkos-dev2 with x86+Volta.

ralberd commented 3 years ago

Try building it following the instructions in the attached process.txt.

process.txt

seheracer commented 3 years ago

@ralberd I ran the following command:

spack install trilinos+cuda %gcc@7.2.0 ^nvcc-wrapper compute_capability=70 ^openmpi@1.10.1 ^netlib-lapack >& output

My loaded modules are:

Currently Loaded Modulefiles:
  1) sems-env                         3) sems-cmake/3.12.2                5) sems-gcc/7.2.0                   7) sems-openmpi/1.10.1              9) sems-netcdf/4.4.1/exo_parallel  11) sems-boost/1.59.0/base
  2) sems-git/2.10.1                  4) sems-ninja_fortran/1.8.2         6) sems-cuda/9.2                    8) sems-hdf5/1.8.12/parallel       10) sems-zlib/1.2.8/base            12) sems-superlu/4.3/base

Output is here: output.txt

ralberd commented 3 years ago

I have seen similar errors when building on the CEE-LAN if my proxies aren't correctly set.

What is the output when you do env | grep proxy? It should return

http_proxy=https://wwwproxy.sandia.gov:80
https_proxy=https://wwwproxy.sandia.gov:80

Also, try curl https://www.x.org/archive/individual/util/util-macros-1.19.1.tar.bz2 -o test.tar.bz2 to see if it works.

seheracer commented 3 years ago

The output of env | grep proxy:

http_proxy=http://sonproxy.sandia.gov:80
https_proxy=http://sonproxy.sandia.gov:80
HTTPS_PROXY=https_proxy
no_proxy=127.0.0.1,localhost,.sandia.gov
HTTP_PROXY=http://sonproxy.sandia.gov:80

The curl command did not work either.

Meanwhile I tried following your instructions on another machine, but couldn't get the build working there either.

I gave up on spack. I will build Trilinos 12.18.1 with my usual build scripts and try to reproduce the issue.

Meanwhile, can you try running your test with a Trilinos build with cusparse disabled?

ralberd commented 3 years ago

cusparse should already be disabled in the Trilinos build I'm using. See the following console output:

Final set of non-enabled TPLs: MKL yaml-cpp Peano CUSPARSE Thrust Cusp TBB Pthread HWLOC QTHREAD BinUtils ARPREC QD Scotch OVIS gpcd METIS MTMETIS ParMETIS PuLP TopoManager LibTopoMap PaToH CppUnit ADOLC ADIC TVMET MF ExodusII Nemesis XDMF HDF5 CGNS ADIOS2 y12m SuperLUDist SuperLUMT SuperLU Cholmod UMFPACK MA28 AMD CSparse HYPRE PETSC BLACS SCALAPACK MUMPS PARDISO_MKL PARDISO Oski TAUCS ForUQTK Dakota HIPS MATLAB CASK SPARSKIT QT gtest BoostAlbLib OpenNURBS Portals CrayPortals Gemini InfiniBand BGPDCMF BGQPAMI Pablo HPCToolkit Clp GLPK qpOASES PAPI MATLABLib Eigen X11 Lemon GLM quadmath CAMAL RTlib AmgX CGAL CGALCore VTune TASMANIAN ArrayFireCPU SimMesh SimModel SimParasolid SimAcis SimField Valgrind QUO ViennaCL Avatar mlpack pebbl MAGMASparse Check 101

ralberd commented 3 years ago

Just checking in to see if there has been any progress.

seheracer commented 3 years ago

Hi Ryan,

Thanks for checking in. I built Trilinos 12.18.1 and ran your driver with the kokkos and kokkos-kernels libraries in that build. Unfortunately, I can't reproduce the issue; the results are always correct for me. @srajama1 any suggestions?

ralberd commented 3 years ago

That's interesting...

What version of CUDA are you using? Also, what was the Trilinos build configuration you used?

srajama1 commented 3 years ago

The only things I can think of are the CUDA version and the configuration options, which @ralberd is already checking. Let us make sure both of you have the same configuration. @ralberd Do you have access to our testbed machines (weaver, white)?

ralberd commented 3 years ago

I do not currently have access to those machines, but can request it through WebCARS.

seheracer commented 3 years ago

On white:

  1. I downloaded trilinos-release-12-18-1 from https://github.com/trilinos/Trilinos/releases.

  2. In Trilinos, I did: source cmake/std/atdm/load-env.sh cuda-10.1.105-opt-Pascal60. This loads the following modules:

    1) cuda/10.1.105                                            6) openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105                   11) cgns/20190329/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105     16) parmetis/4.0.3/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105
    2) ucx/1.5.1                                                7) pnetcdf/1.9.0/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105     12) superlu/4.3.0/gcc/7.2.0                                 17) cmake/3.12.3
    3) binutils/2.30.0                                          8) zlib/1.2.8                                              13) openblas/0.3.4/gcc/7.4.0                                18) git/2.10.1
    4) gcc/7.2.0                                                9) hdf5/1.10.5/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105       14) boost/1.65.1/gcc/7.2.0                                  19) valgrind/3.12.0
    5) papi/5.6.0                                              10) netcdf/4.6.1/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105      15) metis/5.0.1/gcc/7.2.0                                   20) devpack/20190404/openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105
  3. I built and installed Trilinos with the following:

    cmake \
    -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
    -DCMAKE_BUILD_TYPE:STRING=RELEASE \
    -DCMAKE_INSTALL_PREFIX:FILEPATH=/ascldap/users/sacer/Trilinos-trilinos-release-12-18-1/install \
    -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
    -DTrilinos_ENABLE_TESTS:BOOL=OFF \
    -DTrilinos_ENABLE_EXAMPLES:BOOL=OFF \
    -DTrilinos_ENABLE_KokkosKernels:BOOL=ON \
    -DTPL_ENABLE_MPI:BOOL=ON \
    -DTPL_ENABLE_CUDA:BOOL=ON \
    -DTPL_ENABLE_CUSPARSE:BOOL=OFF \
    -DKokkos_ENABLE_Cuda:BOOL=ON \
    -DKokkos_ENABLE_Cuda_Lambda:BOOL=ON \
    -DKokkos_ENABLE_Cuda_UVM:BOOL=ON \
    -DKokkosKernels_ENABLE_TESTS:BOOL=ON \
    -DKokkosKernels_ENABLE_EXAMPLES:BOOL=ON \
    ../
  4. I compiled the driver as follows: ../Trilinos-trilinos-release-12-18-1/packages/kokkos/bin/nvcc_wrapper main.cpp -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -L/ascldap/users/sacer/Trilinos-trilinos-release-12-18-1/install/lib/ -lkokkoskernels -lkokkoscore -lkokkoscontainers -I/ascldap/users/sacer/Trilinos-trilinos-release-12-18-1/install/include

  5. The SpGEMM results are correct:

    
    B1 * A1:

    0 4 8

    0 1 2 3 0 1 2 3

    2 1 1 3 2 1 1 3
    Matches gold row map: 1
    Matches gold col map: 1
    Matches gold entries: 1
    A1 * B2:

    0 2 4

    0 1 0 1

    4 4 3 3
    Matches gold row map: 1
    Matches gold col map: 1
    Matches gold entries: 1
    A2 * B2:

    0 2 4 6 8

    0 1 0 1 0 1 0 1

    2 2 1 1 3 3 4 4
    Matches gold row map: 1
    Matches gold col map: 1
    Matches gold entries: 1
    done.

ralberd commented 3 years ago

Okay, thanks. I am working on building with that configuration. One difference I noticed is that your build has UVM on. Could that play a role?

seheracer commented 3 years ago

I am not sure whether Trilinos can be built with UVM off yet. I know there is some progress on it, so I gave it a try by passing -DKokkos_ENABLE_Cuda_UVM:BOOL=OFF, but the CMake output still contains the following:

=======================
KokkosKernels ETI Types
   Devices:  <Cuda,CudaSpace>;<Cuda,CudaUVMSpace>;<Serial,HostSpace>

The SpGEMM results are still correct.

lucbv commented 3 years ago

UVM off is not supported in Trilinos yet, and specifying it in Kokkos is not doing much. I would strongly advise against it, and if a build is doing it and reporting erroneous behavior, I would suggest removing it as a first step toward debugging.

ralberd commented 3 years ago

I was able to build on white following the steps you detailed above and I got the test to pass, same as you.

I am seeing the issue when I build on a machine called lumbergh running Ubuntu 18, and have also seen it on a personal machine running Ubuntu 18. I wonder if the issue could be related to the OS.

I can get you access to lumbergh through the SRN, I would just need a username. That way you could build there to see the issue.

brian-kelley commented 3 years ago

(this is already in the email thread but) I didn't replicate this on Vortex, with CudaSpace or CudaUVMSpace, and with Release or RelWithDebInfo builds. I used matching CUDA (10.2.89) and Trilinos (12.18.1) versions.

brian-kelley commented 3 years ago

Update: this ended up not being a code bug, but a bug in the Trilinos spack configuration. It didn't enable the CMake flag Kokkos_ARCH_VOLTA70, but nvcc_wrapper was still using compute capability 7.0. So the KOKKOS_ARCH_VOLTA macro wasn't defined and the Volta-specific code path in SpGEMM wasn't being taken.
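
A mismatch like this could in principle be spotted by asking the compiler which architecture macros the installed Kokkos configuration defines. The following is a minimal sketch, not something taken from the thread. It assumes the Kokkos install exposes the header Kokkos_Macros.hpp and that a Volta configuration shows up as KOKKOS_ARCH_VOLTA70 and/or KOKKOS_ARCH_VOLTA; those macro names should be checked against the Kokkos version in use. voltaArchStatus and SKETCH_HAVE_KOKKOS_MACROS are hypothetical names introduced for illustration.

```cpp
#include <string>

// Pull in the Kokkos configuration macros only if the header is on the
// include path, so the sketch also compiles without Kokkos installed.
#if defined(__has_include)
#if __has_include(<Kokkos_Macros.hpp>)
#include <Kokkos_Macros.hpp>
#define SKETCH_HAVE_KOKKOS_MACROS 1  // hypothetical marker for this sketch
#endif
#endif

// Reports, at compile time, whether a Volta architecture macro is visible.
// Assumption: these are the macro names the Kokkos configuration defines.
inline std::string voltaArchStatus() {
#if !defined(SKETCH_HAVE_KOKKOS_MACROS)
  return "Kokkos_Macros.hpp not found on the include path";
#elif defined(KOKKOS_ARCH_VOLTA70) || defined(KOKKOS_ARCH_VOLTA)
  return "Volta architecture macro is defined";
#else
  return "Kokkos found, but no Volta architecture macro is defined";
#endif
}
```

Printing voltaArchStatus() from a small program compiled with the same nvcc_wrapper and include path as the driver would then indicate whether the build described above would take the Volta-specific path or fall through to the generic one.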