gunrock / graphblast

High-Performance Linear Algebra-based Graph Primitives on GPUs
Apache License 2.0
211 stars · 27 forks

Graphblast with sms >= 70 #9

Open jesunsahariar opened 4 years ago

jesunsahariar commented 4 years ago

Hello,

Thank you for hosting Graphblast on a public repo to help the research community.

I was wondering whether there is any plan to get GraphBlast working on the latest SMs. I am finding it challenging to get the mgpu version leveraged by GraphBlast to work on the latest SMs. I tried to put some patches into the mgpu version currently being used by GraphBlast, in particular for the synchronization primitives (mostly the shuffles and ballots suggested by @neoblizz in the mgpu repo), and I am encountering hangs for algorithms such as BFS with medium-sized matrices.

I would really appreciate any insight. Thanks in advance!

neoblizz commented 4 years ago

Just a note, @YuxinxinChen might have looked into this in the past.

YuxinxinChen commented 4 years ago

Replacing the intrinsics.cuh file here: https://github.com/ctcyang/moderngpu/blob/9e491c383e935c2cbc0279350640dad3febb8b9d/include/device/intrinsics.cuh by intrinsics.cuh here: https://github.com/moderngpu/moderngpu/blob/5029d38cab83492d8091cce5902c077ab3ca72a9/include/device/intrinsics.cuh might solve the problem
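(For context, the core difference between the two files is the move from the legacy warp intrinsics to the explicit-mask `_sync` variants introduced in CUDA 9. A minimal sketch of the pattern, not the actual contents of either `intrinsics.cuh`, might look like:)

```cuda
#include <cuda_runtime.h>

// Hedged sketch of the intrinsics change, not the actual file contents:
// on sm_70+ the legacy warp intrinsics must be replaced by the
// explicit-mask _sync variants introduced in CUDA 9.
__device__ int warp_scan_step(int value, int offset) {
#if __CUDACC_VER_MAJOR__ >= 9
  // New style: every participating lane passes an explicit mask.
  unsigned mask = __activemask();
  int up = __shfl_up_sync(mask, value, offset);
#else
  // Legacy style: implicit full-warp participation was assumed.
  int up = __shfl_up(value, offset);
#endif
  int lane = threadIdx.x & 31;
  return (lane >= offset) ? value + up : value;
}
```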

jesunsahariar commented 4 years ago

Replacing the intrinsics.cuh file here: https://github.com/ctcyang/moderngpu/blob/9e491c383e935c2cbc0279350640dad3febb8b9d/include/device/intrinsics.cuh by intrinsics.cuh here: https://github.com/moderngpu/moderngpu/blob/5029d38cab83492d8091cce5902c077ab3ca72a9/include/device/intrinsics.cuh might solve the problem

That's pretty much what I did with @neoblizz's patches for moderngpu. Were you able to run any graph kernel of GraphBlast with a reasonably-sized matrix on sm >= 70 once you replaced the intrinsics.cuh file? If yes, could you please let me know which kernels you were able to run and perhaps the inputs you used? Thanks in advance for your reply.

neoblizz commented 4 years ago

@jsfiroz it may be helpful to see what specific error you got. :)

jesunsahariar commented 4 years ago

Hi,

Apologies for the late response. Here are the details of the modifications I made and the problem I am currently encountering:

I have modified the following files of mgpu to compile and run on devices with sm>=70 (mostly related to ballot and shfl):

include/device/ctascan.cuh 
include/device/ctasegscan.cuh 
include/device/intrinsics.cuh

I made some changes to the CMake file since I needed relocatable device code to be generated here:
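(For anyone reproducing this, a hedged sketch of the kind of CMake change meant here, not GraphBlast's actual CMakeLists.txt:)

```cmake
# Hedged sketch, not the actual GraphBlast build file: enable
# relocatable device code and target sm_70 via the legacy FindCUDA
# flags variable.
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS}
    -gencode arch=compute_70,code=sm_70
    --relocatable-device-code=true)
```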

Next, I compiled and ran gbfs with the following command on the delaunay_n10 matrix, downloaded from https://sparse.tamu.edu/DIMACS10/delaunay_n10:

./bin/gbfs --timing 0 --earlyexit 1 --mxvmode 0 --struconly 1 --niter 1 --opreuse 1  --debug 1 graphblast/data/mydata/delaunay_n10/delaunay_n10.mtx 

The program hangs, presumably in one of the branches here:

I am running on a GeForce RTX 2080 Ti GPU.

Any feedback would be greatly appreciated. Please let me know if I can provide any additional information. Thanks in advance!

Partial output (the program hangs in one of the aforementioned branches, perhaps something related to the ReduceByKey operation in mgpu?):

===Begin assign===
Input: 1
Executing assignDense
Mask: 1
Accum:0
SCMP: 0
Repl: 0
Tran: 0
mask_ind:
[0]:0
mask_val:
[0]:1
w_val:
[0]:1 [1]:0 [2]:0 [3]:0 [4]:0 [5]:0 [6]:0 [7]:0 [8]:0 [9]:0 [10]:0 [11]:0 [12]:0 [13]:0 [14]:0 [15]:0 [16]:0 [17]:0 [18]:0 [19]:0 [20]:0 [21]:0 [22]:0 [23]:0 [24]:0 [25]:0 [26]:0 [27]:0 [28]:0 [29]:0 [30]:0 [31]:0 [32]:0 [33]:0 [34]:0 [35]:0 [36]:0 [37]:0 [38]:0 [39]:0
===End assign===
val:
[0]:1 [1]:0 [2]:0 [3]:0 [4]:0 [5]:0 [6]:0 [7]:0 [8]:0 [9]:0 [10]:0 [11]:0 [12]:0 [13]:0 [14]:0 [15]:0 [16]:0 [17]:0 [18]:0 [19]:0 [20]:0 [21]:0 [22]:0 [23]:0 [24]:0 [25]:0 [26]:0 [27]:0 [28]:0 [29]:0 [30]:0 [31]:0 [32]:0 [33]:0 [34]:0 [35]:0 [36]:0 [37]:0 [38]:0 [39]:0
===Begin vxm===
ind:
[0]:0
val:
[0]:1
Load balance mode: 2
Identity: 0
Sparse format: 0
Symmetric: 0
u_vec_type: 1
Executing Spmspv MERGE
In structure only mode
Mask: 1
Accum:0
SCMP: 0
Repl: 0
Tran: 1
NT: 128 NB: 1
d_temp_nvals:
[0]:9
d_scan:
[0]:0 [1]:9
u_nvals: 1
w_nvals: 9
SwapInd:
[0]:64 [1]:242 [2]:299 [3]:301 [4]:302 [5]:303 [6]:305 [7]:315 [8]:317
1 bytes required!
TempInd:
[0]:64 [1]:242 [2]:299 [3]:301 [4]:302 [5]:303 [6]:305 [7]:315 [8]:317
ctcyang commented 4 years ago

Thanks @neoblizz and @YuxinxinChen for your help!

With regard to your issue @jsfiroz, as far as I can tell the problem is that moderngpu does not support the new _sync variants required by CUDA 10.1 and up. If I try that dataset, it gets the d_scan result wrong, which is an output of mgpu::Scan (I tried both mgpu::Scan and my modified mgpu::ScanPrealloc, and neither gets the right answer). My guess is that moderngpu relies on some assumptions regarding synchronization which are no longer met with the new _sync variants. As a result, some of its public methods, like Scan or ReduceByKey, do not give the right result.
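(To illustrate the kind of assumption that breaks, here is my own hedged sketch, not mgpu code: pre-Volta kernels often relied on implicit warp-lockstep execution, which sm_70's independent thread scheduling no longer guarantees:)

```cuda
// Hedged illustration (not mgpu code) of a warp-synchronous assumption
// that breaks under sm_70's independent thread scheduling.
// For simplicity this assumes one warp per block.
__device__ int broadcast_from_lane0(int value) {
  __shared__ int slot;
  if ((threadIdx.x & 31) == 0)
    slot = value;  // lane 0 writes the value to be broadcast
  // Pre-Volta, all lanes ran in lockstep, so reading here "worked".
  // On sm_70+ the lanes may have diverged, so an explicit warp-level
  // sync is required before every lane reads:
  __syncwarp();
  return slot;
}
```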

The easiest solution I can think of is compiling with the architecture set to sm_70 and setting the environment variable CUDA_HOME to /usr/local/cuda-10.0 or lower (export CUDA_HOME=/usr/local/cuda-10.0). With that, I get the correct solution:

./bin/gbfs --timing 0 --earlyexit 1 --mxvmode 0 --struconly 1 --niter 1 --opreuse 1 --debug 0 /data/gunrock_dataset/large/delaunay_n10/delaunay_n10.mtx
Undirected due to mtx: 1
Undirected due to cmd: 0
Undirected: 1
Remove self-loop: 1
Reading /data/gunrock_dataset/large/delaunay_n10/.delaunay_n10.mtx.ud.nosl.bin
Allocate 1025
Allocate 7334
Allocate 7334
CPU BFS finished in 0.032187 msec. Search depth is: 18

CORRECT
cpu, 0.0460148,
warmup, 2.14911, 0
tight, 1.70672
vxm, 1.97005

CORRECT
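(For completeness, the workaround described above can be sketched as a shell snippet; the paths are examples for illustration, so adjust them to your installation:)

```shell
# Hedged sketch of the workaround: point the build at a pre-10.1 CUDA
# toolkit before rebuilding with the architecture set to sm_70.
export CUDA_HOME=/usr/local/cuda-10.0
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
# then rebuild GraphBlast and rerun, e.g.:
# make clean && make
```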