start_pos stores indices into neighbours, and neighbours can have up to |V|^2 entries. Even though |V| is bounded by uint32, |V|^2 is not guaranteed to be. Hence start_pos can overflow at 32-bit width.
The width needs to change everywhere these two values are used. For tot_num_nbs, there is also a CUDA memcpy D2H operation; the size of that copy needs to change from 32 to 64 bit as well.
In calc_start_pos, last_vtx_start_pos should be 64 bit as well.
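
To illustrate the overflow, here is a minimal host-side sketch of an exclusive prefix sum over per-vertex neighbour counts. It is not the actual calc_start_pos code: the helper name exclusive_offsets, the num_nbs input, and the template parameter are assumptions made for illustration only. The accumulator width is templated so that the 32-bit and 64-bit results can be compared on the same input.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not taken from the code under review.
// Computes an exclusive prefix sum over per-vertex neighbour counts.
// Returns |V| + 1 entries: entry v is the start offset of vertex v,
// and the last entry is the total number of neighbours.
template <typename OffsetT>
std::vector<OffsetT> exclusive_offsets(const std::vector<uint32_t>& num_nbs) {
    std::vector<OffsetT> offsets(num_nbs.size() + 1);
    OffsetT running = 0;
    for (std::size_t v = 0; v < num_nbs.size(); ++v) {
        offsets[v] = running;
        // Unsigned arithmetic wraps modulo 2^N, where N is the width of OffsetT.
        running = static_cast<OffsetT>(running + num_nbs[v]);
    }
    offsets[num_nbs.size()] = running;
    return offsets;
}
```

With counts {2^31, 2^31, 1}, each count fits in a uint32, but the 32-bit offsets wrap at the third vertex while the 64-bit offsets stay correct. Once the total is held in a 64-bit variable, the D2H copy that reads it back has to transfer sizeof(uint64_t) bytes rather than sizeof(uint32_t).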