Research advising session 6 (11/3)

jh0shin commented 2 years ago

Tasks

Identify and resolve the cause of the mismatched results between the kernel functions in benchmark.py

jh0shin commented 2 years ago

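Resolving the mismatch in kernel results

The pyg and dgl kernel calls were not returning the same result. To isolate the cause, the density of the sparse matrix was set to 1 and the dense matrix was declared as an identity matrix, so each kernel's output is just the matrix it actually multiplied by. Under that setup the pyg kernel output and the dgl kernel output were mirrored across the main diagonal, i.e. they were transposes of each other. Swapping the row and col of the coo matrix passed as a parameter when creating the DGLGraph made the dgl kernel return the same result as the pyg kernel.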

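# u_mul_e_sum: multiply source-node features by edge weights, then sum per destination node
op = getattr(ops, 'u_mul_e_sum')
_coo = spmat.tocoo()
row, col, data = _coo.row, _coo.col, _coo.data
# Swap row and col (transpose) so the DGL result matches the pyg kernel;
# pass the shape explicitly so trailing empty rows/columns are not dropped
_g = dgl.from_scipy(sparse.coo_matrix((data, (col, row)), shape=_coo.shape[::-1]), eweight_name='w').to('cuda')
with timer("DGL spmm"):
    for _ in range(ITER):
        dgl_out = op(_g, mat, _g.edata['w'])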
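
The identity-matrix check described above can be reproduced on CPU with NumPy and SciPy alone. The sketch below is a stand-in, not the DGL or PyG kernel: `aggregate_at_dst` is a hypothetical helper that emulates a "multiply by edge weight, sum at the destination node" reduction, under the assumption that the COO row is treated as the source node and col as the destination.

```python
import numpy as np
from scipy import sparse

n = 4
rng = np.random.default_rng(34)
# Density-1 sparse matrix: every entry is non-zero and stored explicitly.
spmat = sparse.coo_matrix(rng.random((n, n)) + 0.1)
identity = np.eye(n)

def aggregate_at_dst(src, dst, weight, feat, num_nodes):
    """Hypothetical stand-in: sum weight * feat[src] into each destination row."""
    out = np.zeros((num_nodes, feat.shape[1]))
    # np.add.at accumulates correctly when dst contains repeated indices.
    np.add.at(out, dst, weight[:, None] * feat[src])
    return out

# Row-major SpMM against the identity returns the matrix itself.
spmm_out = spmat @ identity
# (row -> src, col -> dst): the result is the transpose of the matrix.
as_is = aggregate_at_dst(spmat.row, spmat.col, spmat.data, identity, n)
# Swapped (col -> src, row -> dst): the result matches the row-major SpMM.
swapped = aggregate_at_dst(spmat.col, spmat.row, spmat.data, identity, n)

print(np.allclose(as_is, spmm_out.T), np.allclose(swapped, spmm_out))
```

Multiplying by the identity is what makes the orientation visible: any transposition shows up directly in the output instead of being hidden inside a product.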

jh0shin commented 2 years ago

PyG using dgl kernel

These are the results of re-running the week 5 code after modifying it to return the same computation result.

# Before

# Profiling result
$ python3 arxiv/baseline_timer2.py --platform pyg --dataset ogbn-arxiv
Using backend: pytorch
Accuracy: 0.6690
--- Timer summary -----------------------------------------------
  Event                          |  Count | Average time |  Frac.
- bias                           |    303 |     0.00049s |   1.6%
- gcn_norm                       |    303 |     0.00729s |  24.3%
- message_and_aggregate          |    303 |     0.01141s |  38.1%
- mul                            |    303 |     0.00078s |   2.6%
- propagate                      |    303 |     0.01151s |  38.4%
-----------------------------------------------------------------

# After
op = getattr(ops, 'u_mul_e_sum')

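# Convert the SparseTensor to a scipy COO matrix, then swap row and col before building the DGLGraph
_src = src.to_scipy('coo')
row, col, data = _src.row, _src.col, _src.data
_g = dgl.from_scipy(sparse.coo_matrix((data, (col, row)), shape=_src.shape[::-1]), eweight_name='w').to('cuda')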

return op(_g, other, _g.edata['w'])

# Profiling result
$ python3 arxiv/baseline_timer2.py --platform pyg --dataset ogbn-arxiv
Using backend: pytorch
Accuracy: 0.6644
--- Timer summary -----------------------------------------------
  Event                          |  Count | Average time |  Frac.
- bias                           |    303 |     0.00053s |   0.7%
- gcn_norm                       |    303 |     0.00755s |   9.3%
- message_and_aggregate          |    303 |     0.05530s |  68.3%
- mul                            |    303 |     0.00077s |   1.0%
- propagate                      |    303 |     0.05543s |  68.5%
-----------------------------------------------------------------

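Accuracy is very similar (0.6690 before, 0.6644 after), but the average time of message_and_aggregate rose from 0.01141s to 0.05530s, roughly 4.8x. The extra time appears to be spent in torch_sparse.matmul.matmul, where the SparseTensor is converted to a scipy.sparse.coo_matrix and then to a DGLGraph on every call.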
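
One way to check where the extra time goes is to time the conversion and the kernel call under separate labels. The sketch below is a minimal CPU-only illustration: the `timer` context manager, the `summary` helper and the labels "convert" and "kernel" are hypothetical and are not the timer used in baseline_timer2.py. SciPy stands in for the DGL and PyG kernels.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

import numpy as np
from scipy import sparse

_totals = defaultdict(float)
_counts = defaultdict(int)

@contextmanager
def timer(label):
    """Accumulate wall-clock time and call count per label (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _totals[label] += time.perf_counter() - start
        _counts[label] += 1

def summary():
    """Return {label: (count, average seconds)}."""
    return {k: (_counts[k], _totals[k] / _counts[k]) for k in _totals}

rng = np.random.default_rng(34)
csr = sparse.random(2000, 2000, density=0.001, format="csr", random_state=34)
x = rng.random((2000, 16))

for _ in range(10):
    with timer("convert"):
        # Format conversion plus row/col swap, repeated on every call.
        coo = csr.tocoo()
        swapped = sparse.coo_matrix(
            (coo.data, (coo.col, coo.row)), shape=coo.shape[::-1]
        ).tocsr()
    with timer("kernel"):
        out = swapped @ x

for label, (count, avg) in summary().items():
    print(f"{label:10s} | {count:5d} | {avg:.5f}s")
```

On a GPU the kernels run asynchronously, so a synchronization call such as torch.cuda.synchronize() would be needed before each clock read for the split to be meaningful.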

jh0shin commented 2 years ago

DGL using PyG kernel

These are the results of re-running the week 5 experiment after modifying the code to return the same computation result.

# Before

# Profiling result
$ python3 arxiv/baseline_timer2.py --platform dgl --dataset ogbn-arxiv
Using backend: pytorch
Accuracy: 0.5917
--- Timer summary -----------------------------------------------
  Event                          |  Count | Average time |  Frac.
- activation                     |    303 |     0.00044s |   3.1%
- bias                           |    303 |     0.00046s |   3.2%
- degree                         |    606 |     0.00045s |   6.2%
- etc                            |    606 |     0.00003s |   0.4%
- expand_as_pair                 |    303 |     0.00001s |   0.1%
- fn.copy_src                    |    303 |     0.00001s |   0.1%
- mul                            |    606 |     0.00052s |   7.2%
- shape                          |    606 |     0.00001s |   0.1%
- th.matmul                      |    303 |     0.00077s |   5.3%
- th.pow                         |    606 |     0.00007s |   0.9%
- th.reshape                     |    606 |     0.00002s |   0.2%
- update_all; fn.sum             |    303 |     0.00458s |  31.5%
-----------------------------------------------------------------

# After
device = torch.device('cuda')
g = graph.adj(ctx=device).coalesce()
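# Swap the row and col index rows (transpose) to match the pyg kernel's orientation
_indice = g.indices().flip(0)
_value = g.values()
# Pass the size explicitly so isolated trailing nodes are not dropped
_coo = torch.sparse_coo_tensor(_indice, _value, g.shape[::-1])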
_g = SparseTensor.from_torch_sparse_coo_tensor(_coo).to(device)
z = ts.matmul(_g, x, 'add')

# Profiling result
$ python3 arxiv/baseline_timer2.py --platform dgl --dataset ogbn-arxiv
Using backend: pytorch
Accuracy: 0.5909
--- Timer summary -----------------------------------------------
  Event                          |  Count | Average time |  Frac.
- activation                     |    303 |     0.00045s |   1.7%
- bias                           |    303 |     0.00046s |   1.7%
- degree                         |    606 |     0.00047s |   3.5%
- etc                            |    606 |     0.00004s |   0.3%
- expand_as_pair                 |    303 |     0.00001s |   0.0%
- fn.copy_src                    |    303 |     0.00001s |   0.0%
- mul                            |    606 |     0.00051s |   3.9%
- shape                          |    606 |     0.00001s |   0.1%
- th.matmul                      |    303 |     0.00077s |   2.9%
- th.pow                         |    606 |     0.00007s |   0.5%
- th.reshape                     |    606 |     0.00001s |   0.1%
- update_all; fn.sum             |    303 |     0.01636s |  61.8%
-----------------------------------------------------------------

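The accuracy gap is smaller than in the week 5 results (0.5917 before, 0.5909 after). The average time of update_all; fn.sum rose from 0.00458s to 0.01636s, roughly 3.6x. The extra time was spent converting the DGLGraphIndex to a torch sparse COO tensor, swapping row and col, converting it again to a SparseTensor, and passing it to the pyg kernel.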

jh0shin commented 2 years ago

Measuring raw kernel execution time

For each kernel, the execution time was measured over 100 runs while varying the matrix size and density.
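
The measurement procedure can be sketched as a small sweep. This is a CPU-only illustration under stated assumptions: SciPy's CSR product stands in for the GPU kernels, `mean_spmm_time` and `feat_dim` are hypothetical names, and the sizes are kept small so it runs quickly. The seed and iteration count follow the settings listed in this comment.

```python
import time

import numpy as np
from scipy import sparse

def mean_spmm_time(n, density, feat_dim=16, iters=100, seed=34):
    """Average seconds per sparse @ dense product for an n x n random matrix."""
    rng = np.random.default_rng(seed)
    spmat = sparse.random(n, n, density=density, format="csr", random_state=seed)
    mat = rng.random((n, feat_dim))
    spmat @ mat  # warm-up run, excluded from the measurement
    start = time.perf_counter()
    for _ in range(iters):
        out = spmat @ mat
    elapsed = time.perf_counter() - start
    return elapsed / iters, out.shape

# Sweep over matrix size at fixed density.
by_size = {n: mean_spmm_time(n, density=0.001) for n in (500, 1000, 2000)}
# Sweep over density at fixed matrix size.
by_density = {d: mean_spmm_time(1000, density=d) for d in (0.0001, 0.001, 0.01)}

for n, (avg, _) in by_size.items():
    print(f"size {n:5d}: {avg:.6f}s")
for d, (avg, _) in by_density.items():
    print(f"density {d}: {avg:.6f}s")
```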

arxiv dataset
- sparse matrix : 169343*169343 size, 2484941 nnz, 0.00866% density
- dense matrix : 169343*128 or 169343*40
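
The density figure above follows directly from the node count and nnz. A quick check, using only the numbers listed in this comment (the variable names are illustrative):

```python
num_nodes = 169343
nnz = 2484941

# Density is the fraction of stored entries in the num_nodes x num_nodes matrix.
density = nnz / (num_nodes * num_nodes)
# Average number of stored entries per row.
avg_nnz_per_row = nnz / num_nodes

print(f"density = {density:.3e} ({density * 100:.6f}%)")
print(f"average nnz per row = {avg_nnz_per_row:.2f}")
```

This puts arxiv at a density of about 0.0000866, more than ten times sparser than the 0.001 used in the size sweep.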

Average execution time of each kernel by matrix size

density = 0.001, random seed = 34, iteration = 100

Matrix Size	Dense	DGL	PyG	DGL (copy_u_sum)
5000	0.01239	0.00046	0.00017	0.00026
10000	0.04967	0.00067	0.00030	0.00039
50000	OOM	0.00583	0.00335	0.00380
100000	OOM	0.02110	0.01498	0.01665
200000	OOM	0.07345	0.06929	0.05362
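
A rough estimate of why the Dense column runs out of memory from size 50000 onward, assuming (this is not stated in the thread) that the dense baseline materializes the full n x n matrix in float32. `dense_square_gib` is a hypothetical helper, and the GPU memory capacity is not given here.

```python
def dense_square_gib(n, bytes_per_elem=4):
    """GiB needed to store an n x n matrix (float32 by default, an assumption)."""
    return n * n * bytes_per_elem / 2**30

for n in (5000, 10000, 50000, 100000, 200000):
    print(f"{n:7d}: {dense_square_gib(n):8.2f} GiB")
```

Under that assumption the 10000 case needs well under 1 GiB, while the 50000 case already needs more than 9 GiB for the matrix alone, before any output or workspace buffers.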

Execution time of each kernel by density

matrix size = 100000, random seed = 34, iteration = 100

Density	DGL	PyG	DGL (copy_u_sum)
0.0001	0.00293	0.00230	0.00197
0.0005	0.01001	0.00774	0.00699
0.001	0.01902	0.01510	0.01352
0.005	0.08829	0.06798	0.06570
0.01	OOM	OOM	OOM
