initzhang / DUCATI_SIGMOD

Accepted paper of SIGMOD 2023, DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU

Index error while running run_allocate file for UK Union dataset #8

Open snigdhas1612 opened 6 months ago

snigdhas1612 commented 6 months ago

File: /data/DUCATI_SIGMOD/run_allocate.py

I'm encountering an IndexError when running the code with the UK Union dataset for batch sizes of 8192 and 4096. The total budget passed as a parameter is 15-20GB or more. Despite 49.14 GB of available GPU memory, the code fails intermittently.

[image]

P.S. I ran into a similar issue with the Twitter dataset, but during the adjacency cache allocation.

[image]

We observed that the issue only appeared when the allocated adj cache size was larger than the total adj size. We therefore increased the fake_dim parameter, which reduces the adj budget so that the total adj never fits within the adj cache. However, the issue still occurs for the adj cache. For UK Union, the issue is with the nfeat cache allocation, where the total nfeat size is far larger than both the total budget and the allocated nfeat cache.

  1. Twitter: total adj size: 11.251GB, total nfeat size: 59.274GB
  2. UK union: total adj size: 42.031GB, total nfeat size: 64.717GB

I would really appreciate it if someone could shed some light on this. Thanks.
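To make the size comparison above concrete, here is a small sketch that checks which of the reported totals would fit entirely within a given budget. The numbers are the ones reported in this issue; the helper names `fits_entirely` and `summarize` are purely illustrative and are not part of DUCATI. Note that it compares against the total budget, while the adj and nfeat caches each receive only a share of it.

```python
# Total sizes in GB, as reported in this issue.
DATASETS = {
    "Twitter": {"adj": 11.251, "nfeat": 59.274},
    "UK Union": {"adj": 42.031, "nfeat": 64.717},
}


def fits_entirely(total_size_gb, budget_gb):
    """Return True when the whole structure would fit inside the budget."""
    return total_size_gb <= budget_gb


def summarize(budget_gb):
    """For each dataset, report whether adj and nfeat fit entirely in the budget."""
    return {
        name: {kind: fits_entirely(size, budget_gb) for kind, size in sizes.items()}
        for name, sizes in DATASETS.items()
    }


if __name__ == "__main__":
    for budget in (15, 20):
        print(budget, summarize(budget))
```

With budgets of 15 or 20 GB, only the Twitter adjacency data can fit entirely, which matches the observation that the adj cache can exceed the total adj size only for Twitter.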

initzhang commented 3 months ago

Hi @snigdhas1612, thanks for reporting this bug! I will fix the index problem later.

DUCATI was originally designed for gigantic graphs whose adj size and nfeat size both exceed GPU memory. If you set total_budget > adj size on a high-end GPU like the A100 (80GB), then GNNLab or NextDoor may be a better fit, since they accelerate sampling further by storing the adjacency matrix entirely in GPU memory.
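The root cause has not been identified in this thread, so the following is only a sketch of how the boundary case described above (a cache budget larger than the total data size) can be handled without indexing past the end. It assumes entries are already sorted by caching priority and that each has a known size in bytes; `entries_that_fit` is a hypothetical helper, not DUCATI code.

```python
from bisect import bisect_right
from itertools import accumulate


def entries_that_fit(entry_bytes, budget_bytes):
    """Return (count, used_bytes) for the leading entries that fit in the budget.

    entry_bytes: non-negative sizes, ordered from highest to lowest priority.
    """
    # Prefix sums: cumulative[i] is the space needed to cache entries 0..i.
    cumulative = list(accumulate(entry_bytes))
    # Number of prefix sums <= budget; never exceeds len(entry_bytes).
    count = bisect_right(cumulative, budget_bytes)
    # Index count - 1 is always valid when count > 0, even if the
    # budget is larger than the total size of all entries.
    used = cumulative[count - 1] if count > 0 else 0
    return count, used


if __name__ == "__main__":
    sizes = [4, 4, 4]
    print(entries_that_fit(sizes, 10))   # budget smaller than the total
    print(entries_that_fit(sizes, 100))  # budget larger than the total
```

The point of the design is that the result is a count used as a slice bound rather than a position used as an index, so an oversized budget simply caches everything.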