harvardnlp / botnet-detection

Topological botnet detection datasets and graph neural network applications
MIT License

Graph loading extremely slow? #5

Closed · iamgroot42 closed this issue 3 years ago

iamgroot42 commented 3 years ago

Is it just me, or does getting the data loaders ready take much longer than it should? I'm on a machine with 512GB RAM (most of which is free), a 40-core Xeon Silver, and SSD storage. Loading the graphs (train, test, and val included) takes at least 30-40 minutes. For context, loading a single graph with 170K nodes and 1.1M edges (via OGB) takes less than 10 seconds.

Is it normal for these graphs to take this long to load? If so, is there any way to speed up the process (apart from setting in_memory=False, which, if I understand correctly, just defers the loading until later)?
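
For concreteness, the OGB comparison I have in mind looks roughly like this (I'm using ogbn-arxiv here as an illustrative stand-in, since its size, about 169K nodes and 1.2M edges, matches the graph I mentioned):

from ogb.nodeproppred import PygNodePropPredDataset
import time

start = time.time()
dataset = PygNodePropPredDataset(name='ogbn-arxiv')  # downloads on first run
graph = dataset[0]
print('OGB graph loading time: ', time.time() - start, 'seconds')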

jzhou316 commented 3 years ago

Hi, that does seem odd. I tested on my side (assuming the dataset has already been downloaded) by running the following command:

from botdet.data.dataset_botnet import BotnetDataset
from botdet.data.dataloader import GraphDataLoader
import time

start = time.time()
botnet_dataset_train = BotnetDataset(name='chord', split='train', graph_format='pyg')
end = time.time()
print('training data loading time: ', end - start, 'seconds')

with the following result:

training data loading time:  6.43342924118042 seconds
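
(As an aside, the GraphDataLoader imported above is only needed once you start batching graphs for training; a minimal sketch of that usage, with the batch size and worker count as illustrative values, would be:)

train_loader = GraphDataLoader(botnet_dataset_train, batch_size=2, shuffle=False, num_workers=0)
# iterate: each batch is a set of graphs in the chosen graph_format ('pyg' here)
for batch in train_loader:
    break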

There is some variation. Sometimes it is less, and sometimes it can be more, by as much as a few minutes (mostly the first time running). I think this is due to the physical structure of the disk storage in my case.

About the graph size: the whole dataset (including train, val, and test) contains 960 graphs, where each graph has on average around 140K nodes and around 1.5M edges, so the total is much larger than a single graph. Even at the same per-graph loading speed you see for a single graph (via OGB), with about 1K of them you should expect loading to take considerably longer (although in practice it takes much less than that calculation suggests, which is the case on my end).
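
As a rough back-of-the-envelope check using those numbers: 960 graphs × ~10 s per graph ≈ 9,600 s, i.e. about 2.7 hours, so loading that actually finishes in seconds to minutes is already far better than a naive per-graph extrapolation.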

About the flag in_memory=False: yes, you are right. If set to True (the default), the whole HDF5 file is loaded into memory at once at the beginning (which may take some time), but later access through the dataloader may be faster. If set to False, the dataloader reads data directly from storage each time, without first loading the large file into memory.
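
In code the trade-off looks like this (a minimal sketch; apart from in_memory, the arguments are the same as in the snippet above):

# in_memory=True (default): load the whole HDF5 file up front; faster later access
dataset_in_mem = BotnetDataset(name='chord', split='train', graph_format='pyg', in_memory=True)

# in_memory=False: skip the up-front load; each access reads from the file on disk
dataset_on_disk = BotnetDataset(name='chord', split='train', graph_format='pyg', in_memory=False)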

I think to debug, you might want to find the data stored at data/botnet/processed/chord_train.hdf5, and try

import deepdish as dd

data = dd.io.load('data/botnet/processed/chord_train.hdf5')

and see how long it takes.
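
For example, a timed version of that check (same file path as above):

import time
import deepdish as dd

start = time.time()
data = dd.io.load('data/botnet/processed/chord_train.hdf5')
print('raw HDF5 loading time: ', time.time() - start, 'seconds')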

iamgroot42 commented 3 years ago

I tried the same code on my local laptop, and it took 80 seconds (still much more than yours, but far less than what I saw on my server), so this is definitely a system-specific issue; something is going on with the server storage. Thanks for all your help :)