velpc opened this issue 3 years ago
Thank you so much! These are very timely and helpful. Can you provide information on how to generate an xx_split_idx.pkl file from the dataset, and its storage format?
The xx_split_idx.pkl file stores the indexes used to split the original large graph dataset (in the HDF5 format) into train/validation/test sets. It is a dictionary with keys "train", "val", and "test", where each value is a list of the graph id numbers in the corresponding subset. We use this to split the original single graph dataset into separate storage for train/validation/test during preprocessing, as in here. For our dataset, these splits are randomly generated based on the total number of graphs in the dataset, with an 8:1:1 ratio for train/validation/test, and are fixed thereafter for community use.
Thanks for the clean datasets! One issue I have regarding the data specification: graph_data_storage.md specifies x as node signals/features, but I can't find these in any of the hdf5 files. Furthermore, README.md suggests these are featureless graphs. Can you clarify?
Yes, x stores the node features. As our graphs are featureless, we do not have them in the raw data. However, the GNN algorithms need some values to operate with in order to propagate information through the topology, so we simply add a dummy all-one vector in x (this is done in our data processing when the dataset is constructed from the raw data). This means all the nodes are treated homogeneously, and we focus on learning purely from topology for our botnet dataset.
Also note that the data format we created for large graph datasets can easily be extended with other special graph attributes based on your problem. Hope this is helpful!
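As an illustration, a minimal sketch of the dummy-feature idea when building a graph object; the edge_index values here are made up, and constructing a PyG Data object directly is just one of the supported formats:
import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1, 2, 3],           # hypothetical 5-node featureless graph
                           [1, 2, 3, 4]], dtype=torch.long)
x = torch.ones(5, 1)                               # dummy all-one node features
data = Data(x=x, edge_index=edge_index)            # GNN layers now have values to propagate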
Thanks @jzhou316, that makes perfect sense - but it might be a nice addition to graph_data_storage.md :).
Cool, I'll add some details.
The detailed instructions are very helpful. How should we set num_evils and num_evils_avg if our problem is multi-class classification rather than binary classification (evil/non-evil)?
@velpc These are dataset statistics stored in the HDF5 file (and may not be used by the model). For different problems such as multi-class classification, you can write your own data following our format, with whatever dataset attributes suit your task. For example, you could have attributes such as "num_class_0", "num_class_1", "num_class_2", etc. to describe the dataset. We have some example code for writing these attributes here. Hope this answers your question!
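For concreteness, a minimal sketch of writing such per-class statistics with h5py; the file name, labels, and attribute names are hypothetical, following the storage format described above rather than the repo's exact writer code:
import h5py
import numpy as np

y = np.array([0, 2, 1, 0, 2])                      # hypothetical node labels for one graph
with h5py.File('mydata.hdf5', 'w') as f:
    g = f.create_group('0')                        # graph with id '0'
    g.create_dataset('y', data=y)
    for c in range(3):
        g.attrs[f'num_class_{c}'] = int((y == c).sum())   # per-class counts as attributes
    f.attrs['num_graphs'] = 1                      # dataset-level statistics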
Hi @jzhou316, is there a much smaller dataset that can be used for quick testing of the algorithm? I wanted to try it out on a smaller subset without having to download the ones specified in the dataset_botnet.py file. Thanks
@helmoai Sorry that we currently don't have an official mini dataset for quick testing. Could you download the data and take out a subset (e.g. a few graphs) to run the mini-test? Otherwise I could generate a smaller subset from one of the datasets for you.
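If it helps, a minimal sketch of carving out such a subset with h5py; the file names and graph ids are placeholders, and other dataset-level attributes would need updating similarly:
import h5py

with h5py.File('botnet_full.hdf5', 'r') as src, h5py.File('botnet_mini.hdf5', 'w') as dst:
    ids = ['0', '1', '2']                          # keep just a few graphs
    for gid in ids:
        src.copy(src[gid], dst, name=gid)          # copies the group's datasets and attributes
    dst.attrs['num_graphs'] = len(ids)             # update the dataset-level statistic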
Hi, we have detailed how the large graph datasets are stored in the unified hdf5 graph data format we use here. The pyg, dgl, nx, or dict graph objects are created on the fly when the graphs are loaded (check the source code here).
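To illustrate what that on-the-fly construction looks like, here is a minimal sketch for the nx case; the file name is a placeholder, and this follows the storage format above rather than the repo's actual loader:
import h5py
import networkx as nx

with h5py.File('filename', 'r') as f:
    edge_index = f['0']['edge_index'][()]          # shape (2, num_edges)
    g = nx.Graph()
    g.add_nodes_from(range(f['0'].attrs['num_nodes']))
    g.add_edges_from(zip(edge_index[0], edge_index[1]))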
Found an issue in the code you gave to read the hdf5 files here: I think you missed the h5py.File call when opening the file. It should be:
import h5py
with h5py.File('filename', "r") as f:
e = f['0']['edge_index'][()] # take out the edge indexes from the first graph with id '0'
num_nodes = f['0'].attrs['num_nodes'] # access the statistics stored in attributes of the first graph with id '0'
num_graphs = f.attrs['num_graphs'] # access the statistics stored in attributes of the dataset file
@helmoai yes you are right. Thanks for pointing it out! Updated it.
In scatter_ of common.py, the call with arguments (src, index, 0, out, dim_size, fill_value) passes 6 parameters, but the function signature only accepts 2 to 5 parameters.
I've been implementing this on a different network dataset and noticed a few gotchas related to the dummy node features discussed above. If you use the botgen/ code to generate your data, it already adds the dummy vector, so setting add_nfeat_ones=True to add it again at training time causes an error. Additionally, the botgen code does not add is_directed or self_directed to the data, so you will need to do that manually.
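For reference, a minimal sketch of adding those flags manually with h5py; treating them as per-graph HDF5 attributes, and the values used, are assumptions for illustration only, so check how the loader actually reads them in your version of the code:
import h5py

with h5py.File('mydata.hdf5', 'a') as f:           # hypothetical botgen-generated file
    for gid in f.keys():
        f[gid].attrs['is_directed'] = False        # assumed value: undirected graphs
        f[gid].attrs['self_directed'] = False      # assumed value: no self loops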
Could you give a detailed explanation of the hdf5 instance format for the pyg, dgl, nx, or dict graph objects?