harvardnlp / botnet-detection

Topological botnet detection datasets and graph neural network applications
MIT License

Could you provide the input data format? #2

Open velpc opened 3 years ago

velpc commented 3 years ago

Could you provide a detailed explanation of the HDF5 instance format, and of the pyg, dgl, nx, or dict graph objects?

jzhou316 commented 3 years ago

Hi, we have documented how the large graph datasets are stored in the unified HDF5 graph data format we use here. The pyg, dgl, nx, or dict graph objects are created when the graphs are loaded (check the source code here).
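
For a quick illustration, loading one split in a chosen format looks roughly like the sketch below (the BotnetDataset class and its name/split/graph_format arguments follow the repo README; treat the exact signature as an assumption to verify against the source):

from botdet.data.dataset_botnet import BotnetDataset

# Sketch: load the 'chord' training split as PyTorch Geometric graphs.
train_set = BotnetDataset(name='chord', split='train', graph_format='pyg')
graph = train_set[0]  # with graph_format='pyg', each item is a torch_geometric Data object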

velpc commented 3 years ago

Thank you so much! They are very timely and helpful. Can you provide information on how to generate an xx_split_idx.pkl file from the dataset, and on its storage format?

jzhou316 commented 3 years ago

The xx_split_idx.pkl file stores the indices used to split the original large graph dataset in the HDF5 format into train/validation/test sets. It is a dictionary with keys "train", "val", and "test", where each value is a list of graph id numbers in the corresponding subset. We use it to split the original single graph dataset into separate storage for train/validation/test during preprocessing, as here. For our dataset, these splits are randomly generated from the total number of graphs with an 8:1:1 train/validation/test ratio, and fixed thereafter for community use.
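
For concreteness, here is a minimal sketch of how such a split file could be generated (the 8:1:1 ratio and the dictionary layout follow the description above; num_graphs and the file name are placeholders):

import pickle
import random

num_graphs = 960  # placeholder: total number of graphs in the HDF5 dataset
random.seed(0)    # fix the seed so the generated split stays fixed thereafter

ids = list(range(num_graphs))
random.shuffle(ids)

n_train = int(0.8 * num_graphs)
n_val = int(0.1 * num_graphs)
split_idx = {
    'train': ids[:n_train],
    'val': ids[n_train:n_train + n_val],
    'test': ids[n_train + n_val:],
}

with open('xx_split_idx.pkl', 'wb') as f:
    pickle.dump(split_idx, f)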

jackd commented 3 years ago

Thanks for the clean datasets! One issue I have regarding the data specification:

graph_data_storage.md specifies x as node signals/features, but I can't find these in any of the hdf5 files. Furthermore, README.md suggests these are featureless graphs. Can you clarify?

jzhou316 commented 3 years ago

Yes, x stores the node features. As our graphs are featureless, we do not include them in the raw data. However, the GNN algorithms need some values to operate with in order to propagate information through the topology, so we simply add a dummy all-one vector as x (this is done in our data processing when the dataset is constructed from the raw data). This means all nodes are treated homogeneously, and we focus on learning purely from the topology for our botnet dataset.

Also note that the data format we created for large graph datasets can easily be extended with other graph attributes depending on your problem. Hope this is helpful!
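
As an illustration, the dummy feature assignment amounts to something like this sketch (not the repo's exact preprocessing code; num_nodes is a placeholder for a graph's node count):

import torch

num_nodes = 1000              # placeholder: node count of one graph
x = torch.ones(num_nodes, 1)  # all-one dummy features so GNN layers have values to propagate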

jackd commented 3 years ago

Thanks @jzhou316, that makes perfect sense - but it might be a nice addition to graph_data_storage.md :).

jzhou316 commented 3 years ago

Cool, I'll add some details.

velpc commented 3 years ago

The detailed instructions are very helpful. How should we set num_evils and num_evils_avg if our problem is multi-class classification rather than binary classification (evil/non-evil)?

jzhou316 commented 3 years ago

@velpc These are dataset statistics stored in the HDF5 file (and may not be used by the model). For a different problem such as multi-class classification, you can write your own data following our format, with whatever dataset attributes fit your task. For example, you could have attributes such as "num_class_0", "num_class_1", "num_class_2", etc. to describe the dataset. We have some example code for writing these attributes here. Hope this answers your question!
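
A minimal h5py sketch of writing such attributes (the per-class attribute names mirror the illustrative "num_class_0" example above; the file name and counts are placeholders):

import h5py

with h5py.File('my_dataset.hdf5', 'w') as f:
    g = f.create_group('0')        # graph with id '0'
    g.attrs['num_nodes'] = 1000    # placeholder per-graph statistic
    g.attrs['num_class_0'] = 800   # illustrative multi-class counts
    g.attrs['num_class_1'] = 150
    g.attrs['num_class_2'] = 50
    f.attrs['num_graphs'] = 1      # dataset-level attribute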

iohelder commented 3 years ago

Hi @jzhou316, is there a much smaller dataset that can be used for quick testing of the algorithm? I wanted to try it out on a smaller subset without having to download the ones specified in the dataset_botnet.py file. Thanks

jzhou316 commented 3 years ago

@helmoai Sorry, we currently don't have an official mini dataset for quick testing. Could you download the data and take out a subset (e.g., a few graphs) to run a mini test? Otherwise I could generate a smaller subset from one of the datasets for you.
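
In the meantime, carving a few graphs out of a downloaded file into a mini HDF5 file could look like this sketch (the file names are placeholders; group and attribute names follow the format described above, and h5py's Group.copy also carries over each graph's attributes):

import h5py

n_mini = 5  # number of graphs to keep for quick testing
with h5py.File('chord_train.hdf5', 'r') as src, h5py.File('chord_mini.hdf5', 'w') as dst:
    for i in range(n_mini):
        src.copy(src[str(i)], dst, name=str(i))  # copies the group, its datasets, and its attributes
    dst.attrs['num_graphs'] = n_mini             # keep the dataset-level statistic consistent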

iohelder commented 3 years ago

> Hi, we have documented how the large graph datasets are stored in the unified HDF5 graph data format we use here. The pyg, dgl, nx, or dict graph objects are created when the graphs are loaded (check the source code here).

I found an issue in the code you gave for reading the HDF5 files here: I think you missed h5py.File when opening the file. It should be:

import h5py

with h5py.File('filename', 'r') as f:
    e = f['0']['edge_index'][()]             # read the edge index array of the graph with id '0'
    num_nodes = f['0'].attrs['num_nodes']    # per-graph statistics stored as group attributes
    num_graphs = f.attrs['num_graphs']       # dataset-level statistics stored as file attributes

jzhou316 commented 3 years ago

@helmoai Yes, you are right. Thanks for pointing it out! Updated.

whxuexi commented 2 years ago

In scatter_ of common.py, the call passes six arguments (src, index, 0, out, dim_size, fill_value), but the function only accepts 2 to 5 positional arguments.
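
If this comes from the torch_scatter 2.x API change (the fill_value argument was removed, so calls passing six positional arguments fail), the newer call would look roughly like this sketch (placeholder tensors; 'sum' stands in for whatever reduction common.py actually uses):

import torch
from torch_scatter import scatter  # torch_scatter >= 2.0: no fill_value argument

src = torch.ones(6, 4)                     # placeholder: messages from 6 edges
index = torch.tensor([0, 0, 1, 2, 2, 3])   # placeholder: destination node of each edge
out = scatter(src, index, dim=0, dim_size=5, reduce='sum')  # per-node aggregation, shape (5, 4)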

tillson commented 11 months ago

> Yes, x stores the node features. As our graphs are featureless, we do not include them in the raw data. However, the GNN algorithms need some values to operate with in order to propagate information through the topology, so we simply add a dummy all-one vector as x (this is done in our data processing when the dataset is constructed from the raw data). This means all nodes are treated homogeneously, and we focus on learning purely from the topology for our botnet dataset.
>
> Also note that the data format we created for large graph datasets can easily be extended with other graph attributes depending on your problem. Hope this is helpful!

I've been implementing this on a different network dataset and noticed a few gotchas related to this. If you use the botgen/ code to generate your data, it already adds the dummy vector, so setting add_nfeat_ones=True to add it again at training time causes an error. Additionally, the botgen code does not add is_directed or self_directed to the data, so you will need to set those manually.
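
For the second gotcha, setting those attributes manually on a generated file could look like this sketch (is_directed and self_directed follow the names in the comment above; the file name and boolean values are placeholders for your own graphs):

import h5py

with h5py.File('generated.hdf5', 'r+') as f:  # placeholder file name
    for gid in f.keys():                      # one group per graph id
        f[gid].attrs['is_directed'] = False   # placeholder: set according to your graphs
        f[gid].attrs['self_directed'] = False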