marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Custom dataset preprocessing gives same num_edges and num_train #130

Closed lwwlwwl closed 1 year ago

lwwlwwl commented 1 year ago

Hi, I'm following the custom dataset node classification tutorial and trying to preprocess the Friendster dataset, but it gives the following dataset.yaml (the process is OOM-killed on a 312 GB RAM machine before it finishes, so the last few fields are still -1):

dataset_dir: ~/raw/
num_edges: 1806067135
num_nodes: 65608366
num_relations: 1
num_train: 1806067135
num_valid: -1
num_test: -1
node_feature_dim: -1
rel_feature_dim: -1
num_classes: -1
initialized: false

I reused the preprocess() function given in the tutorial. The raw files used are:

I am wondering why num_train = num_edges in this case. Thanks!

rogerwaleffe commented 1 year ago

Thanks for the question! I'm not immediately sure why you are seeing this behavior. The fact that num_train = num_edges doesn't worry me too much, because the edges are processed first and num_train is set to num_edges by default. The num_train value is normally updated at the end of preprocessing for node classification, but I'm guessing you hit the OOM before that happens, hence the output above.

In terms of memory usage, ~65.6M nodes with 256 float32 features should only be about 67 GB (65,608,366 × 256 × 4 bytes), so there should be sufficient room on your machine. I'm guessing some file is not being processed/read in as expected?

Can you print the shapes of train_nodes/valid_nodes/test_nodes/features/labels and send the program output so we can pinpoint what is being read in and where the OOM is occurring? Also, I take it your dataset download function does nothing if you already have the files?

lwwlwwl commented 1 year ago

Thanks for your reply. Yes, I commented out the call to download() in the main function. Here are the first few lines of my preprocess() and the corresponding terminal output. It seems that it went OOM when reading the features.

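# read the node ID splits, then features and labels, from CSV;
# np.genfromtxt parses via intermediate Python objects and holds everything in memory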
train_nodes = np.genfromtxt(self.input_train_nodes_file, delimiter=",").astype(np.int32)
valid_nodes = np.genfromtxt(self.input_valid_nodes_file, delimiter=",").astype(np.int32)
test_nodes = np.genfromtxt(self.input_test_nodes_file, delimiter=",").astype(np.int32)
print('shape of train_nodes: ', train_nodes.shape)
print('shape of valid_nodes: ', valid_nodes.shape)
print('shape of test_nodes: ', test_nodes.shape)
features = np.genfromtxt(self.input_node_feature_file, delimiter=",").astype(np.float32)
print('shape of features: ', features.shape)
labels = np.genfromtxt(self.input_node_label_file, delimiter=",").astype(np.int32)
print('shape of labels: ', labels.shape)
~$ python3 preprocess.py 
shape of train_nodes:  (52486692,)
shape of valid_nodes:  (6560836,)
shape of test_nodes:  (6560838,)
Killed

rogerwaleffe commented 1 year ago

Great, thanks! I would double-check your node-feat.csv file to make sure it's as expected. Otherwise, it's possible that genfromtxt is using a lot of memory for some reason. Since your features are all zeros, you could try setting features = np.zeros((65608366, 256), dtype=np.float32) and see if everything else goes through.
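
For what it's worth, np.genfromtxt parses through intermediate Python objects and can need many times the final array size in RAM. A lower-memory sketch, assuming node-feat.csv is plain comma-separated numbers with no header, would be something like:

import numpy as np
import pandas as pd

# sketch: pandas' C parser has much less overhead than np.genfromtxt,
# and dtype=np.float32 avoids a float64 intermediate copy
features = pd.read_csv("node-feat.csv", header=None, dtype=np.float32).to_numpy()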

lwwlwwl commented 1 year ago

Manually setting the features works around that, but it still gave an OOM after reading all the files. The terminal output is as follows:

shape of features:  (65608366, 256)
shape of labels:  (65608366,)
shape of train_nodes:  (52486692,)
shape of valid_nodes:  (6560836,)
shape of test_nodes:  (6560838,)
Killed

rogerwaleffe commented 1 year ago

Okay, sounds good. Can you share the full code of the preprocess function with the print statements and the error message, so we can get a bit more context for where the new error/OOM is occurring?

It's possible that the error is now occurring in converter.convert(), since I believe you are just passing train_edges=self.input_edge_list_file to the constructor. This causes the converter to use pandas with string dtypes, which we have seen lead to very high memory allocation in the past. If this is the issue, we may be able to fix it by first reading the edges manually in the preprocess function, converting them to a [num_edges, 2] numpy array, and then passing that array as train_edges to the converter, along the lines sketched below.
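
A minimal sketch of that idea (adjust the separator and any header rows to match your raw file):

import numpy as np
import pandas as pd

# sketch: parse the edge list straight into int32 pairs and hand the
# [num_edges, 2] array (not the file path) to TorchEdgeListConverter
df = pd.read_csv(self.input_edge_list_file, header=None, dtype=np.int32)  # e.g. sep="," or delim_whitespace=True
train_edges = df.to_numpy(dtype=np.int32)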

lwwlwwl commented 1 year ago

I changed train_edges=self.input_edge_list_file to train_edges = np.genfromtxt(self.input_edge_list_file, delimiter=",").astype(np.int32) but am still seeing an OOM. Here is the full code and the new terminal output, both with and without the splits specification.

shape of train_nodes:  (52486692,)
shape of valid_nodes:  (6560836,)
shape of test_nodes:  (6560838,)
shape of features:  (65608366, 256)
shape of labels:  (65608366,)
Killed

rogerwaleffe commented 1 year ago

Thanks! I looked at the code and tried to run it myself, but I was having issues with np.genfromtxt for the train edges. I switched to pandas, which seemed considerably faster, and I didn't have an OOM issue.

There is another potential problem with the current preprocessing, however: the node IDs. The train edges file has 1806067135 edges with 65608366 unique node IDs, but the IDs themselves are not contiguous from 0 to 65608366-1: the min node ID is 101 and the max is 124836179. As such, unless the train/valid/test node ID lists correspond to actual node IDs in the train edges file (call it case A; you should ensure this holds if you would like to use preprocessing functionality in our converter like partitioning/sequential_train_nodes), the preprocessing can run into issues. For example, setting the train nodes to IDs from 0 to 52486692-1 (call it case B) will cause problems, because node IDs 0, 1, 2, etc. don't actually exist in the train edges.
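
You can verify this quickly; a diagnostic sketch, assuming train_edges is the [num_edges, 2] int32 array:

import numpy as np

# sketch: check whether the raw node IDs are contiguous from 0..num_nodes-1
node_ids = np.unique(train_edges)       # sorted unique IDs over both columns
print(node_ids.size)                    # 65608366 unique nodes
print(node_ids.min(), node_ids.max())   # 101 and 124836179, so not contiguous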

What you really want to do for this dataset is remap the node IDs in the train edges so that they are in fact indexed from 0 to 65608366-1. Our converter will do this correctly, so long as it is not told via the known_node_ids argument that there are additional unique node IDs to worry about. known_node_ids=[train_nodes, valid_nodes, test_nodes] should be passed if case A (above) holds; if case B is what you have, then known_node_ids=None should work. After the train_edges have been remapped, it is then sufficient for case B to set the train nodes to IDs from 0 to 52486692-1; remapping these train nodes is no longer required for case B, since the remapping has been done implicitly. For case A, our remap_nodes function will handle remapping the train_nodes etc. to match how the edges were mapped.
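
To summarize the two configurations (a comment sketch, not verbatim converter code):

# Case A: the train/valid/test lists hold raw node IDs that occur in the edges.
#   Tell the converter about them so its ID remap covers every node, and
#   remap the node lists afterwards to match the edge mapping:
#     known_node_ids=[train_nodes, valid_nodes, test_nodes]
#     train_nodes, ... = remap_nodes(node_mapping, train_nodes, ...)
#
# Case B: no real node ID lists. Let the converter remap the edges alone,
#   then simply take the first num_train remapped IDs as the train nodes:
#     known_node_ids=None
#     train_nodes = np.arange(num_train, dtype=np.int32)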

Below is a preprocess function that I used to successfully preprocess Friendster. It should work as is for in-memory training, as it assumes case B. You can modify it based on the above if you would like to switch to case A. Hope this addresses your issue!

def preprocess(self, num_partitions=1, remap_ids=True, splits=None, sequential_train_nodes=False, partitioned_eval=False):
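    # assumes module-level imports: numpy as np, pandas as pd, Path (pathlib),
    # OmegaConf (omegaconf), and the Marius preprocessing helpers
    # TorchEdgeListConverter, PathConstants, and remap_nodes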
    num_nodes = 65608366
    num_train = 52486692
    num_valid = 6560836
    num_test = 6560838
    train_nodes = np.arange(num_train, dtype=np.int32)
    valid_nodes = np.arange(num_train, num_train+num_valid, dtype=np.int32)
    test_nodes = np.arange(num_train+num_valid, num_train+num_valid+num_test, dtype=np.int32)
    print('shape of train_nodes: ', train_nodes.shape)
    print('shape of valid_nodes: ', valid_nodes.shape)
    print('shape of test_nodes: ', test_nodes.shape)

    features = np.zeros((num_nodes, 256), dtype=np.float32)
    print('shape of features: ', features.shape)
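    # assumption: random placeholder labels, enough to exercise preprocessing;
    # substitute the real labels file for actual training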
    labels = np.random.randint(0, 32, (num_nodes, ), dtype=np.int32)
    print('shape of labels: ', labels.shape)

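    # the raw edge list used here has a 4-line header and whitespace-separated
    # node ID pairs, hence skiprows=4 and delim_whitespace=True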
    df = pd.read_csv(self.input_edge_list_file, dtype=np.int32, skiprows=4, header=None, delim_whitespace=True)
    train_edges = df.to_numpy(dtype=np.int32)
    print('shape of train_edges: ', train_edges.shape)

    # Calling the convert function to generate the preprocessed files
    converter = TorchEdgeListConverter(
        output_dir=self.output_directory,
        train_edges=train_edges,
        num_partitions=num_partitions,
        remap_ids=remap_ids, # remap_ids here should be true
        sequential_train_nodes=sequential_train_nodes,
        format="numpy",
        known_node_ids=None, # modify if needed based on case A/B
        partitioned_evaluation=partitioned_eval,
    )
    dataset_stats = converter.convert()

    # uncomment and use this if you are in case A
    # if remap_ids:
    #     node_mapping = np.genfromtxt(self.output_directory / Path(PathConstants.node_mapping_path), delimiter=",")
    #     train_nodes, valid_nodes, test_nodes, features, labels = remap_nodes(
    #         node_mapping, train_nodes, valid_nodes, test_nodes, features, labels
    #     )

    # Writing the remapped files as bin files
    with open(self.train_nodes_file, "wb") as f:
        f.write(bytes(train_nodes))
    with open(self.valid_nodes_file, "wb") as f:
        f.write(bytes(valid_nodes))
    with open(self.test_nodes_file, "wb") as f:
        f.write(bytes(test_nodes))
    with open(self.node_features_file, "wb") as f:
        f.write(bytes(features))
    with open(self.node_labels_file, "wb") as f:
        f.write(bytes(labels))

    # update dataset yaml
    dataset_stats.num_train = train_nodes.shape[0]
    dataset_stats.num_valid = valid_nodes.shape[0]
    dataset_stats.num_test = test_nodes.shape[0]
    dataset_stats.node_feature_dim = features.shape[1]
    dataset_stats.num_classes = 32

    dataset_stats.num_nodes = dataset_stats.num_train + dataset_stats.num_valid + dataset_stats.num_test

    with open(self.output_directory / Path("dataset.yaml"), "w") as f:
        yaml_file = OmegaConf.to_yaml(dataset_stats)
        f.writelines(yaml_file)

    return dataset_stats

lwwlwwl commented 1 year ago

Thanks! This works.