snap-stanford / GraphGym

Platform for designing and evaluating Graph Neural Networks (GNN)

Loading the fixed data split from the original PyTorch Geometric dataset (masks) #24

Closed Ojda22 closed 3 years ago

Ojda22 commented 3 years ago

Hello,

I am trying to load a dataset while keeping its existing split, since the masks already exist.

I realized there exists an argument that controls this:

@staticmethod
    def pyg_to_graphs(
        dataset,
        verbose: bool = False,
        fixed_split: bool = False,
        tensor_backend: bool = False,
        netlib=None,
    ) -> List[Graph]:
        r"""
        Transform a :class:`torch_geometric.data.Dataset` object to a
        list of :class:`deepsnap.graph.Graph` objects.

        Args:
            dataset (:class:`torch_geometric.data.Dataset`): A 
                :class:`torch_geometric.data.Dataset` object that will be 
                transformed to a list of :class:`deepsnap.graph.Graph` 
                objects.
            verbose (bool): Whether to print information such as warnings.
            fixed_split (bool): Whether to load the fixed data split from 
                the original PyTorch Geometric dataset.
            tensor_backend (bool): `True` will use pure tensors for graphs.
            netlib (types.ModuleType, optional): The graph backend module.
                Currently DeepSNAP supports NetworkX and SnapX (for
                SnapX, only undirected homogeneous graphs) as graph
                backends. The default backend is NetworkX.

        Returns:
            list: A list of :class:`deepsnap.graph.Graph` objects.
        """
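For reference, a minimal call sketch of how this argument would be used (a hedged illustration: it assumes deepsnap and torch_geometric are installed and that `GraphDataset.pyg_to_graphs` is the entry point shown in the docstring above; the dataset root path is illustrative):

```python
# Hedged sketch: convert a PyG Planetoid dataset with fixed_split=True.
# Guarded so it degrades gracefully when the libraries are missing or
# the dataset download fails in this environment.
try:
    from torch_geometric.datasets import Planetoid
    from deepsnap.dataset import GraphDataset

    pyg_dataset = Planetoid(root="./cora_data", name="Cora")  # illustrative root
    graphs = GraphDataset.pyg_to_graphs(pyg_dataset, fixed_split=True)
except Exception:
    graphs = None  # libraries unavailable or download failed
```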

However, when I run it, it always raises errors. I'm not sure whether this feature is fully implemented or still a work in progress.

I would appreciate your help and instructions on how I can accomplish this. Best,

JiaxuanYou commented 3 years ago

Could you post more information on what you tried and what errors were returned? Thanks.

Ojda22 commented 3 years ago

I ran main.py --cfg configs/example_cpu.yaml --repeat 3

By adding fixed_split=True here https://github.com/snap-stanford/GraphGym/blob/d207269ae0fbb3493fdb2f1029a96cf8b17a4849/graphgym/loader.py#L74

in order to keep the fixed split of the Cora dataset.

Here is the stacktrace:

  File "~/GraphGym/run/main.py", line 42, in <module>
    datasets = create_dataset()
  File "~/GraphGym/graphgym/loader.py", line 226, in create_dataset
    datasets = dataset.split(
  File "~/anaconda3/envs/graph/lib/python3.8/site-packages/deepsnap/dataset.py", line 1079, in split
    self._split_transductive(
  File "~/anaconda3/envs/graph/lib/python3.8/site-packages/deepsnap/dataset.py", line 735, in _split_transductive
    split_graph = graph.split(
  File "~/anaconda3/envs/graph/lib/python3.8/site-packages/deepsnap/graph.py", line 1182, in split
    return self._split_node(split_ratio, shuffle=shuffle)
  File "~/anaconda3/envs/graph/lib/python3.8/site-packages/deepsnap/graph.py", line 1258, in _split_node
    graph_new.node_label = self.node_label[nodes_split_i]
IndexError: index 273 is out of bounds for dimension 0 with size 140
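The shape mismatch in the last frame can be reproduced in miniature. This is one reading of the error, not DeepSNAP's actual code: node_label appears to hold only the 140 labels of Cora's fixed training split, while the random splitter indexes it with node ids drawn from the full graph:

```python
# Minimal pure-Python sketch of the IndexError above (an interpretation,
# not DeepSNAP code): labels exist only for the 140 train nodes, but the
# splitter indexes with ids from the whole 2708-node graph.
node_label = list(range(140))        # size 140, as in the error message
nodes_split_i = [5, 273, 901]        # ids from the full graph; 273 > 139
try:
    picked = [node_label[i] for i in nodes_split_i]
except IndexError:
    picked = None  # index 273 is out of bounds for a size-140 list
print(picked)  # None
```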

I'm not sure if this is the right way to go, or if I'm missing something.

JiaxuanYou commented 3 years ago

Hi, if you set fixed_split=False, do you encounter the same error? That would help me locate the bug, thanks!

Ojda22 commented 3 years ago

Hello

With fixed_split=False (the default) it works.

However, it doesn't work as expected: when running GraphGym/run/main.py, the existing masks appear to be ignored. Running the Cora dataset through GraphGym/run/main_pyg.py, on the other hand, the split does seem to be performed according to the masks (for those who want to keep the fixed splits).

JiaxuanYou commented 3 years ago

Hello. What you describe is correct; thanks for the summary.

By default, GraphGym uses the DeepSNAP backend (example in main.py), which assumes random splitting of datasets. When PyG datasets are loaded through the DeepSNAP backend, their fixed splits are discarded; that is DeepSNAP's default behavior.

I recently created a PyG backend (example in main_pyg.py), which adopts the fixed split in the PyG datasets. That backend is PyG native and does not perform conversion to DeepSNAP format.

Hope this further clarifies things. Please let me know if you need more help.