qitianwu / NodeFormer

The official implementation of NeurIPS22 spotlight paper "NodeFormer: A Scalable Graph Structure Learning Transformer for Node Classification"

Errors occur while calling NodePropPredDataset in dataset.py #8

Closed LuozyCS closed 1 year ago

LuozyCS commented 1 year ago

I can run your code correctly on the small datasets using the scripts in run.sh and get results similar to the paper, but when I try to reproduce NodeFormer on the large-graph datasets, an error comes up on both amazon2m and ogbn-proteins.

Traceback (most recent call last):
  File "main-batch.py", line 43, in <module>
    dataset = load_dataset(args.data_dir, args.dataset, args.sub_dataset)
  File "/home/workspace/NF/NodeFormer/dataset.py", line 102, in load_dataset
    dataset = load_amazon2m_dataset(data_dir)
  File "/home/workspace/NF/NodeFormer/dataset.py", line 308, in load_amazon2m_dataset
    ogb_dataset = NodePropPredDataset(name='ogbn-products', root=f'{data_dir}/ogb')
  File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/ogb/nodeproppred/dataset.py", line 63, in __init__
    self.pre_process()
  File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/ogb/nodeproppred/dataset.py", line 111, in pre_process
    additional_node_files = self.meta_info['additional node files'].split(',')
AttributeError: 'float' object has no attribute 'split'

This error occurs in dataset.py while calling the NodePropPredDataset function: https://github.com/qitianwu/NodeFormer/blob/64d26581f571340ab750ce6e60a8bb524e22e726/dataset.py#L290 and https://github.com/qitianwu/NodeFormer/blob/64d26581f571340ab750ce6e60a8bb524e22e726/dataset.py#L306

I tried to fix this error by going into the NodePropPredDataset implementation and casting the 'float' object to 'str':

additional_node_files = str(self.meta_info['additional node files']).split(',')
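
In hindsight, the blanket cast is risky: pandas parses an empty metadata cell in master.csv as float NaN, and str(float('nan')) yields the literal string 'nan', which is exactly the filename that shows up in the traceback below. A more defensive version of the same patch (a sketch only, not the official ogb fix) would skip missing fields instead of casting:

    # Hypothetical defensive patch to ogb's pre_process (sketch only): skip the
    # field when pandas parsed the empty master.csv cell as float NaN, rather
    # than casting NaN to the string 'nan'.
    field = self.meta_info['additional node files']
    if isinstance(field, str):
        additional_node_files = field.split(',')
    else:  # NaN, i.e. the dataset declares no additional node files
        additional_node_files = []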

The cast got past the AttributeError, but then another error came up:

Loading necessary files...
This might take a while.
Traceback (most recent call last):
  File "main-batch.py", line 43, in <module>
    dataset = load_dataset(args.data_dir, args.dataset, args.sub_dataset)
  File "/home/workspace/NF/NodeFormer/dataset.py", line 98, in load_dataset
    dataset = load_proteins_dataset(data_dir)
  File "/home/workspace/NF/NodeFormer/dataset.py", line 268, in load_proteins_dataset
    ogb_dataset = NodePropPredDataset(name='ogbn-proteins', root=f'{data_dir}/ogb')
  File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/ogb/nodeproppred/dataset.py", line 63, in __init__
    self.pre_process()
  File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/ogb/nodeproppred/dataset.py", line 137, in pre_process
    self.graph = read_csv_graph_raw(raw_dir, add_inverse_edge = add_inverse_edge, additional_node_files = additional_node_files, additional_edge_files = additional_edge_files)[0] # only a single graph
  File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/ogb/io/read_graph_raw.py", line 83, in read_csv_graph_raw
    temp = pd.read_csv(osp.join(raw_dir, additional_file + '.csv.gz'), compression='gzip', header = None).values
  File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
    self.handles = get_handle(
  File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/pandas/io/common.py", line 753, in get_handle
    handle = gzip.GzipFile(  # type: ignore[assignment]
  File "/root/anaconda3/envs/nodeformer/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '../data//ogb/ogbn_proteins/raw/nan.csv.gz'

I don't know what's happening... If you need more information, please let me know.


qitianwu commented 1 year ago

Hi, have you checked that the environment packages are all consistent with requirement.txt? And did you use your own preprocessed ogb data or the data downloaded by the code? The error suggests that the version of the loaded ogb dataset is inconsistent with the environment where the code is running.
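
For reference, a minimal check of the versions actually active in the environment (assuming the nodeformer conda env is activated) might be:

    # Quick sanity check of the installed versions; the thread's working
    # setup reports ogb 1.3.1 (see the warning quoted below).
    import ogb, torch
    print(ogb.__version__)
    print(torch.__version__)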

LuozyCS commented 1 year ago
  1. I ran the code in Anaconda and used conda list to check that the environment packages are all consistent with requirement.txt. But the terminal warned me:

    WARNING:root:The OGB package is out of date. Your version is 1.3.1, while the latest version is 1.3.6.

    Is this what you mean by 'the version of the loaded ogb dataset is inconsistent with the environment where the code is running'?

  2. I used the dataset downloaded by the code.

qitianwu commented 1 year ago

I see. Have you tried removing the preprocessed files under the ogb dataset folder and running the code again? It could be caused by an inconsistent version of the preprocessed files.

Also, you can check the correctness of the dataset path '../data//ogb/ogbn_proteins/raw/nan.csv.gz'. It seems there is a redundant '/'.
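
As a side note, the doubled separator comes from data_dir ending in '/' while dataset.py appends '/ogb' via an f-string. A sketch of a join-based construction that avoids it (hypothetical cleanup; on Linux the extra '/' itself is usually harmless):

    from ogb.nodeproppred import NodePropPredDataset
    import os.path as osp

    # osp.join collapses the duplicate separator that f'{data_dir}/ogb'
    # produces when data_dir already ends with '/' (e.g. data_dir='../data/').
    ogb_dataset = NodePropPredDataset(name='ogbn-proteins', root=osp.join(data_dir, 'ogb'))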

LuozyCS commented 1 year ago

Thank you for your quick response.

  1. Do you mean only leaving the files in the raw folder? I tried removing the files in every folder except 'raw', but the same error still occurs (see below). By the way, I'm still using the fix mentioned at the beginning (casting the 'float' object to 'str' in the NodePropPredDataset function), and I still don't know if that's the right thing to do.

  2. About the redundant '/':

    WARNING:root:The OGB package is out of date. Your version is 1.3.1, while the latest version is 1.3.6.
    Namespace(K=5, M=50, batch_size=10000, cached=False, cpu=False, data_dir='../data/', dataset='ogbn-proteins', device=1, directed=False, dropout=0.0, epochs=1000, eval_step=9, gat_heads=8, gpr_alpha=0.1, hidden_channels=64, hops=1, jk_type='max', knn_num=5, label_num_per_class=20, lamda=0.1, lp_alpha=0.1, lr=0.01, method='nodeformer', metric='rocauc', model_dir='../model/', num_heads=1, num_layers=3, num_mlp_layers=1, out_heads=1, projection_matrix_type=True, protocol='semi', rand_split=False, rand_split_class=False, rb_order=1, rb_trans='identity', runs=5, save_model=False, seed=42, sub_dataset='', tau=0.25, train_prop=0.5, use_act=True, use_bn=True, use_gumbel=True, use_jk=True, use_residual=True, valid_prop=0.25, weight_decay=0.0)
    Downloading http://snap.stanford.edu/ogb/data/nodeproppred/proteins.zip
    Downloaded 0.21 GB: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 216/216 [00:24<00:00,  8.66it/s]
    Extracting ../data//ogb/proteins.zip
    Loading necessary files...
    This might take a while.
    Traceback (most recent call last):
    File "main-batch.py", line 43, in <module>
    dataset = load_dataset(args.data_dir, args.dataset, args.sub_dataset)
    File "/home/workspace/NF/NodeFormer/dataset.py", line 98, in load_dataset
    dataset = load_proteins_dataset(data_dir)
    File "/home/workspace/NF/NodeFormer/dataset.py", line 268, in load_proteins_dataset
    ogb_dataset = NodePropPredDataset(name='ogbn-proteins', root=f'{data_dir}/ogb')
    File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/ogb/nodeproppred/dataset.py", line 63, in __init__
    self.pre_process()
    File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/ogb/nodeproppred/dataset.py", line 135, in pre_process
    self.graph = read_csv_graph_raw(raw_dir, add_inverse_edge = add_inverse_edge, additional_node_files = additional_node_files, additional_edge_files = additional_edge_files)[0] # only a single graph
    File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/ogb/io/read_graph_raw.py", line 83, in read_csv_graph_raw
    temp = pd.read_csv(osp.join(raw_dir, additional_file + '.csv.gz'), compression='gzip', header = None).values
    File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
    File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
    File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
    File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
    self.handles = get_handle(
    File "/root/anaconda3/envs/nodeformer/lib/python3.8/site-packages/pandas/io/common.py", line 753, in get_handle
    handle = gzip.GzipFile(  # type: ignore[assignment]
    File "/root/anaconda3/envs/nodeformer/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
    FileNotFoundError: [Errno 2] No such file or directory: '../data//ogb/ogbn_proteins/raw/nan.csv.gz'

    The dataset path is correct for the small-graph datasets, and I didn't change the folder path when running the large-graph script, because the whole process is done automatically.

  3. Do you have any idea about the file named 'nan.csv.gz'? I can't find it in the dataset I downloaded.

qitianwu commented 1 year ago

Hi, sorry for the late response due to a paper submission deadline. I think there might be an issue with your ogb package version. I just checked the files under my data folder, and it does not contain nan.csv.gz:

./ogb/ogbn_proteins/raw

(screenshot of the raw folder contents)

./ogb/ogbn_proteins/processed

(screenshot of the processed folder contents)

LuozyCS commented 1 year ago

Thank you for your suggestion. I will dig further into what causes the bug and come back with an update once I've resolved the issue in a few days.

Good luck with your new paper!

qitianwu commented 1 year ago

Thank you and hopefully you can find the bug soon.

LuozyCS commented 1 year ago

I set up the environment again with ogb 1.3.1 and successfully ran your code on ogbn-proteins and amazon2m. It's weird that last time I set up on a WSL environment and failed with ogb 1.3.1, so this time I tried a native Linux one; I'm not sure whether that was the reason. Anyway, thanks for your suggestions.

By the way, I ran into a few troubles when setting up again. A suggestion: if someone has trouble installing torch_geometric/torch_sparse/torch_scatter, try installing them in order, torch_scatter==2.0.7 first, then torch_sparse==0.6.10, then torch_geometric==1.7.2, as shown below.
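
For example (version pins taken from the comment above; depending on your torch/CUDA build you may also need to point pip at the matching PyG wheel index with its -f flag):

    pip install torch_scatter==2.0.7
    pip install torch_sparse==0.6.10
    pip install torch_geometric==1.7.2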

Again, thank you for your help!

qitianwu commented 1 year ago

Glad to hear that you resolved the issue! Indeed, the installation of the PyG package depends on torch_scatter and torch_sparse, and the versions of these packages should stay strictly consistent; otherwise there can be some weird bugs.