Open iaaka opened 6 months ago
Thank you for pointing out this 'integer multiplication overflow' error. It seems the results occupy too much memory. Could you please try with more memory to see if that resolves it? Also, may I know how many nodes and edges are in your input graph?
Btw, you can manually set the seeds using these two parameters when running the model: --manual_seed and --seed. See https://github.com/schwartzlab-methods/NEST/blob/main/vignette/workflow.md#run-nest-to-generate-ccc-list
Just to elaborate a bit: in my experience the majority of runs fail, some with the error mentioned above, some with another one:
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/usr/local/lib/python3.10/site-packages/torch/nn/functional.py:1956: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
Traceback (most recent call last):
File "/nfs/cellgeni/pasham/code/nest/NEST/run_NEST.py", line 74, in
To get 5 runs to finish successfully, I had to launch it about 40 times.
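The first warning in the log above is emitted through PyTorch's `incompatible_device_warn` template, which is used when the GPU's compute capability is not covered by the architectures the installed PyTorch build was compiled for. Whether this is connected to the failing runs is not established in this thread, but it is cheap to check. A minimal sketch follows; the helper names `capability_supported` and `report_current_gpu` are made up for illustration, and the same-major-version rule only approximates PyTorch's own check.

```python
from typing import Iterable, Tuple


def capability_supported(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if a compiled 'sm_XY' arch shares the GPU's major version.

    Assumption: binaries are treated as compatible within one major
    compute-capability version, which only approximates PyTorch's own check.
    """
    major, _minor = capability
    compiled_majors = set()
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX-only entries such as 'compute_80'
        digits = "".join(ch for ch in arch[3:] if ch.isdigit())
        if digits:
            compiled_majors.add(int(digits) // 10)
    return major in compiled_majors


def report_current_gpu() -> None:
    """Print the check for GPU 0; needs a CUDA-enabled PyTorch install."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        print("CUDA is not available in this environment")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("device:", torch.cuda.get_device_name(0))
    print("capability:", capability, "compiled for:", arch_list)
    print("supported:", capability_supported(capability, arch_list))
```

Calling `report_current_gpu()` in the same environment that runs `run_NEST.py` would show whether the build covers the GPU. If it does not, installing a PyTorch build that matches the GPU would be worth trying before adjusting memory settings.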
Hi @anne04, thank you for the reply. The GPU I used has 80G of memory. I don't think we have a large one; the graph has 405794 edges. I actually set the seed to 1 in all runs, but just realized that I probably also have to set --manual_seed=yes for this to have an effect? That is not clear from the manual.
An 80GB GPU sounds sufficient for 405794 edges. I can run about [5000 nodes + about 1 million edges] with one 32GB GPU and 16 CPUs, each having 30GB of memory. I am not sure whether that code segment in /usr/local/lib/python3.10/site-packages/torch_geometric/utils/loop.py uses CPU memory or GPU memory. Therefore, I would suggest allocating more CPU memory as well, if you have not allocated enough.
Yes, we have to use --manual_seed='yes'. I apologize for not making that clear in the tutorial. I have now added that parameter to the vignette.
Also, setting the same seed for all runs is not recommended. If one run fails to find a good minimum, then all of them will fail, since they all have the same seed. That is why we set different seeds for different runs and then ensemble them in the post-processing step to maximize the chance of finding a good minimum.
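The advice above can be sketched as a small launcher that gives every run its own seed. This is only an illustration: `--seed` and `--manual_seed` come from this thread, while the `--run_id` flag spelling, the `build_run_commands` helper, and the `common_args` placeholder are assumptions. Replace `common_args` with the arguments listed in the vignette.

```python
import sys
from typing import List, Sequence


def build_run_commands(
    num_runs: int,
    base_seed: int = 1,
    common_args: Sequence[str] = (),
    script: str = "run_NEST.py",
) -> List[List[str]]:
    """Build one command line per run, each with a distinct seed.

    common_args is a placeholder for the remaining run_NEST.py arguments;
    the '--run_id' spelling is an assumption, not taken from the thread.
    """
    commands = []
    for run_id in range(1, num_runs + 1):
        seed = base_seed + run_id - 1  # distinct seed for every run
        commands.append([
            sys.executable, script, *common_args,
            "--manual_seed=yes",   # without this, --seed has no effect
            f"--seed={seed}",
            f"--run_id={run_id}",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of launching them, so they can be reviewed.
    for cmd in build_run_commands(num_runs=5):
        print(" ".join(cmd))
```

Each printed command could then be submitted as its own job, and the finished runs combined in the post-processing step.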
I see, thank you for the reply. It is not a RAM issue: the job used just 277MB out of the 30G requested.
Hi again, I'm trying to apply NEST to my data. 3 out of 5 training runs have failed at the very beginning with:
Traceback (most recent call last):
File "/nfs/cellgeni/pasham/code/nest/NEST/run_NEST.py", line 74, in
DGI_model = train_NEST(args, data_loader=data_loader, in_channels=num_feature)
File "/nfs/cellgeni/pasham/code/nest/NEST/CCC_gat.py", line 170, in train_NEST
pos_z, neg_z, summary = DGI_model(data=data)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch_geometric/nn/models/deep_graph_infomax.py", line 53, in forward
pos_z = self.encoder(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/nfs/cellgeni/pasham/code/nest/NEST/CCC_gat.py", line 86, in forward
x, attention_scores, attention_scores_unnormalized = self.conv(data.x, data.edge_index, edge_attr=data.edge_attr, return_attention_weights = True)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/nfs/cellgeni/pasham/code/nest/NEST/GATv2Conv_NEST.py", line 223, in forward
edge_index, edge_attr = remove_self_loops(
File "/usr/local/lib/python3.10/site-packages/torch_geometric/utils/loop.py", line 113, in remove_self_loops
edge_index = edge_index[:, mask]
RuntimeError: numel: integer multiplication overflow
The parameters for the two other, successful runs differed only by run_id, so the problem looks stochastic.
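For context on the error message: as far as I understand PyTorch, `numel` is the product of a tensor's dimension sizes, and this error is raised when that product no longer fits in a signed 64-bit integer. The sketch below reproduces that arithmetic in plain Python to show how far an edge index with the 405794 edges mentioned in this thread is from that limit. `checked_numel` is an illustrative helper, not a PyTorch function, and the `[2, num_edges]` layout of `edge_index` is assumed.

```python
from math import prod
from typing import Sequence

INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold


def checked_numel(shape: Sequence[int]) -> int:
    """Return the number of elements for `shape`, mimicking an int64 overflow check."""
    if any(dim < 0 for dim in shape):
        raise ValueError(f"negative dimension in shape {tuple(shape)}")
    n = prod(shape)  # Python integers do not overflow, so compare explicitly
    if n > INT64_MAX:
        raise OverflowError("numel: integer multiplication overflow")
    return n


# Element count of an edge index with 405794 edges, assuming shape [2, num_edges].
print(checked_numel((2, 405794)))
```

A tensor of that shape is nowhere near the limit, so the message may point to a nonsensical size in the result of `edge_index[:, mask]` rather than to a genuinely oversized tensor. This is a guess, not a diagnosis.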