schwartzlab-methods / NEST

Implementation of NEST method
GNU General Public License v3.0

RuntimeError: numel: integer multiplication overflow #2

Open iaaka opened 6 months ago

iaaka commented 6 months ago

Hi again, I'm trying to apply NEST to my data. Three out of five training runs have failed at the very beginning with:

```
Traceback (most recent call last):
  File "/nfs/cellgeni/pasham/code/nest/NEST/run_NEST.py", line 74, in <module>
    DGI_model = train_NEST(args, data_loader=data_loader, in_channels=num_feature)
  File "/nfs/cellgeni/pasham/code/nest/NEST/CCC_gat.py", line 170, in train_NEST
    pos_z, neg_z, summary = DGI_model(data=data)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_geometric/nn/models/deep_graph_infomax.py", line 53, in forward
    pos_z = self.encoder(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nfs/cellgeni/pasham/code/nest/NEST/CCC_gat.py", line 86, in forward
    x, attention_scores, attention_scores_unnormalized = self.conv(data.x, data.edge_index, edge_attr=data.edge_attr, return_attention_weights = True)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nfs/cellgeni/pasham/code/nest/NEST/GATv2Conv_NEST.py", line 223, in forward
    edge_index, edge_attr = remove_self_loops(
  File "/usr/local/lib/python3.10/site-packages/torch_geometric/utils/loop.py", line 113, in remove_self_loops
    edge_index = edge_index[:, mask]
RuntimeError: numel: integer multiplication overflow
```

The parameters of the two other, successful runs differed only in run_id, so the problem looks stochastic.

anne04 commented 6 months ago

Thank you for pointing out this 'integer multiplication overflow' error. It seems like the results occupy too much memory. Could you please try with more memory to see if that resolves it? Also, may I know how many nodes and edges are in your input graph?

By the way, you can manually set the seeds with two parameters when running the model: --manual_seed and --seed. See https://github.com/schwartzlab-methods/NEST/blob/main/vignette/workflow.md#run-nest-to-generate-ccc-list
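
For context, here is a minimal sketch of what fixing a seed typically pins down in a PyTorch pipeline like this one. It is illustrative only, not NEST's actual implementation, and the helper name `set_all_seeds` is made up:

```python
# Illustrative sketch only, not NEST's code: the random-number generators a
# --manual_seed / --seed pair would typically control in a PyTorch pipeline.
import random
import numpy as np
import torch

def set_all_seeds(seed: int) -> None:  # hypothetical helper name
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_all_seeds(1)
```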

iaaka commented 6 months ago

Just to elaborate a bit: in my experience the majority of runs fail, some with the error mentioned above and some with another one:

```
    warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
/usr/local/lib/python3.10/site-packages/torch/nn/functional.py:1956: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
  warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
Traceback (most recent call last):
  File "/nfs/cellgeni/pasham/code/nest/NEST/run_NEST.py", line 74, in <module>
    DGI_model = train_NEST(args, data_loader=data_loader, in_channels=num_feature)
  File "/nfs/cellgeni/pasham/code/nest/NEST/CCC_gat.py", line 170, in train_NEST
    pos_z, neg_z, summary = DGI_model(data=data)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_geometric/nn/models/deep_graph_infomax.py", line 53, in forward
    pos_z = self.encoder(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nfs/cellgeni/pasham/code/nest/NEST/CCC_gat.py", line 86, in forward
    x, attention_scores, attention_scores_unnormalized = self.conv(data.x, data.edge_index, edge_attr=data.edge_attr, return_attention_weights = True)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nfs/cellgeni/pasham/code/nest/NEST/GATv2Conv_NEST.py", line 225, in forward
    edge_index, edge_attr = add_self_loops(
  File "/usr/local/lib/python3.10/site-packages/torch_geometric/utils/loop.py", line 487, in add_self_loops
    loop_attr = compute_loop_attr(  #
  File "/usr/local/lib/python3.10/site-packages/torch_geometric/utils/loop.py", line 766, in compute_loop_attr
    return scatter(edge_attr, col, 0, num_nodes, fill_value)
  File "/usr/local/lib/python3.10/site-packages/torch_geometric/utils/_scatter.py", line 79, in scatter
    count.scatter_add_(0, index, src.new_ones(src.size(dim)))
RuntimeError: Expected index [703] to be smaller than self [1581] apart from dimension 0 and to be smaller size than src [0]
```

To get five runs to finish successfully I had to launch the job about 40 times.
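
Both tracebacks fail inside PyG's self-loop handling (remove_self_loops / add_self_loops), so one hedged way to narrow this down is to sanity-check the batched graph just before it is fed to the model. The snippet below assumes the torch_geometric.data.Data object is available as `data`, as in CCC_gat.py's forward; it is a diagnostic sketch, not part of NEST:

```python
# Hedged diagnostic sketch, assuming `data` is the torch_geometric.data.Data
# batch that DGI_model(data=data) receives; not part of NEST itself.
num_nodes = data.num_nodes
num_edges = data.edge_index.size(1)
print(f"nodes={num_nodes}, edges={num_edges}")

# edge_index must only reference nodes that exist in the batch
assert int(data.edge_index.min()) >= 0, "edge_index contains negative indices"
assert int(data.edge_index.max()) < num_nodes, "edge_index references a missing node"

# edge_attr must have exactly one row per edge
if data.edge_attr is not None:
    assert data.edge_attr.size(0) == num_edges, "edge_attr / edge_index size mismatch"
```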

iaaka commented 6 months ago

Hi @anne04, thank you for the reply. The GPU I used has 80 GB of memory, and I don't think we have a larger one; the graph has 405794 edges. I actually set the seed to 1 in all runs, but I just realized that I probably also have to set --manual_seed=yes for this to take effect? That is not clear from the manual.

anne04 commented 6 months ago

An 80 GB GPU sounds good enough for 405794 edges. I can run a graph of about 5000 nodes and roughly 1 million edges with one 32 GB GPU and 16 CPUs, each having 30 GB of memory. I am not sure whether that code segment in /usr/local/lib/python3.10/site-packages/torch_geometric/utils/loop.py uses CPU memory or GPU memory. Therefore, I would suggest allocating more CPU memory as well if you are not requesting enough.
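
One hedged way to tell whether the failing step is CPU- or GPU-memory bound is to log both around the forward call. The placement inside train_NEST() is an assumption, not NEST's actual code:

```python
# Hedged sketch: log peak CPU and GPU memory around the forward pass.
# Exactly where to call this inside train_NEST() is an assumption.
import resource
import torch

def log_memory(tag: str) -> None:
    # ru_maxrss is reported in KiB on Linux
    cpu_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    gpu_mib = torch.cuda.max_memory_allocated() / 1024**2 if torch.cuda.is_available() else 0.0
    print(f"[{tag}] peak CPU RSS: {cpu_mib:.0f} MiB, peak GPU allocated: {gpu_mib:.0f} MiB")

log_memory("before forward")
# pos_z, neg_z, summary = DGI_model(data=data)
log_memory("after forward")
```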

Yes, --manual_seed='yes' has to be set. I apologize for not making that clear in the tutorial; I have added that parameter to the vignette now.

Also, setting the same seed for all runs is not recommended: if one run fails to find a good minimum, all of them will, since they share the same seed. That is why we set different seeds for different runs and then ensemble them in the post-processing step to maximize the chance of finding a good minimum.
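
As an illustration of that multi-seed ensemble setup, here is a hedged launcher sketch. The --run_id, --seed, and --manual_seed flags come from this thread; any other arguments (data paths, etc.) are omitted and would need to be filled in for a real run:

```python
# Hedged sketch only: launch several independent NEST runs, each with its own
# run_id and seed, to be combined later in NEST's post-processing step.
# Flag names are taken from this thread; data-specific arguments are omitted.
import subprocess

for run_id in range(1, 6):
    subprocess.run(
        [
            "python", "run_NEST.py",
            f"--run_id={run_id}",
            f"--seed={run_id}",       # a different seed per run
            "--manual_seed=yes",
            # ... plus your data-specific arguments
        ],
        check=True,
    )
```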

iaaka commented 6 months ago

I see, thank you for the reply. It is not a RAM issue; the job used just 277 MB out of the 30 GB requested.