nsidn98 / InforMARL

Code for our paper: Scalable Multi-Agent Reinforcement Learning through Intelligent Information Aggregation
https://nsidn98.github.io/InforMARL/
MIT License

torch error #16

Closed LaPluma030 closed 4 months ago

LaPluma030 commented 5 months ago

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\Anaconda\envs\InforMARL\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.

I get this error when running:

python -u onpolicy/scripts/train_mpe.py --use_valuenorm --use_popart --project_name "informarl" --env_name "GraphMPE" --algorithm_name "rmappo" --seed 0 --experiment_name "informarl" --scenario_name "navigation_graph" --num_agents 3 --collision_rew 5 --n_training_threads 1 --n_rollout_threads 32 --num_mini_batch 1 --episode_length 25 --num_env_steps 200000 --ppo_epoch 10 --use_ReLU --gain 0.01 --lr 7e-4 --critic_lr 7e-4 --user_name "marl" --use_cent_obs "False" --graph_feat_type "relative" --auto_mini_batch_size --target_mini_batch_size 32

nsidn98 commented 5 months ago

Can you show me the torch and python version you are using?
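One way to gather this information in a single step is sketched below. It is only an illustration: the package list is an assumption based on the libraries mentioned in this thread, and the helper name `collect_versions` is hypothetical, not part of the InforMARL repository.

```python
import platform
import sys
from importlib import metadata

# Assumed package names, taken from the libraries discussed in this thread.
PACKAGES = ["torch", "torch-geometric", "torch-sparse", "torch-scatter"]


def collect_versions(packages=PACKAGES):
    """Return a dict of name -> installed version (None if not installed)."""
    versions = {"python": platform.python_version(), "platform": sys.platform}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per entry so the result can be pasted into an issue comment.
for name, version in collect_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Running this inside the activated conda environment reports the versions that the training script would actually import.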

LaPluma030 commented 5 months ago

I'm using torch 1.8.1+cu111, as requirement.txt specifies, and Python 3.9.

LaPluma030 commented 5 months ago

I think this issue is probably due to insufficient memory. I don't know whether train_mpe.py reads a lot of data when it runs. If so, is there a way to limit the program's memory usage? Thanks.

Yu-zx commented 5 months ago

I also encountered this problem when using PyCharm on Windows, but there was no such problem when I used Linux. Could the cause be that the program needs too much memory?

nsidn98 commented 5 months ago

Limited memory could be the issue. I did not face any issues while running the code on Linux and macOS.

One thing you could try is reducing the number of rollout threads (--n_rollout_threads 2) and checking whether the code runs on Windows. This will make training quite slow, but it is worth checking whether the high number of parallel processes is the issue.
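As a rough sketch of that suggestion, the snippet below rebuilds the training command from the original report with a smaller rollout thread count. The flags are copied from the command posted earlier in this thread; the helper name `build_train_command` is hypothetical and only used for illustration.

```python
import shlex
import sys


def build_train_command(n_rollout_threads=2):
    """Return the training command from this thread with a custom thread count."""
    return [
        sys.executable, "-u", "onpolicy/scripts/train_mpe.py",
        "--use_valuenorm", "--use_popart",
        "--project_name", "informarl",
        "--env_name", "GraphMPE",
        "--algorithm_name", "rmappo",
        "--seed", "0",
        "--experiment_name", "informarl",
        "--scenario_name", "navigation_graph",
        "--num_agents", "3",
        "--collision_rew", "5",
        "--n_training_threads", "1",
        # Reduced from 32 in the original report to lower memory pressure.
        "--n_rollout_threads", str(n_rollout_threads),
        "--num_mini_batch", "1",
        "--episode_length", "25",
        "--num_env_steps", "200000",
        "--ppo_epoch", "10",
        "--use_ReLU",
        "--gain", "0.01",
        "--lr", "7e-4",
        "--critic_lr", "7e-4",
        "--user_name", "marl",
        "--use_cent_obs", "False",
        "--graph_feat_type", "relative",
        "--auto_mini_batch_size",
        "--target_mini_batch_size", "32",
    ]


# Show the command; pass the list to subprocess.run(...) from the repo root to launch it.
print(shlex.join(build_train_command(2)))
```

If the run succeeds with 2 threads, the count can be raised step by step to find the largest value the machine can handle.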

LaPluma030 commented 5 months ago

ok, I'll try this on linux

LaPluma030 commented 5 months ago

Traceback (most recent call last):
  File "/mnt/InforMARL/onpolicy/scripts/train_mpe.py", line 315, in <module>
    main(sys.argv[1:])
  File "/mnt/InforMARL/onpolicy/scripts/train_mpe.py", line 289, in main
    runner = Runner(config)
  File "/mnt/InforMARL/onpolicy/runner/shared/graph_mpe_runner.py", line 24, in __init__
    super(GMPERunner, self).__init__(config)
  File "/mnt/InforMARL/onpolicy/runner/shared/base_runner.py", line 79, in __init__
    from onpolicy.algorithms.graph_mappo import GR_MAPPO as TrainAlgo
  File "/mnt/InforMARL/onpolicy/algorithms/graph_mappo.py", line 8, in <module>
    from onpolicy.algorithms.graph_MAPPOPolicy import GR_MAPPOPolicy
  File "/mnt/InforMARL/onpolicy/algorithms/graph_MAPPOPolicy.py", line 7, in <module>
    from onpolicy.algorithms.graph_actor_critic import GR_Actor, GR_Critic
  File "/mnt/InforMARL/onpolicy/algorithms/graph_actor_critic.py", line 9, in <module>
    from onpolicy.algorithms.utils.gnn import GNNBase
  File "/mnt/InforMARL/onpolicy/algorithms/utils/gnn.py", line 6, in <module>
    import torch_geometric
  File "/root/anaconda3/envs/InforMARL/lib/python3.9/site-packages/torch_geometric/__init__.py", line 4, in <module>
    import torch_geometric.data
  File "/root/anaconda3/envs/InforMARL/lib/python3.9/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/root/anaconda3/envs/InforMARL/lib/python3.9/site-packages/torch_geometric/data/data.py", line 9, in <module>
    from torch_sparse import SparseTensor
  File "/root/anaconda3/envs/InforMARL/lib/python3.9/site-packages/torch_sparse/__init__.py", line 14, in <module>
    torch.ops.load_library(importlib.machinery.PathFinder().find_spec(
  File "/root/anaconda3/envs/InforMARL/lib/python3.9/site-packages/torch/_ops.py", line 104, in load_library
    ctypes.CDLL(path)
  File "/root/anaconda3/envs/InforMARL/lib/python3.9/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.11.0: cannot open shared object file: No such file or directory

I tried running train_mpe.py on Linux and got the error above. Have you ever encountered this problem? It seems to happen when trying to import torch_geometric.

LaPluma030 commented 5 months ago

By the way, my CUDA version is 10.2.
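The traceback shows the dynamic loader failing to open libcudart.so.11.0 while torch_sparse is being imported. A minimal sketch for checking which CUDA runtime library the loader can actually see is given below. It uses only the standard ctypes module; the helper name `can_load` is hypothetical, and the library name is the one from the traceback.

```python
import ctypes
import ctypes.util


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


# The soname below is the one reported in the traceback in this thread.
print("libcudart.so.11.0 loadable:", can_load("libcudart.so.11.0"))

# find_library returns the name or path of a matching library, or None if none is found.
print("cudart found by find_library:", ctypes.util.find_library("cudart"))
```

If the first check fails while a different cudart version is found, one plausible reading is that the installed torch_sparse binary was built for a different CUDA version than the one available on the machine.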

LaPluma030 commented 5 months ago

I've solved the problems above, thanks

Yu-zx commented 5 months ago

How did you solve it? Did you install the CUDA (GPU) version? If so, could you take a look?

Yu-zx commented 5 months ago

ERROR: Could not find a version that satisfies the requirement sip<4.20,>=4.19.4 (from pyqt5) (from versions: 5.0.0, 5.0.1, 5.1.0, 5.1.1, 5.1.2, 5.2.0, 5.3.0, 5.4.0, 5.5.0, 6.0.0, 6.0.1, 6.0.2, 6.0.3, 6.1.0, 6.1.1, 6.2.0, 6.3.0, 6.3.1, 6.4.0, 6.5.0, 6.5.1, 6.6.0, 6.6.1, 6.6.2, 6.7.0, 6.7.1, 6.7.2, 6.7.3, 6.7.4, 6.7.5, 6.7.6, 6.7.7, 6.7.8, 6.7.9, 6.7.10, 6.7.11, 6.7.12, 6.8.0, 6.8.1, 6.8.2, 6.8.3)
ERROR: No matching distribution found for sip<4.20,>=4.19.4

Have you encountered this kind of problem?

LaPluma030 commented 5 months ago

Replying to the question about how I solved it: in my case CUDA was not the problem. I reinstalled torch-geometric and torch-sparse, and then it worked.

LaPluma030 commented 5 months ago

Regarding the sip<4.20,>=4.19.4 error from pyqt5: I haven't encountered it. It seems to be a PyQt problem, so maybe you can try other versions.
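To illustrate why pip reports no matching distribution, the sketch below checks a few of the versions listed in the error against the requested range sip<4.20,>=4.19.4. This is a simplified comparison of dotted integer versions, not pip's full version-specifier logic, and the helper names are hypothetical.

```python
def parse(version):
    """Turn a dotted version string such as '6.7.12' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# A subset of the versions listed in the pip error message in this thread.
AVAILABLE = ["5.0.0", "5.5.0", "6.0.0", "6.7.12", "6.8.3"]


def satisfying(available, lower="4.19.4", upper="4.20"):
    """Return the versions v with lower <= v < upper."""
    lo, hi = parse(lower), parse(upper)
    return [v for v in available if lo <= parse(v) < hi]


# Every listed version is 5.0.0 or newer, so none falls inside the range.
print(satisfying(AVAILABLE))  # prints []
```

Because every version pip can see is newer than the upper bound, the requirement cannot be met as written, which is consistent with trying a different PyQt version.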

Yu-zx commented 5 months ago

Could you run conda list and share the versions of all your installed packages?

nsidn98 commented 5 months ago

@Yu-zx, have you tried the following for installing torch-geometric?

TORCH="1.8.0"
CUDA="cu102"
pip install --no-index torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html --user
pip install --no-index torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html --user
pip install torch-geometric --user

And have you checked this out for pyqt?

Let me know if any of these work for you.
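The wheel index URL in the install commands above follows a simple pattern built from the torch version and the CUDA tag. A small sketch of that pattern is given below so the values can be adapted to another setup; the function names are hypothetical, and the URL format is taken directly from the commands in this comment.

```python
def pyg_wheel_index(torch_version, cuda_tag):
    """Build the wheel index URL used by the install commands in this thread."""
    return f"https://data.pyg.org/whl/torch-{torch_version}+{cuda_tag}.html"


def pip_install_args(package, torch_version, cuda_tag):
    """Return the pip arguments for installing one package from that index."""
    return [
        "pip", "install", "--no-index", package,
        "-f", pyg_wheel_index(torch_version, cuda_tag),
        "--user",
    ]


# Values from the comment above: TORCH="1.8.0", CUDA="cu102".
for package in ("torch-scatter", "torch-sparse"):
    print(" ".join(pip_install_args(package, "1.8.0", "cu102")))
```

The torch version and CUDA tag passed in should match the torch build that is actually installed in the environment.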

Yu-zx commented 4 months ago

Thanks

nsidn98 commented 4 months ago

Closing this assuming the issue has been resolved. Please re-open if the issue still persists.