Closed tao123322 closed 2 years ago
I am not running the program in distributed mode. How should I modify it?
Hi, thanks for the report. Problems 2 and 3 are caused by problem 1: the files they mention are generated only once the program runs successfully. Problem 1 looks like a failure in the distributed launch, which could come from the environment or from something else. First check whether your machine has GPUs 4 and 5, which are the IDs I use in the script. If it does not, set the GPU IDs you do have in the script. To run on a single GPU instead, open cmds/20/motif/predcls/semi/em_E_step1.sh and change CUDA_VISIBLE_DEVICES=4,5 to CUDA_VISIBLE_DEVICES=0 and --nproc_per_node=2 to --nproc_per_node=1.
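The two substitutions described above can be sketched as a small Python helper. This is only an illustration: the function name `to_single_gpu` and the example launch line (including `train.py`) are hypothetical, and the real contents of em_E_step1.sh may differ, so check the result before running it.

```python
import re


def to_single_gpu(script_text: str, gpu_id: int = 0) -> str:
    """Return the launch script text rewritten for one GPU and one process."""
    # Pin the run to a single visible device (replaces e.g. "4,5").
    text = re.sub(
        r"CUDA_VISIBLE_DEVICES=[0-9,]+",
        f"CUDA_VISIBLE_DEVICES={gpu_id}",
        script_text,
    )
    # Launch exactly one process so it matches the single device.
    text = re.sub(r"--nproc_per_node=\d+", "--nproc_per_node=1", text)
    return text


# Hypothetical launch line; the real script carries more arguments.
example = (
    "CUDA_VISIBLE_DEVICES=4,5 python -m torch.distributed.launch "
    "--nproc_per_node=2 train.py"
)
print(to_single_gpu(example))
```

To apply it to the actual file, read the script text, pass it through the helper, and write the result to a copy rather than overwriting the original. Editing the two values by hand works just as well.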
When I run sh cmds/20/motif/predcls/semi/em_E_step1.sh, the following three problems occur:
1. File "/home/tao/anaconda3/envs/scene_graph_benchmark/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
"Default process group is not initialized"
AssertionError: Default process group is not initialized
2. Traceback (most recent call last):
File "score.py", line 4, in
l = pickle.load(open("raw_em_E.pk", "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'raw_em_E.pk'
3. Traceback (most recent call last):
File "cut_off.py", line 8, in
score = json.load(open("score.json", "r"))
FileNotFoundError: [Errno 2] No such file or directory: 'score.json'
How can I solve the first problem, and where can I find the files mentioned in the second and third problems?