Closed wujcan closed 3 years ago
For problem 1, it may be caused by dependency issues. What Python and PyTorch versions do you use?
For problem 2, the hang is usually caused by running out of memory. I strongly recommend a server with about 100GB of RAM to run the code, which is optimized for industrial-scale servers. If you have to use a PC with limited RAM, you will need to make some modifications. For example, you may consider limiting the chunk size to a smaller value (line 29) and setting num_workers=1 (line 27): https://github.com/xue-pai/OpenMatch/blob/master/deem/metrics.py#L29
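To make the trade-off concrete, here is a minimal sketch (not the actual code in deem/metrics.py; the function and parameter names are illustrative) of how chunked evaluation bounds peak memory: a smaller chunk_size keeps less data resident at once, and num_workers=1 avoids each worker holding its own chunk.

```python
from concurrent.futures import ProcessPoolExecutor

def mean(xs):
    return sum(xs) / len(xs)

def evaluate_in_chunks(scores, chunk_size=1000, num_workers=1):
    # Split the scores into chunks so only one chunk's metrics are
    # computed at a time; smaller chunk_size lowers peak memory
    # at the cost of speed.
    chunks = [scores[i:i + chunk_size] for i in range(0, len(scores), chunk_size)]
    if num_workers <= 1:
        # Sequential path: lowest memory footprint for small-RAM machines.
        return mean([mean(c) for c in chunks])
    # Parallel path: each worker materializes its own chunk,
    # multiplying memory use by roughly num_workers.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        return mean(list(pool.map(mean, chunks)))
```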
My environment is: python=3.8.11 pytorch=1.9.1 cuda=11.1
I have run the code on a server with 128GB of memory and a 2080 Ti GPU, but it got stuck at the same position, and it raised an error at the second epoch when setting parallel to False. I have also tried setting num_workers=1, but it got stuck too.
Please use Python 3.6.x and PyTorch 1.0.x, which are the only versions we have tested.
For problem 2: if you have 128GB of memory, the code is not actually getting stuck. The progress bars track each of the sub-processes evaluating metrics. The evaluation is slow, and it is normal for a progress bar to halt for several minutes until its sub-process finishes.
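As a toy illustration (not the repository's actual code), a per-sub-process progress bar only advances once a whole chunk finishes, so a bar that sits still for minutes is usually still working:

```python
import time
from multiprocessing import Pool

# Each sub-process computes metrics for one large chunk; progress is
# only reported when the entire chunk is done, which can look like a hang.
def score_chunk(chunk):
    time.sleep(0.05)  # stands in for minutes of real metric computation
    return sum(chunk) / len(chunk)

if __name__ == "__main__":
    chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    with Pool(processes=2) as pool:
        for i, result in enumerate(pool.imap(score_chunk, chunks), 1):
            print(f"chunk {i}/{len(chunks)} finished: mean={result}")
```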
Hi, could you rereproduce the results after fixing the problem?
Yes, after downgrading Python to 3.6.x and torch to 1.9.0, it seems to work well now.
PS: The problem has been fixed for Python 3.7.
When I run with the following command: "cd benchmarks; python run_param_tuner.py --config Yelp18/MF_CCL_yelp18_x0/MF_CCL_yelp18_x0_tuner_config.yaml --gpu 0", I get an error at the second epoch.
Also, when parallel is enabled in evaluate_metrics, the code gets stuck here, so I had to set parallel to False during evaluation.
Another problem is that the code is quite memory-intensive: I run it on a PC with 32GB of memory, and memory usage reaches 100% during evaluation (with parallel set to False; otherwise it gets stuck, as mentioned above).
Any solutions?