reczoo / RecBox

A box of core libraries for recommendation model development
Apache License 2.0

Error: "generator raised StopIteration" #3

Closed: wujcan closed this issue 3 years ago

wujcan commented 3 years ago

When I run the following command: "cd benchmarks; python run_param_tuner.py --config Yelp18/MF_CCL_yelp18_x0/MF_CCL_yelp18_x0_tuner_config.yaml --gpu 0", I get an error at the second epoch. error_log

Also, when parallel is enabled in evaluate_metrics, the code gets stuck here, so I had to set parallel to False during evaluation. error_log2

Another problem is that the code is quite memory-hungry: I ran it on a PC with 32 GB of RAM, but memory usage reached 100% during evaluation (parallel was set to False, otherwise it gets stuck as mentioned above).

Any solutions?

xpai commented 3 years ago

For problem 1, it may be caused by dependency issues. Which Python and PyTorch versions do you use?

For problem 2, getting stuck is usually caused by running out of memory. I strongly recommend a server with about 100 GB of RAM to run the code; it is optimized for servers in industrial use. If you have to use a PC with limited RAM, you will need to make some modifications. For example, you could limit the chunk size to a smaller value (line 29) and set num_workers=1 (line 27): https://github.com/xue-pai/OpenMatch/blob/master/deem/metrics.py#L29
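For illustration, here is a minimal sketch of that pattern. The chunk_size and num_workers names mirror the knobs referenced above, but the function itself is illustrative, not the actual metrics.py code:

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import numpy as np

def _score_chunk(chunk, item_embs):
    # One block of the user-item score matrix.
    return chunk @ item_embs.T

def chunked_scores(user_embs, item_embs, chunk_size=1000, num_workers=1):
    # A smaller chunk_size bounds peak memory per step; num_workers=1 avoids
    # multiprocessing hangs and oversubscription on small machines.
    chunks = [user_embs[i:i + chunk_size]
              for i in range(0, len(user_embs), chunk_size)]
    if num_workers <= 1:
        parts = [_score_chunk(c, item_embs) for c in chunks]
    else:
        with ProcessPoolExecutor(max_workers=num_workers) as pool:
            parts = list(pool.map(partial(_score_chunk, item_embs=item_embs),
                                  chunks))
    return np.vstack(parts)
```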

wujcan commented 3 years ago

> For problem 1, it may be caused by dependency issues. Which Python and PyTorch versions do you use?
>
> For problem 2, getting stuck is usually caused by running out of memory. I strongly recommend a server with about 100 GB of RAM to run the code; it is optimized for servers in industrial use. If you have to use a PC with limited RAM, you will need to make some modifications. For example, you could limit the chunk size to a smaller value (line 29) and set num_workers=1 (line 27): https://github.com/xue-pai/OpenMatch/blob/master/deem/metrics.py#L29

My environment is: python=3.8.11 pytorch=1.9.1 cuda=11.1

I have run the code on a server with 128 GB of memory and a 2080 Ti GPU, but it got stuck at the same position, and with parallel set to False it got an error at the second epoch. I have also tried setting num_workers=1, but it got stuck too.

zhujiem commented 3 years ago

Please use Python 3.6.x and pytorch=1.0.x; these are the only versions we have tested.
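The error message matches the PEP 479 behavior change introduced in Python 3.7: a StopIteration that escapes a generator body is converted into a RuntimeError. A minimal reproduction of the Python behavior (not RecBox's actual code):

```python
def batches(it, size):
    it = iter(it)
    while True:
        batch = []
        for _ in range(size):
            batch.append(next(it))  # StopIteration leaks out when `it` runs dry
        yield batch

# Python <= 3.6: the leaked StopIteration silently ends the generator
# (dropping the final partial batch).
# Python >= 3.7 (PEP 479): it is re-raised as
#   RuntimeError: generator raised StopIteration
print(list(batches(range(5), 2)))
```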

For problem 2: if you have 128 GB of memory, the issue is not actually "getting stuck". The progress bar tracks each sub-process evaluating metrics. Evaluation is slow, and it is normal for the progress bars to halt for several minutes until a sub-process finishes.

zhujiem commented 3 years ago

Hi, could you reproduce the results after fixing the problem?

wujcan commented 3 years ago

> Hi, could you reproduce the results after fixing the problem?

Yes, after I downgraded Python to 3.6.x and torch to 1.9.0, it seems to work well now.

xpai commented 1 year ago

PS: The problem has been fixed for Python 3.7.
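The usual fix for PEP 479 errors is to stop letting StopIteration escape the generator, e.g. by catching it and returning. A sketch of that pattern, continuing the example above (the actual patch is not shown in this thread):

```python
def batches(it, size):
    it = iter(it)
    while True:
        batch = []
        for _ in range(size):
            try:
                batch.append(next(it))
            except StopIteration:
                if batch:
                    yield batch  # emit the final partial batch
                return  # end the generator explicitly instead of leaking
        yield batch

print(list(batches(range(5), 2)))  # [[0, 1], [2, 3], [4]] on any Python 3.x
```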