pytorch / ELF

ELF: a platform for game research with AlphaGoZero/AlphaZero reimplementation

Updated ELF still returning exceeded memory error #107

Open downseq opened 5 years ago

downseq commented 5 years ago

Using gogui-twogtp: https://www.mankier.com/1/gogui-twogtp

System spec: 150 GB RAM, Tesla V100 GPU

The job gets killed after about 2.5 games, using the following settings:

./gtp.sh ~/v1.bin --gpu 0 --num_block 20 --dim 224 --mcts_puct 1.5 \
  --batchsize 2 --mcts_rollout_per_batch 2 --mcts_threads 2 \
  --mcts_rollout_per_thread 250 --resign_thres 0.00 --mcts_virtual_loss 1

From log:

[2018-10-17 17:46:47.497] [elf::ai::tree_search::MCTSAI_T-22] [info] [-1] MCTSAI Result: BestA: [B9][bi][191], MaxScore: 3, Info: -2.97157/3 (-0.990524), Pr: 0.0101511, child node: 21109
Action: 191
MCTS: 1239.9ms. Total: 1239.9ms.
B<< B<< = B9
B<< B<<
W>> play B B9
W<< W<< =
W<< W<<
W>> genmove w
slurmstepd: error: Job [omitted] exceeded memory limit (153936196 > 153600000), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB [omitted] ON [omitted] CANCELLED AT 2018-10-18T04:46:48
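
(For scale: slurm reports these numbers in KB, so the job was killed at about 146.8 GiB against a cap of about 146.5 GiB, i.e. memory crept just past the allocation after 2.5 games.)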

Could this be an error on the twogtp side? I'm not sure how to run self-play without twogtp, using just the ELF system, for competitive play (not self-play training).


jma127 commented 5 years ago

Using the command you provided, I observe that memory usage reaches an asymptote of roughly 2.5 GB.

Not sure whether twogtp would cause any issues. cc @qucheng

downseq commented 5 years ago

I found a workaround: running games individually so that memory usage resets between games. If I get some spare time, I might confirm later whether it was twogtp-related.

For those who are interested, this is the script I ran to work around it (you can adjust the number of games etc. depending on how much memory is being used):

#!/bin/bash
# Run 50 one-game matches instead of one 50-game match: each gogui-twogtp
# invocation exits after its game, so the engines restart and their memory
# is reclaimed before the next game begins.
BLACK="player_b.sh"
WHITE="player_w.sh"
for i in {1..50}; do
  ./gogui-twogtp -black "$BLACK" -white "$WHITE" -games 1 \
    -size 19 -sgffile "game_filename_$i" -auto -verbose -debugtocomment -komi 7.5
done

qucheng commented 5 years ago

twogtp just launches two copies of ELF, which together consume roughly 5 GB. That might exceed the maximum memory on some hardware.
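
For anyone who wants to verify the footprint on their own hardware, polling ps while a match runs is enough. A minimal sketch, assuming the engine processes can be matched by the pattern gtp.sh (adjust the pattern to whatever your wrapper scripts are called):

#!/bin/bash
# Print the resident memory (ps reports RSS in KB) of every process whose
# command line matches "gtp.sh", once a minute, until none is left running.
while pids=$(pgrep -d, -f gtp.sh); do
  ps -o rss=,args= -p "$pids" | awk '{printf "%7.2f GB  %s\n", $1 / 1048576, $2}'
  sleep 60
done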

downseq commented 5 years ago

For experimental reasons, could you point me to the code that removes the entire subtree created by the AI after each move (not just the unused portion)?

yuandong-tian commented 5 years ago

@downseq Removing --persistent_tree will clean up the existing tree before each move. See here: https://github.com/pytorch/ELF/blob/master/src_cpp/elf/ai/tree_search/mcts.h#L142
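
For reference, a quick way to find where that option is declared and consumed is to search the checkout. A minimal sketch, assuming only a local clone of the repository:

# Locate the declaration and every use of the persistent_tree option.
git clone https://github.com/pytorch/ELF.git
grep -rn "persistent_tree" ELF/src_cpp/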

downseq commented 5 years ago

> @downseq Removing --persistent_tree will clean up the existing tree before each move. See here: https://github.com/pytorch/ELF/blob/master/src_cpp/elf/ai/tree_search/mcts.h#L142

Thanks for the confirmation.

So it seems keeping the subtree after each move is not the default, unlike in the AlphaGo Zero paper?

yuandong-tian commented 5 years ago

@downseq It is always helpful: if memory allows, keeping the subtree boosts performance at zero additional cost. So why not?

downseq commented 5 years ago

I think there was still some confusion as to whether it is on by default, but it seems it is left on by default, so maybe I misinterpreted your earlier comment.