kaesve / muzero

A clean implementation of MuZero and AlphaZero following the AlphaZero General framework. Train and Pit both algorithms against each other, and investigate reliability of learned MuZero MDP models.
MIT License

Missing `/out/MuZeroOut/board\r_temp.pth.tar` after backpropagation #4

Closed sherpal closed 3 years ago

sherpal commented 3 years ago

Hello,

I'm trying to launch the training for "hex" on my machine. The command I'm using is:

python Main.py train -c Configurations/ModelConfigs/MuzeroBoard.json --game hex --gpu 1

I haven't touched anything in the configuration, so the settings are the ones from master.

The 50 self-play iterations run successfully, and so do the 100 iterations of backpropagation. However, once that finishes, I get the following error:

Traceback (most recent call last):
  File "Main.py", line 202, in <module>
    switch[content.algorithm](game, content, run_name)
  File "Main.py", line 86, in learnM0
    c.learn()
  File "C:\Users\antoi\projects\muzero\Coach.py", line 175, in learn
    self.opponent_net.load_checkpoint(folder=self.args.checkpoint, filename='temp.pth.tar')
  File "C:\Users\antoi\projects\muzero\MuZero\MuNeuralNet.py", line 243, in load_checkpoint
    raise FileNotFoundError(f"No MuZero Representation Model in path {representation_path}")
FileNotFoundError: No MuZero Representation Model in path ./out/MuZeroOut/board\r_temp.pth.tar

Indeed, the files I have in that folder are the following:

25/02/2021  08:11             1.231 boardgames_Hex_hex_20210225-081121.json
25/02/2021  08:39                97 checkpoint
25/02/2021  08:39         1.834.891 checkpoint_0.pth.tar.examples
25/02/2021  08:39         2.437.259 decoder_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39               930 decoder_temp.pth.tar.index
25/02/2021  08:39         2.378.703 d_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39             1.046 d_temp.pth.tar.index
25/02/2021  08:39         1.379.677 p_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39               715 p_temp.pth.tar.index
25/02/2021  08:39         2.437.116 r_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39               878 r_temp.pth.tar.index

Here are the versions of the libs I use:

I'm running on Windows 10 with CUDA 11 and, if it matters, a GTX1070 as GPU.

joeryjoery commented 3 years ago

Hi @sherpal, I've also encountered this issue multiple times. The cause is that the model checkpoints are saved in multiple files, as indicated by the .data-XXX...
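To illustrate the mismatch: the traceback shows the loader looking for a file named exactly `r_temp.pth.tar`, while the directory listing contains `r_temp.pth.tar.index` and `r_temp.pth.tar.data-00000-of-00001`. A minimal sketch of an existence check that accepts both layouts could look like the following. The helper name `tf_checkpoint_exists` is hypothetical and is not part of this repository; it only assumes the file naming visible in the listing above.

```python
import os


def tf_checkpoint_exists(prefix: str) -> bool:
    """Return True if `prefix` names a saved checkpoint in either layout.

    Assumption (taken from the directory listing in this thread): a
    multi-file save leaves `prefix + ".index"` next to one or more
    `prefix + ".data-XXXXX-of-XXXXX"` shards, while a single-file save
    leaves a file exactly at `prefix`.
    """
    # Single-file layout: the path itself is the checkpoint.
    if os.path.isfile(prefix):
        return True
    # Multi-file layout: the index file marks a complete checkpoint.
    return os.path.isfile(prefix + ".index")
```

With the files listed above, `tf_checkpoint_exists(os.path.join(folder, "r_temp.pth.tar"))` would return True, whereas a plain check for the exact file name fails.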

This is new behaviour in TensorFlow versions after 2.1, which we did not test. A quick fix would be to downgrade to:

sherpal commented 3 years ago

Indeed, downgrading (almost) worked.

There was still a catch: the h5py package made a breaking change in its 3.x release, so I hit the following issue: https://github.com/keras-team/keras/issues/14265. However, running

pip uninstall h5py
pip install h5py==2.10

fixed it.
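For anyone who wants to catch this before a long training run, a small version guard is one option. This is only a sketch: the function name `h5py_is_pre_3` is made up for illustration, and the only assumption is the one stated above, that the breaking change arrived with h5py 3.x.

```python
def h5py_is_pre_3(version: str) -> bool:
    """Return True if an h5py version string such as "2.10.0" is below 3.x."""
    # Only the major component matters for this check.
    major = int(version.split(".")[0])
    return major < 3


if __name__ == "__main__":
    try:
        import h5py  # optional: only needed for the live check below
    except ImportError:
        print("h5py is not installed")
    else:
        if not h5py_is_pre_3(h5py.__version__):
            print(f"h5py {h5py.__version__} found; consider pinning h5py==2.10")
```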