hardmaru / WorldModelsExperiments

World Models Experiments

tf_vae.json empty after running vae_train.py #31

Open asolano opened 5 years ago

asolano commented 5 years ago

Greetings,

I am trying to reproduce the experiment on a DGX station I currently have access to. The first two steps look alright, but the result of the command:

$ python vae_train.py
...
step 298000 35.82913 3.7688284 32.0603
step 298500 34.947067 2.9355032 32.011562
step 299000 35.83263 3.8249977 32.007633
step 299500 36.45114 4.418231 32.03291
step 300000 35.098816 3.0974069 32.001408
step 300500 35.483387 3.4664068 32.01698
step 301000 35.43274 3.4285662 32.004173

is an empty array:

$ cat tf_vae/vae.json 
[]
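For context, the training script serializes the VAE's weights to vae.json as a JSON list of parameter arrays, so an empty list means no trainable variables were collected at save time. A minimal sketch of that failure mode (the function name here is hypothetical, not the repo's exact API):

```python
import json
import os
import tempfile

def save_params_as_json(variables, path):
    # Dump each variable's values, rounded for compactness, as a
    # nested JSON list. If no variables were collected, the file
    # contains just "[]" -- exactly the symptom in this issue.
    qparams = [[round(float(x), 4) for x in v] for v in variables]
    with open(path, "w") as f:
        json.dump(qparams, f)

path = os.path.join(tempfile.mkdtemp(), "vae.json")
save_params_as_json([], path)        # nothing collected
print(open(path).read())             # -> []
```

So the bug is upstream of the save: the variable-collection step returns an empty list, and the JSON writer faithfully records it.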

According to the documentation the model should be saved in that file, so any hint about where to look for the problem is appreciated.

Thanks,

Alfredo

PS: I am using the following Dockerfile to recreate the environment from the paper, in case it might be relevant:

FROM tensorflow/tensorflow:1.8.0-gpu-py3

# gym-doom requirements
RUN apt-get update && apt-get install -y --no-install-recommends \
        cmake \
        zlib1g-dev \
        libjpeg-dev \
        libboost-all-dev \
        gcc \
        libsdl2-dev \
        wget \
        unzip \
        python3-tk \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# make python3 the default
RUN update-alternatives --remove python /usr/bin/python2 && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3 10

# NOTE overriding numpy version to match the paper's
# NOTE numpy==1.13.3 gives an error importing vizdoom
RUN pip install --upgrade pip && \
    pip install --no-cache-dir --user --upgrade \
        gym==0.9.4 \
        ppaquette-gym-doom==0.0.6 \
        cma==2.2.0  \
        mpi4py==2.0.0

ENTRYPOINT ["/bin/bash"]
leekwoon commented 5 years ago

Hi,

I think the problem comes from the location of

with tf.variable_scope('conv_vae', reuse=self.reuse): in the __init__ function.

I addressed this problem by moving it into the _build_graph function:

def _build_graph(self):
    self.g = tf.Graph()
    with self.g.as_default():
      with tf.variable_scope('conv_vae', reuse=self.reuse):
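If I understand the fix correctly, the point is that the scope must be entered on the same graph where the layers are actually created: _build_graph makes a fresh tf.Graph(), so a scope opened earlier in __init__ never applies to the variables built there, and collecting variables by scope name later comes back empty. A framework-free sketch of that mismatch (the Graph class and method names here are illustrative, not TensorFlow's API):

```python
class Graph:
    """Toy stand-in for a graph that tracks variables by scoped name."""

    def __init__(self):
        self.variables = {}

    def create_variable(self, scope, name, value):
        self.variables[f"{scope}/{name}"] = value

    def collect(self, scope):
        # Analogous to collecting variables by scope prefix at save time.
        return [v for k, v in self.variables.items()
                if k.startswith(scope + "/")]

# Bug: the scope is decided against the graph that exists in __init__,
# but _build_graph creates its own new graph and builds variables there.
inner = Graph()
inner.create_variable("", "kernel", 1.0)   # scope never applied
print(inner.collect("conv_vae"))           # -> [] : nothing to save

# Fix: open the scope where the graph and its variables are built.
fixed = Graph()
fixed.create_variable("conv_vae", "kernel", 1.0)
print(fixed.collect("conv_vae"))           # -> [1.0]
```

That would explain the empty vae.json: the saver queries the 'conv_vae' scope on the new graph and finds nothing.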
asolano commented 4 years ago

Thanks for your suggestion, @leekwoon.

I no longer have access to the DGX station, but I tried the change on an AWS instance. So far it looks good; the vae.json file was generated.

Do you have a fork or pull request I can check out for any other necessary changes before continuing the training? I suspect there may be more trouble ahead.

asolano commented 4 years ago

FWIW, I did find that step 3 of the GPU jobs showed some problems with the patched code, so after a bit of failed troubleshooting I just decided to go back to a commit from around the time the paper was published (c0cb2de) and try again. Everything worked as expected without changing any code. 👍

hardmaru commented 4 years ago

Thanks for the testing, @asolano. Maybe I should just roll back the code to that time...