Incomplete history files in google colab (free version)

ai-honzik commented 4 years ago

Also originally discussed here.

When running simulations on Google Colab (free version), most of the history files are incomplete (e.g. they are missing the material description). My guess was that this has something to do with CUDA's printf buffer.
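To illustrate the guess above: the CUDA documentation describes the device-side printf buffer as a fixed-size circular buffer in which older output is overwritten once more is produced than fits. The sketch below is only a toy model of that behaviour in Python. The record strings, the capacity of 20 and the function name `simulate_ring_buffer` are hypothetical, and nothing here confirms that this is what actually happens in voxcraft-sim.

```python
from collections import deque


def simulate_ring_buffer(records, capacity):
    """Keep only the newest `capacity` records, as a fixed-size circular buffer would."""
    buf = deque(maxlen=capacity)
    for rec in records:
        buf.append(rec)  # once the buffer is full, the oldest record is silently dropped
    return list(buf)


# Hypothetical run: one header record followed by one record per recorded step.
records = ["{setting}"] + [f"<<<Step{i}>>>" for i in range(0, 5500, 100)]
kept = simulate_ring_buffer(records, capacity=20)

print("header survived:", "{setting}" in kept)
print("first surviving record:", kept[0])
print("last surviving record:", kept[-1])
```

In this model the header and the early steps are the first things to disappear, which would match a history file that starts in the middle of the run and lacks its settings block.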

This happens even with stable simulation runs, here's what the history file looks like:

1 GPU found.
=== set device to 0 for 1 simulations ===

Total GPU memory 11996954624 bytes.
Set GPU heap size to be 5998477312 bytes.
;<<<>>>
<<<Step3154 Time:1.721566>>>2.2,0.0,-2.8,13.9,0.00,-0.12,0.00,-3.6,-3.6,-3.6,4.5,3.6,3.6,1,0.0,;11.0,0.0,-1.9,2.1,0.00,0.02,0.00,-4.5,-3.6,-3.6,4.4,3.6,3.6,1,0.0,;19.7,0.0,-2.0,9.3,0.00,-0.08,0.00,-4.4,-3.6,-3.6,4.6,3.6,3.6,1,0.0,;29.4,0.0,0.1,6.3,0.00,-0.05,0.00,-5.1,-5.0,-5.0,5.0,5.0,5.0,2,0.0,;40.4,0.0,0.7,3.0,0.00,0.03,0.00,-6.0,-6.4,-6.4,6.4,6.4,6.4,3,0.0,;<<<>>>
... (output here is ok)
<<<Step5491 Time:2.996780>>>9.1,0.0,-0.8,0.1,0.00,0.00,0.00,-4.7,-4.7,-4.7,3.5,4.7,4.7,1,0.0,;16.0,0.0,-1.7,0.6,0.00,0.01,0.00,-3.5,-4.7,-4.7,3.2,4.7,4.7,1,0.0,;22.5,0.0,-1.5,10.5,0.00,-0.09,0.00,-3.2,-4.7,-4.7,3.5,4.7,4.7,1,0.0,;30.8,0.0,0.1,7.5,0.00,-0.07,0.00,-4.9,-5.0,-5.0,5.0,5.0,5.0,2,0.0,;40.8,0.0,0.3,1.3,0.00,-0.01,0.00,-5.1,-5.3,-5.3,5.3,5.3,5.3,3,0.0,;<<<>>>
0) Simulation 0 ends: bot.vxd Time: 3.000054, angleSampleTimes: 15.
Running simulation locally by default.
./vx3_node_worker -i workspace/locally/20200722144311.vxt -o workspace/locally/20200722144311.vxr

Here are the bot and base files used for one example experiment: bot.txt, base.txt

We also have a Google Colab setup if that helps you.

skriegman commented 4 years ago

Please send the script or commands you used to run this robot

ai-honzik commented 4 years ago

This is our google colab script:

!git clone https://github.com/voxcraft/voxcraft-sim.git; cd voxcraft-sim/;

print("Source code downloaded.")

!cd voxcraft-sim; rm build -rf; mkdir build; cd build; cmake ..; make -j 10;

print("Executables built.")

!git clone https://github.com/ai-honzik/sr2020.git

!cd sr2020; cp ../voxcraft-sim/build/voxcraft-sim ../voxcraft-sim/build/vx3_node_worker . && echo "Basic setup done"

#count how much time is left to run the experiment
import time, psutil
uptime = time.time() - psutil.boot_time()
remain = 11*60*60 - uptime #for how many seconds should the experiment run?

if remain < 0:
  raise Exception("Not enough time to run the experiment!")

with open( 'sr2020/time.left', 'w' ) as f:
  f.write( str( int( remain ) ) )

#run the experiment
!cd sr2020 && rm *.dat; timeout "$(cat time.left)" python experiment.py

#zip the experiment folder
!zip -r sr2020.zip sr2020
#download experiment files
from google.colab import files
files.download('sr2020.zip')

skriegman commented 4 years ago

Please send the script (experiment.py) you used to run this robot

ai-honzik commented 4 years ago

Here's the experiment.py (it also corresponds to what is in the repo).

#!/usr/bin/env python3

from evosorocore2.Simulator import default_sim
from evosorocore2.Environment import default_env
from Utils.fitness import Distance
from Utils.VXA import VXA
from Utils.tools import create_folders,file_stream_handler,use_checkpoint
from Simulation_Manager import SimulationManager as SM
import map_elites.cvt as cvt_map_elites
import map_elites.common as cm_map_elites
import numpy as np
import random
import time
import logging

if __name__ == "__main__":

  #use checkpoint, eg. "200717134521"
  checkpoint = None

  seed = int( time.time() )

  random.seed( seed )
  np.random.seed( seed )

  number_of_materials = 3
  mult_arr = np.array( [ 1e6, 5, 1, 1e6, 0.1,
                         1e7, 5, 1, 1e6, 0,
                         1e6, 5, 1, 1e6, -0.1 ] )
  exp_folder = "./experiment_data"
  robot_folder = "./demo"
  logfile = "simulation.log"

  #create experiment folders
  dirs = create_folders( exp_folder, checkpoint )

  #create logger
  logger = logging.getLogger( __name__ )
  f,s = file_stream_handler( dirs["experiment"] + "/" + logfile )
  logger.addHandler( f )
  logger.addHandler( s )
  logger.setLevel( logging.DEBUG )
  logger.info( ''.join( ['-'] * 30 ) )

  #save seed and inform about using checkpoint
  logger.info( "Using seed: {0}".format( seed ) )
  if checkpoint:
    logger.info( "Using {} as a checkpoint, using seed above for next runs may not matter."\
                  .format( checkpoint ) )

  #simulator and environment parameters
  sim = default_sim.copy()
  sim["DtFrac"] = 0.5
  sim["RecordStepSize"] = 100
  sim["StopConditionFormula"] = 3
  env = default_env.copy()
  env["TempEnabled"] = True
  env["VaryTempEnabled"] = True
  env["TempAmplitude"] = 5 #14.4714
  env["TempPeriod"] = 0.2
  vxa = VXA( sim, env )

  dist_fit = Distance( dirs["simulator"], dirs["experiment"] + "/" + logfile ) #fitness function based on distance
  simulation = SM( number_of_materials, dist_fit.fitness, robot_folder, dirs["experiment"],\
                   mult_arr, vxa, log=dirs["experiment"] + "/" + logfile )

  #map elites parameters
  px = cm_map_elites.default_params.copy()
  px["parallel"] = False #voxcraft-sim may allocate quite a bit of memory for one simulation
  px["batch_size"] = 20
  px["random_init_batch"] = 20
  px["dump_period"] = 10 #if batch size is bigger, it will be used as a dump_period instead
  px["random_init"] = 0.7

  #create map elites instance (or use cached one)
  if checkpoint:
    logger.info("Loading cached Map Elites instance")
    ME, last_run = use_checkpoint( exp_folder, checkpoint )
    logger.info("Using cached Map Elites instance")
    simulation.sim_run = last_run + 1
  else:
    logger.info("Creating new Map Elites instance")
    #TODO perhaps simulator could store dim_map and dim_x?
    ME = cvt_map_elites.mapelites( 2, 5*number_of_materials, n_niches=25,
                                   max_evals=500, params=px,
                                   exp_folder=dirs["mapelites"] + "/" )

  #run map elites
  logger.info("Running Map Elites now")
  ME.compute( simulation.fitness, log_file=open(dirs["mapelites"] + "/cvt.dat", 'w' ),
              sim_log_f=dirs["experiment"] + "/" + logfile )

ai-honzik commented 4 years ago

Just to explain what we're doing for now: we're trying to generate some interesting materials for a given morphology and see which one the robot behaves best with.

Also, I don't think experiment.py really tells what we're doing exactly. We are running voxcraft-sim in SimulationManager.py:


  def run_simulator( self ):
    """
    @output: data from simulation
    Run voxcraft simulation and get data out of it.  
    """

    if self.logging:
      self.logging.info( "Running simulation #{0}".format( self.sim_run ) )

    while True: #taken from voxcraft-evo
      try:
        #TODO for vx3_node_worker when file exists (too quick simulation runs)
        #TODO formatting?
        sub.call( "./voxcraft-sim -i {0} -o {1}/sim_run{2}.xml -f > {1}/sim_run{2}.history"\
        #sub.call("cp ./dummy.xml {1}/sim_run{2}.xml"\
                  .format( self.folder_bot, self.folder_exp_data + "/simdata", self.sim_run ), shell=True ) #shell=True shouldn't be normally used
        break
      except IOError:
        if self.logging:
          self.logging.warning("IOError, resimulating.")
        pass
      except IndexError:
        if self.logging:
          self.logging.warning("IndexError, resimulating.")
        pass
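One detail of the loop above worth noting: assuming `sub` is the standard `subprocess` module, `sub.call` returns the exit code instead of raising when the command fails, so the `except` branches would not catch a failed simulator run. The sketch below is not the project's code. It is a minimal alternative that writes stdout to the history file without going through the shell and reports a non-zero exit code; the name `run_and_capture` is hypothetical.

```python
import subprocess
import sys


def run_and_capture(cmd, history_path):
    """Run `cmd` (a list of arguments), write its stdout to `history_path`,
    and return the process exit code."""
    with open(history_path, "w") as history:
        result = subprocess.run(
            cmd,
            stdout=history,          # same effect as `> file` in the shell
            stderr=subprocess.PIPE,  # keep error messages out of the history file
            text=True,
        )
    if result.returncode != 0:
        print(
            f"command failed with exit code {result.returncode}: {result.stderr.strip()}",
            file=sys.stderr,
        )
    return result.returncode
```

With the flags from the snippet above, a call might look like `run_and_capture(["./voxcraft-sim", "-i", folder_bot, "-o", out_xml, "-f"], history_path)`, where `folder_bot`, `out_xml` and `history_path` stand for the paths built in `run_simulator`.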

skriegman commented 4 years ago

Please send the history file. Are you attempting to save more than one simulation's history in a single history file?

ai-honzik commented 4 years ago

Here's the history file: sim_run0.txt

ai-honzik commented 4 years ago

No, each simulation is separate (one simulation, one history file).

skriegman commented 4 years ago

OK, so the problem is that it starts at step 3154, correct? https://youtu.be/QsBOUvRxY28

ai-honzik commented 4 years ago

Yes, and it is also missing {setting}, so you can't really play the history file in viz.
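The two symptoms described so far (no {setting} block, and the first recorded step not being step 0) can be checked automatically. The following is a minimal sketch, assuming the step markers look like the `<<<Step3154 Time:1.721566>>>` lines in the excerpt from the original post and that the settings block contains the literal text `{setting}`. The function name `check_history` and the returned keys are hypothetical.

```python
import re

# Step marker as it appears in the history excerpt, e.g. <<<Step3154 Time:1.721566>>>
STEP_RE = re.compile(r"<<<Step(\d+) Time:([0-9.]+)>>>")


def check_history(text, setting_marker="{setting}"):
    """Report whether a history file has its settings block and starts at step 0."""
    steps = [int(m.group(1)) for m in STEP_RE.finditer(text)]
    has_setting = setting_marker in text
    return {
        "has_setting": has_setting,
        "first_step": steps[0] if steps else None,
        "last_step": steps[-1] if steps else None,
        # Treated as complete only if the header is present and recording starts at step 0.
        "complete": has_setting and bool(steps) and steps[0] == 0,
    }
```

Running `check_history(open("sim_run0.history").read())` after each simulation would make it possible to log or re-run incomplete histories; the file name here is only an example.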

liusida commented 4 years ago

I think @ai-honzik might be right: the output has been cut off by some buffer size limit, though I am not sure which piece of code cuts it. I ran the base.txt and rob.txt on my laptop, and it produced the full history file.

https://youtu.be/7sXHs0yf5Ks

skriegman commented 4 years ago

@ai-honzik have you experienced this issue outside google colab (free version)?

ai-honzik commented 4 years ago

@skriegman I have not, but @jrieffel has.

The reason we're using Google Colab is that some of us do not have an NVIDIA GPU.

ai-honzik commented 4 years ago

@liusida I wouldn't say the problem is in the code itself, but rather in when the output gets flushed.
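As a general illustration of how flush timing can change what ends up in a redirected file: in the excerpt from the original post, "Running simulation locally by default." appears after the line reporting the end of the simulation. The sketch below shows one mechanism that can produce this kind of reordering, using a small Python child process rather than voxcraft-sim itself. Whether the same mechanism is at work in the simulator is an assumption, and `redirected_output_order` is a hypothetical name.

```python
import os
import subprocess
import sys
import tempfile

# Child program: the first line goes through the buffered stdout object,
# the second is written directly to file descriptor 1.
CHILD = r'''
import os
print("written first, but buffered")
os.write(1, b"written second, unbuffered\n")
'''


def redirected_output_order():
    """Run the child with stdout redirected to a file and return the lines as stored."""
    # Drop PYTHONUNBUFFERED so the child uses its default (block) buffering for files.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.txt")
        with open(path, "w") as out:
            subprocess.run([sys.executable, "-c", CHILD], stdout=out, env=env, check=True)
        with open(path) as f:
            return f.read().splitlines()


print(redirected_output_order())
```

When stdout is a file rather than a terminal, the buffered line is only written when the buffer is flushed (here, at exit), so the order in the file differs from the order in which the program produced the text.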

jrieffel commented 4 years ago

I have replicated the problem on my (non-colab) GPU too

skriegman commented 4 years ago

@ai-honzik https://github.com/voxcraft/voxcraft-sim/blob/master/readme_before_reporting_issues.md

davidmatthews1uvm commented 4 years ago

I ran the robot file provided on google colab (Tesla T4). I also ran the files on my GPU (GTX 1650 Super). In both cases printing of time steps began at step 0 and ended at step 5491. I compiled commit 0a0512c585bfd61017ed20804596070dc279c675.

I recommend you try to compile in debug mode and use CUDA-gdb to further debug this.

jrieffel commented 4 years ago

I have been able to replicate this on my desktop (NVIDIA Corporation GM107GL [Quadro K1200] (rev a2)), and on a colab notebook. I'm trying now on a colab-pro notebook (with better GPU) and will let you know the results.
