Closed chrismartel closed 1 year ago
Hi @chrismartel, under the 'train' subdirectory, do you see any file whose name starts with 'events'? Normally, you should see the training curves while the job is training, provided that a TF events file has been written to 'train'.
@hnyu Here is the content of my train directory
algorithm
events.out.tfevents.1687355139.research-vm.12004.0
events.out.tfevents.1687354610.research-vm.11778.0
events.out.tfevents.1687355894.research-vm.12244.0
events.out.tfevents.1687354877.research-vm.11894.0
I have a few 'events' files
While training, I am getting some throughput:
I0621 20:36:41.685676 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2079 time=4.068 throughput=2013.92
I0621 20:36:45.756153 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2080 time=4.070 throughput=2012.77
I0621 20:36:49.782186 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2081 time=4.026 throughput=2034.98
I0621 20:36:53.883124 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2082 time=4.100 throughput=1997.81
I0621 20:36:57.960362 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2083 time=4.077 throughput=2009.43
I0621 20:37:02.038107 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2084 time=4.077 throughput=2009.17
but nothing is displayed in Tensorboard:
However, I can see information about the runs in the 'Text' section
This is a weird problem. Can you check if the event files are non-empty? Also, usually there should be only 1 event file in the TB dir. It seems that you repeatedly launched training jobs using the same TB dir. Please delete the dir and retrain.
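A quick way to run this check is a small script that lists the event files and their sizes. This is only a sketch: the helper names `list_event_files` and `report` are made up for illustration, and the path in the usage line assumes the 'train' subdirectory sits under the log dir passed to TensorBoard in this thread.

```python
import glob
import os


def list_event_files(tb_dir):
    """Return sorted (filename, size_in_bytes) pairs for TF events files in tb_dir."""
    pattern = os.path.join(os.path.expanduser(tb_dir), "events.out.tfevents.*")
    return sorted((os.path.basename(p), os.path.getsize(p)) for p in glob.glob(pattern))


def report(tb_dir):
    """Print each event file with its size and flag empty or duplicate files."""
    files = list_event_files(tb_dir)
    if not files:
        print("no event files found in", tb_dir)
    for name, size in files:
        flag = "EMPTY" if size == 0 else ""
        print(f"{size:>12d}  {name}  {flag}")
    if len(files) > 1:
        # More than one file usually means several runs shared the same directory.
        print(f"{len(files)} event files found; consider deleting the dir and retraining")
    return files


if __name__ == "__main__":
    # Assumed location, adjust to your own log dir.
    report("~/research/safe_rl/seditor/results/PointGoal1/train")
```

Note that a non-zero size only shows that something was written; a file can still contain no scalar summaries if no summary step has been reached.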
One possible reason why TensorBoard is empty is that no summary data has actually been written yet. This can happen if the first summary step has not been reached. By default, there are only 100 summary events over the entire training process. You can try increasing 'num_summaries' in sac_safety_gym_conf.py.
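To get a feel for why a small number of summaries can leave TensorBoard empty for a long time, here is a back-of-the-envelope sketch. It assumes summaries are spread evenly over training (one every `total_iterations / num_summaries` iterations), which may not match ALF's exact scheduling, and that the `time=` field in the log is seconds per iteration. The function name and the `total_iterations` value are hypothetical.

```python
import math


def first_summary_eta(total_iterations, num_summaries, seconds_per_iteration):
    """Estimate the summary interval and the wall-clock wait until the first summary.

    Assumes evenly spaced summaries; the real trainer may schedule them differently.
    """
    interval = max(1, math.ceil(total_iterations / num_summaries))
    return interval, interval * seconds_per_iteration


if __name__ == "__main__":
    total_iterations = 100_000  # hypothetical, not taken from the actual config
    for num_summaries in (100, 1000):
        interval, eta = first_summary_eta(total_iterations, num_summaries, 4.0)
        print(f"num_summaries={num_summaries}: every {interval} iterations, "
              f"first summary after about {eta / 60:.0f} min")
```

Under these assumptions, raising 'num_summaries' by a factor of ten shortens the wait for the first data point by the same factor.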
Increasing the number of summaries worked, thank you!
Hi, I am trying to visualize the SEditor training results on tensorboard.
I am running the following command to train my agent on PointGoal1
When I stop the process, I get the following messages:
Then I navigate to my log_dir directory
and I have the following files
On a different terminal I run
tensorboard --logdir=~/research/safe_rl/seditor/results/PointGoal1/ --port 6006
In my browser, all Tensorboard sections are empty, including the 'Time Series' section, which I assume is where I should be able to see the training plots.
Could you please provide more details about how you proceed to visualize your training results using tensorboard? Am I using the correct directory?
Note that I get the following errors when starting Tensorflow which all seem to be related to the fact that I don't use a GPU. It seems like I can ignore those but maybe that could be the issue?