hnyu / seditor

Code release for the paper "Towards Safe Reinforcement Learning with a Safety Editor Policy", Yu et al., arXiv 2022

Unable to Visualize in Tensorboard #3

Closed chrismartel closed 1 year ago

chrismartel commented 1 year ago

Hi, I am trying to visualize the SEditor training results on tensorboard.

I am running the following command to train my agent on PointGoal1

python3.7 -m alf.bin.train --root_dir=~/research/safe_rl/seditor/results/PointGoal1/ --conf ~/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py --conf_param="create_environment.env_name='Safexp-PointGoal1-v0'"

When I stop the process, I get the following messages:

I0621 14:00:16.835314 140342896809792 checkpoint_utils.py:301] Checkpoint 'ckpt-7220' is saved successfully.
I0621 14:00:16.835704 140342896809792 parallel_environment.py:168] Closing all processes.
I0621 14:00:16.836166 140342896809792 parallel_environment.py:171] All processes closed.

Then I navigate to my log_dir directory

cd ~/research/safe_rl/seditor/results/PointGoal1/ 

and I have the following files

alf
alf_config.py
py_train.INFO
py_train.research-vm.azureuser.log.INFO.20230621-133519.11758
py_train.research-vm.azureuser.log.INFO.20230621-133535.11766
py_train.research-vm.azureuser.log.INFO.20230621-133646.11778
py_train.research-vm.azureuser.log.INFO.20230621-134112.11894
py_train.research-vm.azureuser.log.INFO.20230621-134534.12004
py_train.research-vm.azureuser.log.INFO.20230621-135809.12244
train

In a different terminal, I run

tensorboard --logdir=~/research/safe_rl/seditor/results/PointGoal1/ --port 6006

In my browser, all TensorBoard sections are empty, including the 'Time Series' section, which I assume is where the training plots should appear.

image

Could you please provide more details about how you visualize your training results using TensorBoard? Am I using the correct directory?

Note that I get the following messages when starting TensorFlow, which all seem to be related to the fact that I don't use a GPU. It seems like I can ignore them, but maybe they could be the issue?

2023-06-21 14:17:34.463019: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/azureuser/.mujoco/mujoco210/bin
2023-06-21 14:17:34.463063: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-06-21 14:17:35.255055: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/azureuser/.mujoco/mujoco210/bin
2023-06-21 14:17:35.255218: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/azureuser/.mujoco/mujoco210/bin
2023-06-21 14:17:35.255246: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-06-21 14:17:35.994619: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/azureuser/.mujoco/mujoco210/bin
2023-06-21 14:17:35.994664: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)
2023-06-21 14:17:35.994697: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (research-vm): /proc/driver/nvidia/version does not exist

hnyu commented 1 year ago

Hi @chrismartel, under the 'train' subdirectory, do you see any file whose name starts with 'events'? Normally, you should see the training curves while the job is training, provided that a TF events file has been written to 'train'.

chrismartel commented 1 year ago

@hnyu Here is the content of my train directory

algorithm
events.out.tfevents.1687355139.research-vm.12004.0
events.out.tfevents.1687354610.research-vm.11778.0
events.out.tfevents.1687355894.research-vm.12244.0
events.out.tfevents.1687354877.research-vm.11894.0

I have a few 'events' files.

While training, I am getting some throughput:

I0621 20:36:41.685676 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2079 time=4.068 throughput=2013.92
I0621 20:36:45.756153 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2080 time=4.070 throughput=2012.77
I0621 20:36:49.782186 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2081 time=4.026 throughput=2034.98
I0621 20:36:53.883124 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2082 time=4.100 throughput=1997.81
I0621 20:36:57.960362 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2083 time=4.077 throughput=2009.43
I0621 20:37:02.038107 140667155978048 policy_trainer.py:434] /home/azureuser/research/safe_rl/seditor/alf/alf/examples/safety/seditor/seditor_safety_gym_conf.py -> results: 2084 time=4.077 throughput=2009.17

but nothing is displayed in Tensorboard:

image

However, I can see information about the runs in the 'Text' section

image

hnyu commented 1 year ago

This is a weird problem. Can you check if the event files are non-empty? Also, usually there should be only 1 event file in the TB dir. It seems that you repeatedly launched training jobs using the same TB dir. Please delete the dir and retrain.
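A quick way to check whether the event files are non-empty is to list them with their sizes. The following is only a minimal sketch using the Python standard library; the helper name `list_event_files` is made up for illustration, and the path in the example is the one from the command earlier in this thread.

```python
from pathlib import Path


def list_event_files(train_dir):
    """Return (file name, size in bytes) for each TensorBoard event file.

    `list_event_files` is a hypothetical helper, not part of ALF or TensorBoard.
    """
    root = Path(train_dir).expanduser()
    # Event files in this thread are named "events.out.tfevents.<timestamp>.<host>...".
    files = sorted(root.glob("events.out.tfevents.*"))
    return [(f.name, f.stat().st_size) for f in files]


if __name__ == "__main__":
    # Path taken from the training command in this thread; adjust as needed.
    train_dir = "~/research/safe_rl/seditor/results/PointGoal1/train"
    for name, size in list_event_files(train_dir):
        print(f"{size:>12d}  {name}")
```

A file that stays very small while training is running would suggest that little or no summary data has been written to it yet. Several files in the list would indicate that more than one job wrote to the same directory.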

One possible reason TensorBoard is empty is that no summary data has actually been written yet. This can happen if the first summary step has not been reached. By default, only 100 summaries are written over the entire training process. You can try increasing 'num_summaries' in sac_safety_gym_conf.py.
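To get a rough feel for how long the wait for the first summary can be, here is a small arithmetic sketch. It assumes that summaries are spread evenly over the training iterations, which may not match exactly how ALF derives the interval from 'num_summaries'. The total iteration count below is a hypothetical number, not taken from this thread; the per-iteration time is roughly the "time=" value in the training log above.

```python
import math


def summary_interval(total_iterations, num_summaries):
    """Iterations between summaries, ASSUMING they are spread evenly."""
    return math.ceil(total_iterations / num_summaries)


def hours_until_first_summary(total_iterations, num_summaries, seconds_per_iteration):
    """Approximate wall-clock hours before the first summary is written."""
    interval = summary_interval(total_iterations, num_summaries)
    return interval * seconds_per_iteration / 3600.0


if __name__ == "__main__":
    total_iterations = 1_000_000   # hypothetical value, for illustration only
    seconds_per_iteration = 4.07   # roughly the "time=" value in the log above
    for num_summaries in (100, 1000):
        interval = summary_interval(total_iterations, num_summaries)
        hours = hours_until_first_summary(
            total_iterations, num_summaries, seconds_per_iteration)
        print(f"num_summaries={num_summaries}: one summary every "
              f"{interval} iterations, first after about {hours:.1f} h")
```

Under this assumption, raising 'num_summaries' by a factor of ten shortens the wait for the first curve by the same factor, at the cost of larger event files.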

chrismartel commented 1 year ago

Increasing the number of summaries worked, thank you!