omron-sinicx / neural-astar

Official implementation of "Path Planning using Neural A* Search" (ICML-21)
https://omron-sinicx.github.io/neural-astar

Metrics visualization during training and evaluation #15

Closed: GigSam closed this issue 1 year ago

GigSam commented 1 year ago

On the "minimal" branch, differently from what was done in the example.ipynb file of the previous version of the repo (the one without pytorch lightning, similar to the branch "3-improve-repository-organization"), it seems that you don't use the logs of the Opt, Exp and Hmean metrics when the training is performed. I would like to visualize those metrics, but the "metrics" folder isn't created by running the train.py script. Thank you for your support.

yonetaniryo commented 1 year ago

Hi, thank you for your post!

If you open TensorBoard, you can see the progress of those metrics (p_opt, p_exp, and h_mean), as shown here: https://github.com/omron-sinicx/neural-astar/issues/4#issuecomment-1356944700. Is this what you are looking for?
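
For reference, here is a minimal sketch (not code from the repo) of how the logged scalars can be read back programmatically with the TensorBoard Python API, e.g. for custom plots. The log directory below is an assumption based on the default layout mentioned later in this thread, and the scalar tags should be taken from whatever names TensorBoard actually shows.

```python
# Sketch: read Neural A* training scalars from the Lightning/TensorBoard event files.
# Assumptions: logs live under model/<dataset>/lightning_logs/version_*, and the
# metric tags match what is displayed in TensorBoard (check ea.Tags() first).
from pathlib import Path

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_dir = sorted(Path("model/mazes_032_moore_c8/lightning_logs").glob("version_*"))[-1]
ea = EventAccumulator(str(log_dir))
ea.Reload()

print("available scalar tags:", ea.Tags()["scalars"])
for tag in ea.Tags()["scalars"]:
    events = ea.Scalars(tag)
    print(tag, [(e.step, round(e.value, 4)) for e in events][:5], "...")
```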

luigidamico100 commented 1 year ago

Hi 😀 I am facing the same issue: I am not able to see the training progress (loss and metrics) because no log files are generated. Is this normal?

Thank you!

yonetaniryo commented 1 year ago

Thank you! At least when I was working on https://github.com/omron-sinicx/neural-astar/pull/9, all the metrics were logged as intended. I will look into it.

yonetaniryo commented 1 year ago

Hi! I've been investigating this issue but am having difficulty reproducing it. If I clone the repository, create a venv, and run train.py, the metrics are logged to TensorBoard as follows.

[screenshot: TensorBoard scalar plots showing the logged metrics]

My environment is:

I will try other environments and module versions, but would it be possible to share your environment and the versions of the related modules (maybe the tensorboard and pytorch-lightning versions matter?) that cause this logging issue? Or did you get any warning messages about logging failures? @GigSam @luigidamico100

Thank you!
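
If it helps, here is a small sketch (not repo code) for collecting those versions; the package names are my guesses for what is relevant to this thread.

```python
# Sketch: print the Python and package versions relevant to the logging issue.
import platform
from importlib.metadata import PackageNotFoundError, version

print("python", platform.python_version())
for pkg in ("torch", "pytorch-lightning", "tensorboard"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```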

GigSam commented 1 year ago

@yonetaniryo my environment is:

The problem is that after cloning the repo, creating and activating a venv, and running train.py, I don't see any "metrics" folder or any log produced by the training script, even though the algorithm works fine and no logging warnings are produced. I really don't know what's causing this issue.

yonetaniryo commented 1 year ago

Thank you for sharing your environment. I just wanted to make sure: for mazes_032_moore_c8, the logs are stored in model/mazes_032_moore_c8/lightning_logs/version_*, not in a metrics folder. Also, on GitHub we keep only the checkpoint in model/mazes_032_moore_c8/lightning_logs/version_0, to reduce the repository size. When you clone the repo and start training, the following directory and files should appear:

model/mazes_032_moore_c8/lightning_logs/version_1:
checkpoints  events.out.tfevents....  hparams.yaml
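
If it is easier to check programmatically, the following sketch (not part of the repo) lists whatever Lightning has written under that directory; the path is the default one shown above, so adjust it if you changed the config.

```python
# Sketch: verify that Lightning wrote TensorBoard event files for each training run.
from pathlib import Path

log_root = Path("model/mazes_032_moore_c8/lightning_logs")  # default location shown above
for version_dir in sorted(log_root.glob("version_*")):
    event_files = [f.name for f in version_dir.glob("events.out.tfevents.*")]
    print(version_dir.name, "->", event_files or "no event files found")
```
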
yonetaniryo commented 1 year ago

I have checked the logging in an environment as close as possible to that of @GigSam, with python 3.10.9 and tensorboard==2.10.1. However, I am still not able to reproduce the issue. Can you double-check whether the logs are stored in the model directory? Or you may try using our Dockerfile, which will give us exactly the same environment. Thank you!
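
As a debugging fallback (this is not what train.py does, just a sketch of an alternative), you could attach a CSVLogger next to the default TensorBoard logger so the metrics are also written to a plain metrics.csv file. The save_dir and name values below simply mirror the directory layout above, and the actual Trainer construction in train.py may differ.

```python
# Sketch: log metrics to both TensorBoard and a CSV file for easier inspection.
import pytorch_lightning as pl
from pytorch_lightning.loggers import CSVLogger, TensorBoardLogger

loggers = [
    TensorBoardLogger(save_dir="model", name="mazes_032_moore_c8"),
    CSVLogger(save_dir="model", name="mazes_032_moore_c8_csv"),
]
trainer = pl.Trainer(max_epochs=1, logger=loggers)
# trainer.fit(module, datamodule)  # train as usual; metrics.csv appears under the CSVLogger dir
```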

yonetaniryo commented 1 year ago

Sorry, but I am going to close this issue because I cannot reproduce the logging problem. If someone encounters the same problem, please check whether the metrics data are stored in the model directory, and please don't hesitate to re-open the issue if you can reproduce it. Thank you for the report!