tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0

What might be the reasons that sometimes Tensorflow can be started and sometimes not with the same code? #5575

Closed PBerit closed 2 years ago

PBerit commented 2 years ago

Hi all,

I just started using Tensorboard. I have a Python script (about reinforcement learning) that I run. Then I enter the following commands in the Spyder console:

load_ext tensorboard
tensorboard --logdir logs --host localhost --port 8155

Then I just use my browser (Firefox) and type in: http://localhost:8155

The strange thing is that sometimes Tensorboard is shown correctly in the browser, but sometimes I just get an error message that the destination can't be reached. In that case it may help to change the port number, but this does not always solve the problem. Sometimes even changing the port number 10 times does not start Tensorboard. Although nothing changes in the code or in the commands for starting Tensorboard (and the same computer and browser are used), sometimes Tensorboard starts immediately on the first port (8155), sometimes I have to try several port numbers before it starts, and sometimes it does not start at all.

Can any of you think of a possible explanation for this behaviour?

Here is the code that I use:

from stable_baselines3 import A2C
import os
import tensorflow as tf
from datetime import datetime

env = DSM_BT1_Env()
models_dir = "models/A2C"
logdir = "logs/RL_BT1"

if not os.path.exists(models_dir):
    os.makedirs(models_dir)

if not os.path.exists(logdir):
    os.makedirs(logdir)

model = A2C('MlpPolicy', env, verbose=1, tensorboard_log=logdir)

timesteps = 10000

print("\n \n \n \n Training \n \n \n")

#train and save the model
numberOfEpisodes = 1 
for i in range(numberOfEpisodes):
    timeStarted = datetime.now().strftime("%d-%m-%Y--%H-%M-%S")
    model.learn(total_timesteps=timesteps, reset_num_timesteps=False, tb_log_name=f"A2C_Started_{timeStarted}_Episode_{i+1}_Timesteps_{timesteps}")
    model.save(f"{models_dir}/Started_{timeStarted}_Episode_{i+1}_Timesteps_{timesteps}")

pindinagesh commented 2 years ago

@PBerit

In order to expedite the troubleshooting process, could you please provide complete code to reproduce the issue reported here? While trying to reproduce it, I got an error at env = DSM_BT1_Env(): NameError: name 'DSM_BT1_Env' is not defined. Thanks!

PBerit commented 2 years ago

@pindinagesh : Thanks for your answer pindinagesh. Actually, env = DSM_BT1_Env() is a custom OpenAI Gym environment for reinforcement learning. It has more than 800 lines of code, which is why I did not want to post it. But the environment itself is not the problem. The problem is that Tensorboard sometimes starts and sometimes does not. Today, for example, it started immediately with the commands posted above (and now that it has started I don't want to close it, because then I might not be able to start it again). I am pretty sure that I will have problems in future attempts to start it.

bileschi commented 2 years ago

When using TensorBoard in notebook environments (like Spyder), TensorBoard attempts to reuse existing instances rather than starting a new instance every time it is invoked.

See details in https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks

The same TensorBoard backend is reused by issuing the same command. 
If a different logs directory was chosen, a new instance of TensorBoard would be opened.
Ports are managed automatically.

You may be able to address your issue by

  1. Using a new log directory to force the notebook to start a new TB instance
  2. Using your system console to kill the running TensorBoard instance, in which case the notebook will start a new one.
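
The first option could be sketched as follows. This is only an illustration, not part of TensorBoard: the helper name `make_run_logdir` and the `run_` prefix are hypothetical, and the helper does nothing more than create a fresh, uniquely named directory that could then be used as the log directory.

```python
import os
import tempfile
from datetime import datetime


def make_run_logdir(base_dir):
    """Create and return a new, uniquely named log directory under base_dir.

    Hypothetical helper for illustration only.
    """
    os.makedirs(base_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    # mkdtemp appends a random suffix, so two calls within the same second
    # still produce two different directories.
    return tempfile.mkdtemp(prefix=f"run_{stamp}_", dir=base_dir)


if __name__ == "__main__":
    print(make_run_logdir("logs"))
```

The returned path would then be passed both to the training code and to `--logdir`.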

PBerit commented 2 years ago

@bileschi : Thanks bileschi for your comment. Actually, even after I have restarted my computer, Tensorboard sometimes does not start although I use exactly the same code and the same port number. What I then have to do is cycle through different port numbers. Sometimes the second number works, but sometimes I need to go through 20 port numbers. This is pretty strange and inconvenient.

PBerit commented 2 years ago

@bileschi : Thanks for your answer bileschi. Any comments on my last comment? Do you have an idea why I always have to cycle through several port numbers (1 to 20), even after restarting the computer, to start Tensorboard? I would highly appreciate any further comment from you.

bileschi commented 2 years ago

Hi @PBerit , I'm sorry I don't really know. I suspect it may have something to do with the Spyder environment, but I would need to reproduce locally to be more confident. Unfortunately the TensorBoard team does not have the resources to guarantee support for Spyder. Do you know if you can reproduce the problem in Jupyter?

As a workaround, do you have access to the location where Spyder writes the log files? If so, you can run TensorBoard externally from the Spyder environment, from your own console, which may give you more control and stability. Another option is to try uploading to tensorboard.dev, the hosted solution.
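
A minimal sketch of running TensorBoard outside the IDE, as its own process, might look like this. It assumes the `tensorboard` package is installed for the same Python interpreter; the function names are hypothetical, and the flags are the ones already used earlier in this thread (`--logdir`, `--host`, `--port`).

```python
import subprocess
import sys


def build_tensorboard_command(logdir, host="localhost", port=8155):
    """Return the argument list that starts TensorBoard via `python -m`."""
    return [
        sys.executable, "-m", "tensorboard.main",
        "--logdir", logdir,
        "--host", host,
        "--port", str(port),
    ]


def launch_tensorboard(logdir, host="localhost", port=8155):
    """Start TensorBoard as a separate process and return the Popen handle.

    Requires tensorboard to be installed; not called on import.
    """
    return subprocess.Popen(build_tensorboard_command(logdir, host, port))


if __name__ == "__main__":
    print(" ".join(build_tensorboard_command("logs")))
```

Keeping the returned `Popen` handle makes it possible to stop that exact process later with `terminate()`, without looking up a PID.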

PBerit commented 2 years ago

@bileschi : Thanks for your answer and effort bileschi. Okay, then I will just continue to cycle through different port numbers to start Tensorboard.

PBerit commented 2 years ago

Hi all,

I tried using another IDE (PyCharm), as bileschi suspected that the problem is caused by Spyder. But this is not the case. However, with PyCharm I get a little more information when I use a port number on which Tensorboard fails to start. I get this output in the console: "Reusing TensorBoard on port 8111 (pid 10180), started 0:23:16 ago. (Use '!kill 10180' to kill it.) Please visit http://localhost:8111 in a web browser." This means that I have already used this port number. Unfortunately, the instructions don't work. When I type in kill 10180 I get the error message "SyntaxError: invalid syntax". When I type '!kill 10180' I get the output "'!kill 10180'", but this does not change anything (I think the second command is treated as a string). Do you have any idea how I can "kill" the process on that port number to make it available for Tensorboard?
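
One way to check whether anything is actually listening on a port such as 8111 is a small socket probe. This is a generic sketch using only the Python standard library, not something TensorBoard provides; the function name `is_port_in_use` is hypothetical.

```python
import socket


def is_port_in_use(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for candidate in (8111, 8155):
        print(candidate, is_port_in_use(candidate))
```

If the probe reports that nothing is listening while the console still says "Reusing TensorBoard on port ...", the reported instance is probably no longer running.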

PBerit commented 2 years ago

Any comments on my last comment?

bileschi commented 2 years ago

You may be able to kill it from the terminal command line, rather than through the python notebook?

PBerit commented 2 years ago

@bileschi : Thanks for your answer bileschi. I tried what you suggested but I also get an error message:

PS C:\Users\User1\Python> !kill 4048
!kill : The term "!kill" is not recognized as the name of a cmdlet, function, script file, or executable program. Check the spelling of the name, or whether the path is correct (if one was included),
and try again.
At line:1 char:1
+ !kill 4048
+ ~~~~~
    + CategoryInfo          : ObjectNotFound: (!kill:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

The error message (translated here from German) says that the command !kill is not recognized as the name of a cmdlet, a function, a script file, or an executable program.

PBerit commented 2 years ago

@bileschi: Thanks for your answers bileschi. Any comments on my last comment? I would highly appreciate any further comment from you.

bileschi commented 2 years ago

Responding to comment from 6 days ago, I think the issue may be that you tried the following two commands within your IDE:

kill 10180
'!kill 10180'

But the one that you want to try is

!kill 10180 (yes exclamation point, no quotes)

--

FYI, I think the Windows analog of killing a process on Unix is the 'taskkill' command.

https://superuser.com/questions/959364/on-windows-how-can-i-gracefully-ask-a-running-program-to-terminate

However, I'm not sure whether the PyCharm IDE exposes the processes it starts as separate Windows processes or keeps them all wrapped in its own runtime.

PBerit commented 2 years ago

@bileschi : Thanks bileschi for your answer and effort. I really appreciate it. Unfortunately I get an error message when using your suggested code in the PyCharm console: !kill 10180 (with exclamation point and no quotes). It says (translated) "the statement kill is either wrongly spelled or could not be found"

PBerit commented 2 years ago

@bileschi : Thanks for your answers bileschi. Any comments on my last comment? I would highly appreciate any further comment from you.

bileschi commented 2 years ago

What is happening here is that the exclamation point prefix issues a command to be run in the shell (command line). The command kill exists on Unix to terminate processes by ID. Hence, on Unix, kill 10180 would terminate the process with ID 10180. Your system is running Windows, which does not have the kill command. It has taskkill instead, so you should look at the manual for that command to determine how to use it. I am not a Windows user, so I'm not sure how to use it correctly. It may be as simple as !taskkill /pid 10180 , but I may be missing something.
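
The platform difference described above can be sketched in Python. This is an illustration under assumptions: the helper names are hypothetical, and `/F` (force termination) is added to the Windows command, which may or may not be what you want. It only helps if the process with that ID is actually still running.

```python
import platform
import subprocess


def build_kill_command(pid, system=None):
    """Return the OS-specific command that terminates process `pid`."""
    system = system or platform.system()
    if system == "Windows":
        # taskkill is the Windows tool; /F forces termination.
        return ["taskkill", "/PID", str(pid), "/F"]
    # kill exists on Unix-like systems (Linux, macOS).
    return ["kill", str(pid)]


def kill_process(pid):
    """Run the kill command and return its exit code (0 means success)."""
    return subprocess.run(build_kill_command(pid), check=False).returncode


if __name__ == "__main__":
    print(build_kill_command(10180))
```

Calling the command through `subprocess` avoids the `!` prefix entirely, which only works in IPython-style consoles.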

PBerit commented 2 years ago

@bileschi : Thanks for your answer. Unfortunately your suggested command does not work.

But maybe someone else in the forum who has experience with Windows can answer my question. I tried several commands (from this website https://winaero.com/kill-process-windows-10/#Kill_a_process_using_PowerShell), with and without the exclamation point, and none of them worked:

!taskkill /pid 10180
!taskkill /F /pid 10180
!taskkill 10180
!taskkill pid 10180
!Stop-Process -ID 10180 -Force
!Stop-Process -pid 10180 -Force

Would anyone mind telling me how to kill the process such that I can start Tensorboard again on the same port? I'll appreciate every comment.

PBerit commented 2 years ago

Any comments on my last comment? Can anyone help me with how to kill a process in Windows 10 (more specifically, how to kill the TensorBoard process that blocks a specific port)? I would highly appreciate any further comment.

PBerit commented 2 years ago

Does nobody have an idea? I'll appreciate every comment.

PBerit commented 2 years ago

Would anyone mind telling me how to kill the process such that I can start Tensorboard again on the same port? I'll appreciate every comment.

bileschi commented 2 years ago

Perhaps StackOverflow can help you with your question about finding a running process and killing it from the command line?

PBerit commented 2 years ago

I just wanted to mention that the initial problem (sometimes Tensorboard can be started and sometimes not) is in fact related to using Tensorboard on Windows. So it is also a problem of Tensorboard itself. But there is a workaround that you can see here: https://stackoverflow.com/questions/59563025/how-to-reset-tensorboard-when-it-tries-to-reuse-a-killed-windows-pid/59582163#59582163. Thanks bileschi for your great help and the advice to ask this on StackOverflow.
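
A minimal sketch of this kind of workaround, assuming that TensorBoard keeps its record of running instances as `.info` files in a `.tensorboard-info` folder inside the system temp directory, and that stale files there cause the "Reusing TensorBoard" message. The folder location is an assumption to verify on your own machine, and `clear_tensorboard_info` is a hypothetical helper.

```python
import os
import tempfile


def clear_tensorboard_info(info_dir=None):
    """Delete *.info files in the TensorBoard info folder; return their paths.

    The default folder location is an assumption, not a documented API.
    """
    if info_dir is None:
        info_dir = os.path.join(tempfile.gettempdir(), ".tensorboard-info")
    removed = []
    if not os.path.isdir(info_dir):
        return removed  # nothing to clean up
    for name in os.listdir(info_dir):
        path = os.path.join(info_dir, name)
        if name.endswith(".info") and os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed


if __name__ == "__main__":
    print(clear_tensorboard_info())
```

Run it only when no TensorBoard instance that you still need is running, since it removes the records for all instances.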
