Runtime crashes upon executing ```!fuser -k 5000/tcp``` or ```!kill <PID of port 5000>```

jrh48k commented 6 months ago

Describe the current behavior: I'm a student at Missouri University of Science & Tech, using Colab for a computer engineering course. Many of our assignments require us to launch web servers to emulate a blockchain. We use flask-ngrok for most of these, and once you execute the code cell that launches the web app, you can't interrupt it with keyboard commands as you normally could in a terminal window. Because of this, I've tried to find other ways to kill the process.

Running !fuser 5000/tcp returns the PID associated with the port. To kill that process I've tried both the -k flag and !kill <PID>. However, either one crashes and restarts the runtime, which is annoying since I have to go back and run all my code again instead of just the cell that launches the web app.

Describe the expected behavior: I expected this to just kill the process with the associated PID and allow me to re-initialize my web app.

What web browser you are using: I use Opera GX, but I tested in Chrome and the same issue occurred.

Additional context: notebook that reproduces the issue:
https://colab.research.google.com/drive/1hDqscLA82H90cAmXixAYPZK6dpxXN-cj?usp=sharing
https://gist.github.com/jrh48k/f2e2ad3854ab30facf8dd8498e10a82a

cperry-goog commented 6 months ago

tracking internally at b/338448772

teeler commented 6 months ago

Hi there jrh48k! Since you offered up being a student we're going to take the educational route to resolving this ;) (There's a simple answer to your problem, I promise not to make this annoying)

A question back to you to nudge you in the right direction: when you run that HTTP server, which process is bound to that port?

teeler commented 6 months ago

...and a follow-up question: how would you identify the process ID of your running Python code? Is it the same for each cell, or different?

jrh48k commented 6 months ago

[Screenshot: image_2024-05-02_173200016]

The process ID changes every time I run the code from scratch. I can find the PID either with the !fuser 5000/tcp command, since I know it will be using a TCP connection, or with the netstat command from net-tools, which shows the process on that port. Netstat is also nice because it lists all the running processes. Most of the PIDs are different, but now that I'm looking at all of the TCP PIDs, I can see that there is another process with the exact same ID.

teeler commented 6 months ago

OK, so you know that the ID of the process that's listening on port 5000 in this case is "332/python", and you did find another process with that exact same ID, so the question now remains: what exactly is process 332? What else might it be doing?

You're on the right trail - 332 in this case is the Python "kernel", which is responsible for executing everything in that notebook, including maintaining a connection to the outside world (and ultimately to your browser). You can see this yourself: in any cell of that notebook, inspect the value of os.getpid() and you'll see that it matches the process that's listening on port 5000.
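
As a rough illustration of why the server and the kernel share a PID, here is a minimal sketch using only the standard library. It assumes, as described above, that the web server runs in a thread of the kernel process; `record_pid` and `result` are hypothetical names chosen for this example.

```python
import os
import threading


def record_pid(out):
    # Runs inside a worker thread, the same way a threaded web server would.
    out["thread_pid"] = os.getpid()


result = {}
worker = threading.Thread(target=record_pid, args=(result,))
worker.start()
worker.join()

main_pid = os.getpid()
same_process = result["thread_pid"] == main_pid

print("main PID:  ", main_pid)
print("thread PID:", result["thread_pid"])
print("same process:", same_process)
```

Threads do not get their own process ID, so any tool that reports "the PID listening on port 5000" is reporting the kernel itself.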

So when you kill the process that is listening on port 5000, you're also killing the kernel, which Colab recognizes as the runtime "crashing". The UI calls it a crash because under normal circumstances kernels stay running until their supervising processes reap them.

Does that make sense?

jrh48k commented 6 months ago

I guess that makes sense, but then why are there other Python processes running that aren't associated with the kernel? And if killing the kernel is what crashes the program, is there any way to circumvent it so I can free up the port whenever I want to, thus allowing me to relaunch the web app without having to restart the runtime?

teeler commented 6 months ago

> why are there other python processes not associated with the kernel running?

Excellent question - there are various other processes running that enable us to run the entire service, in addition to the standard Jupyter notebook architecture (https://docs.jupyter.org/en/latest/projects/architecture/content-architecture.html has some details).

> is there any way to circumvent it so I can free up the port whenever I want to, thus allowing me to relaunch the web app without having to restart the run time?

Absolutely, but it's not trivial. You have a few options:

- Close the socket. Your code isn't actually performing the socket operations, werkzeug is, and you're running it in a thread with no way to communicate with that thread. Inspecting the werkzeug run_simple source code will illuminate that. Instead of calling run_simple directly from the thread, you might consider calling your own function and giving yourself a way to signal that thread to stop. The only way to get werkzeug to stop seems to be a KeyboardInterrupt, so you'll need to be a little clever in how you manage that.
- Use a random port each time you start; don't pick 5000 every time. See https://pypi.org/project/portpicker/ as an example.
- Since it's a local web server, don't pick ports at all. If you're just communicating with the same process, invoke the WSGI services directly (as functions!). Why bother going through a network connection to call an HTTP server you defined in the same file?
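
One possible shape for the first two options, as a sketch rather than a drop-in fix: it swaps werkzeug for the standard library's `wsgiref` server (whose `shutdown()` method can be called from another thread) and asks the OS for a free port by binding to port 0 instead of using portpicker. `StoppableServer`, `QuietHandler`, `fetch`, and the toy `app` are hypothetical names for this example, not part of Flask, werkzeug, or Colab.

```python
import http.client
import threading
from wsgiref.simple_server import WSGIRequestHandler, make_server


def app(environ, start_response):
    # Stand-in WSGI app; a Flask app object is also a WSGI callable.
    body = b"hello from the notebook"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


class QuietHandler(WSGIRequestHandler):
    def log_message(self, format, *args):
        pass  # keep the cell output free of per-request log lines


class StoppableServer:
    """Runs a WSGI app in a background thread and keeps a handle to stop it."""

    def __init__(self, wsgi_app, host="127.0.0.1", port=0):
        # port=0 lets the OS choose any free port.
        self._httpd = make_server(host, port, wsgi_app,
                                  handler_class=QuietHandler)
        self.port = self._httpd.server_address[1]
        self._thread = threading.Thread(target=self._httpd.serve_forever,
                                        daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        if self._thread.is_alive():
            self._httpd.shutdown()   # ask serve_forever() to return
            self._thread.join()
        self._httpd.server_close()   # release the listening socket


def fetch(port):
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/")
        return conn.getresponse().read()
    finally:
        conn.close()


server = StoppableServer(app)
server.start()
first_reply = fetch(server.port)
server.stop()                        # the kernel keeps running

# The port is free again, so the app can be relaunched on it.
relaunched = StoppableServer(app, port=server.port)
relaunched.start()
second_reply = fetch(relaunched.port)
relaunched.stop()
```

The key design choice is keeping a reference to the server object, so a later cell can call `stop()` instead of killing the process that owns the port.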

You have lots of options - if a network connection is required, you need to do some better port management (i.e., closing those ports or spawning a subprocess that you can kill separately), but if I were you I'd consider not needing a port at all.
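
To make the "no port at all" idea concrete, here is a minimal sketch that calls a WSGI application as a plain function, using only the standard library. `call_wsgi` and the toy `app` are hypothetical helpers for illustration; with Flask specifically, `app.test_client()` offers a similar in-process way to issue requests.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    # Stand-in WSGI app that echoes the requested path.
    body = f"you asked for {environ['PATH_INFO']}".encode()
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]


def call_wsgi(wsgi_app, path="/", method="GET"):
    """Invoke a WSGI app in-process: no socket, no port, no server thread."""
    environ = {"PATH_INFO": path, "SCRIPT_NAME": "", "REQUEST_METHOD": method}
    setup_testing_defaults(environ)  # fills in the remaining required keys

    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    chunks = wsgi_app(environ, start_response)
    try:
        body = b"".join(chunks)
    finally:
        if hasattr(chunks, "close"):
            chunks.close()
    return captured["status"], body


status, body = call_wsgi(app, "/chain")
print(status, body)
```

Because nothing is bound to a port, there is nothing to kill between runs; re-executing the cell simply redefines the functions.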

googlecolab / colabtools

Runtime crashes upon executing ```!fuser -k 5000/tcp``` or ```!kill <PID of port 5000>``` #4535