jupyterhub / the-littlest-jupyterhub

Simple JupyterHub distribution for 1-100 users on a single server
https://tljh.jupyter.org

Failing suspected API request to not-running server: /hub/user/<user>/metrics #427

Open masaok opened 5 years ago

masaok commented 5 years ago

About 50 student users running on Ubuntu:

DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"

I'm getting tons of logs in /var/log/syslog that look like this:

Aug 29 06:25:07 jupyter1 python3[764]: [E 2019-08-29 06:25:07.113 JupyterHub log:174] 503 GET /hub/user/msalman2/metrics (msalman2@XXX.XXX.XXX.XXX) 10.43ms
Aug 29 06:25:12 jupyter1 python3[764]: [I 2019-08-29 06:25:12.090 JupyterHub log:174] 302 GET /user/msalman2/metrics -> /hub/user/msalman2/metrics (@XXX.XXX.XXX.XXX) 1.26ms
Aug 29 06:25:12 jupyter1 python3[764]: [W 2019-08-29 06:25:12.110 JupyterHub base:1175] Failing suspected API request to not-running server: /hub/user/msalman2/metrics

By "tons", I mean 17K+ 503 and 302 errors in a day.

However, when I log in via a browser and check /user/studentuser/metrics, it returns correctly:

{
rss: 122232832,
limits: { }
}

With a 200 in the logs.
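
The same check can be scripted instead of done through a browser; a minimal sketch using the requests library, where HUB_URL, TOKEN and USER are placeholders for your hub address, a JupyterHub API token with access to that user's server, and the username:

# Fetch the user server's /metrics endpoint with an API token.
# HUB_URL, TOKEN and USER are placeholders for this sketch.
import requests

HUB_URL = "https://jupyter.example.org"
TOKEN = "replace-with-a-jupyterhub-api-token"
USER = "studentuser"

r = requests.get(
    f"{HUB_URL}/user/{USER}/metrics",
    headers={"Authorization": f"token {TOKEN}"},
    allow_redirects=False,  # a 302 to /hub/user/... means no running server route
)
print(r.status_code)
if r.ok:
    print(r.json())  # e.g. {"rss": 122232832, "limits": {}}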

I've tried this as an admin user and as a normal user, and everything looks normal in the logs, so I'm not sure what is causing this error when other users are using the site.

consideRatio commented 3 years ago

/metrics is an endpoint to track usage as reported by the started server. Typically, a service like Prometheus would scrape such endpoints for data, to be presented later in Grafana etc.

I'm not sure anything inside a TLJH deployment should be scraping that endpoint by default. Assuming that isn't the case, perhaps something we configure when we start the user servers makes some external scraping service attempt to scrape them?

To understand the logs, this is what happens:

  1. /metrics is requested from the user server at /user/<username>/metrics.
  2. The user server isn't running, or at least the Traefik proxy has no route for it, so the request falls through to the hub, which answers with a 302 redirect from /user/<username>/metrics to /hub/user/<username>/metrics.
  3. The hub recognizes this as a failed attempt to reach something on a user server that isn't running and responds with 503. Note that this error code becomes 424 on modern versions of JupyterHub; see the config sketch after this list.
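
On that last point: JupyterHub 2.0 and later return 424 by default for these requests, and there is a traitlet to switch back to 503 if something downstream still expects it. A sketch, assuming a TLJH setup where extra hub config is dropped into /opt/tljh/config/jupyterhub_config.d/:

# e.g. /opt/tljh/config/jupyterhub_config.d/stopped_server_status.py
# On JupyterHub >= 2.0 these requests return 424 instead of 503; setting
# this traitlet restores the legacy 503 behaviour.
c.JupyterHub.use_legacy_stopped_server_status_code = True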

lahwaacz commented 1 year ago

I'm getting a lot of similar logs on JupyterHub 4.0.2 for the /jupyter/hub/user/<username>/api/kernels/<uuid> endpoint:

[I 2023-09-27 23:13:22.744 JupyterHub log:191] 302 GET /jupyter/user/<username>/api/kernels/<uuid>?1695849202650 -> /jupyter/hub/user/<username>/api/kernels/<uuid>?1695849202650 (@<ip>) 0.90ms
[W 2023-09-27 23:13:22.754 JupyterHub base:1444] Failing suspected API request to not-running server: /jupyter/hub/user/<username>/api/kernels/<uuid>
[W 2023-09-27 23:13:22.755 JupyterHub log:191] 424 GET /jupyter/hub/user/<username>/api/kernels/<uuid>?1695849202650 (<username>@<ip>) 3.76ms

I'm fairly certain the users don't run any scraper, so where does this come from? Could it be that the user server was stopped due to inactivity and the client application keeps sending requests over and over, waiting for the user to start the server again?

minrk commented 1 year ago

@lahwaacz yes, most likely a user has a JupyterLab tab open after their server has been shut down. JupyterLab is known to be overzealous at polling the server, especially after it has become unavailable.

The /metrics endpoint is used by jupyter-resource-usage (it was deprecated quite some time ago in favor of /metrics/v1/...), so some packages might be out of date.

This particular poll is created here, with no max poll interval specified, so it uses the default of 30 seconds. That means that while a server is stopped, you will get these errors every 30 seconds from every browser tab left open and pointing at a not-running server. It should only poll while the window is focused, though, so it's a bit odd that you're seeing so many: that would suggest focused JupyterLab windows are open in the foreground on computers with no running server behind them. If the focus check isn't happening, it may be because things are out of date.

jupyter-resource-usage could set a longer max poll interval so you get these less often, or it could add logic to stop polling altogether after too many failures and wait for a manual restart.
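
Purely as an illustration of those two ideas (a poll interval capped at roughly 30 seconds, plus giving up after repeated failures), a rough Python sketch follows; the real JupyterLab poll is TypeScript built on @lumino/polling, and none of these names or numbers come from it:

# Illustrative only: capped exponential backoff plus a failure cutoff.
import time
import requests

def poll_kernels(url, base=1.0, max_interval=30.0, max_failures=20):
    """Poll url, backing off to max_interval; stop after repeated failures."""
    interval = base
    failures = 0
    while failures < max_failures:
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
            failures = 0
            interval = base                      # success: reset backoff
            yield r.json()
        except requests.RequestException:
            failures += 1
            interval = min(interval * 2, max_interval)  # cap at ~30 s
        time.sleep(interval)
    # Too many consecutive failures (e.g. the server was culled):
    # stop polling and wait for an explicit restart.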

lahwaacz commented 1 year ago

@minrk Can you point to some JupyterLab issues about this behavior?

minrk commented 1 year ago

https://github.com/jupyterlab/jupyterlab/issues/3929 is the main one

Can you share your user and hub environments? I don't think /hub/user/<user>/metrics should be hit at all if packages are up to date, so maybe updating will fix the polling problem as well.
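
A quick way to dump the relevant versions from the environment, assuming the usual PyPI package names (distro packages may be named differently):

# Print versions of the Jupyter packages relevant to this thread.
# Package names assume PyPI/pip naming.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("jupyterhub", "jupyterlab", "jupyter-server",
            "notebook", "jupyter-resource-usage"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")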

lahwaacz commented 1 year ago

@minrk I'm not getting hits on /hub/user/<user>/metrics, only /jupyter/hub/user/<username>/api/kernels/<uuid>.

I run JupyterHub on Arch Linux where the package names and versions are the following (most likely different from pip or other package managers):

jupyter-bash_kernel 0.9.0-2
jupyter-gnuplot_kernel 0.4.1-6
jupyter-lsp 2.2.0-2
jupyter-metakernel 0.30.1-1
jupyter-nbclient 0.8.0-1
jupyter-nbconvert 7.8.0-1
jupyter-nbformat 5.9.2-1
jupyter-nbgitpuller 1.2.0-2
jupyter-nbgrader-git v0.9.1.r1.g2ef44515-1
jupyter-notebook 7.0.4-1
jupyter-notebook-shim 0.2.3-2
jupyter-server 2.7.3-1
jupyterhub 4.0.2-1
jupyterhub-idle-culler 1.2.1-2
jupyterhub-systemdspawner 1.0.1-1
jupyterlab 4.0.6-1
jupyterlab-pygments 0.2.2-1
jupyterlab-pytutor 0.2.0-2
jupyterlab-rise 0.41.0-1
python-jupyter-client 8.3.1-1
python-jupyter-core 5.3.2-1
python-jupyter-events 0.7.0-1
python-jupyter-server-terminals 0.4.4-4
python-jupyter_telemetry 0.1.0-2
python-jupyterlab-server 2.25.0-1

minrk commented 1 year ago

Sorry, I mixed up your report with the OP's, which reported /metrics.

The kernels-list polling is definitely JupyterLab, though I can't say for certain that it isn't coming from one of the extensions. It is supposed to stop polling when the connection is lost, so I think it's a bug in JupyterLab that polling continues after the connection is lost. I can't say for sure where, but it's worth reporting to jupyterlab.

references:

lahwaacz commented 1 year ago

@minrk Thanks, I reported it in https://github.com/jupyterlab/jupyterlab/issues/15211. Sorry for mixing a possibly unrelated issue in here.

minrk commented 1 year ago

No worries, I think it's the same general issue, just the exact manifestation may change over time.