gradio-app / gradio


Losing connection and can't re-establish it #8881

Open BeatWolf opened 4 months ago

BeatWolf commented 4 months ago

Describe the bug

After a good day of debugging I'm stuck, and that is why I'm coming here.

I have a Gradio app, on the latest Gradio version, with a number of chatbot input fields that connect to an LLM; it is a kind of RAG application. It is deployed on Kubernetes behind an nginx ingress.

So now to the error. When using the application, I sometimes randomly lose the connection. It looks like this in the browser:

[screenshot]

And like this in the console:

[screenshot]

Now, this is really a double issue. The first is losing the connection: both the heartbeat and the event stream get closed. I suspect the ingress is responsible, as I see no errors on the gradio server pod (is there a way to add verbose output there?).
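
For what it's worth, the closest I can get to verbose output is turning up the Python loggers that the server stack uses before launch(). This is only a rough sketch, assuming the default uvicorn-based server that Gradio 4.x starts (the non-uvicorn logger name is a guess):

import logging

# Sketch: raise verbosity of the loggers used by the server stack.
# "uvicorn", "uvicorn.access" and "uvicorn.error" are standard uvicorn loggers;
# "gradio" is a guess at Gradio's own logger name.
logging.basicConfig(level=logging.DEBUG)
for name in ("uvicorn", "uvicorn.access", "uvicorn.error", "gradio"):
    logging.getLogger(name).setLevel(logging.DEBUG)

# ... build the app as usual, then also surface server errors in the browser:
# demo.launch(show_error=True)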

I tried pretty much every nginx annotation I could find that could remotely be related to the problem, but nothing helped. Just to be complete, here they are:

 annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600s"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600s"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "3600s"
    nginx.ingress.kubernetes.io/client-body-timeout: "3600s"
    nginx.ingress.kubernetes.io/client-header-timeout: "3600s"
    nginx.ingress.kubernetes.io/keepalive-timeout: "3600s"
    nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
    nginx.ingress.kubernetes.io/proxy-buffers-number: "8"
    nginx.ingress.kubernetes.io/proxy-http-version: "1.1"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-redirect: "off"
    nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
    nginx.ingress.kubernetes.io/enable-websocket: "true"
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "route"
    nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
    nginx.ingress.kubernetes.io/server-snippets: |
      proxy_buffering off;
      proxy_redirect off;
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
      proxy_set_header Host $host;
      proxy_set_header X-Forwarded-Host $host;
      proxy_set_header X-Forwarded-Proto $scheme;

Now, I don't have a reproduction example, because the problem does not happen locally, only on the server.

This brings us to the second issue. When the error happens, I can no longer use the application until I refresh the page. Losing the connection is not a problem in itself, I would say; what I find strange is that there is no reconnection attempt from the gradio side. I also searched for how to force a reconnection (maybe catching errors and simply reconnecting), but I found nothing in the documentation.

So, I know it's a bit mysterious. If anybody can help me understand the random connection drops, great. But I would already be very happy if I could somehow tell gradio to reconnect when there is a connection error, some kind of retry (which I know would work, because the server is fine).

Have you searched existing issues? 🔎

Reproduction

import gradio as gr

No particular code is required; it works locally but not remotely (a minimal sketch is included below).
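
The sketch below is only a hypothetical stand-in for the real RAG/LLM pipeline (a streaming echo bot), but it exercises the same queue/event-stream path that drops on the server:

import time
import gradio as gr

# Hypothetical stand-in for the real LLM/RAG call: stream an echo back word by word.
def respond(message, history):
    partial = ""
    for word in f"Echo: {message}".split():
        time.sleep(0.2)   # simulate a slow LLM response
        partial += word + " "
        yield partial     # each partial answer is pushed to the browser over the event stream

demo = gr.ChatInterface(respond)
demo.queue()  # the queue is what the browser's event stream listens to
demo.launch(server_name="0.0.0.0", server_port=7860)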

Screenshot

No response

Logs

No response

System Info

psycopg2-binary==2.9.9
pgvector==0.2.5
asyncpg==0.29.0
sqlalchemy[asyncio]~=2.0.26
greenlet==3.0.3
llama-index-core==0.10.57
llama-index-readers-file==0.1.30
llama-index-vector-stores-postgres==0.1.11
llama-index-embeddings-huggingface==0.2.1
llama-index-llms-llama-cpp==0.1.4
llama-index-llms-ollama==0.1.6
transformers==4.40.1
pymupdf==1.24.2
tqdm~=4.66.2
gradio==4.39.0

Severity

Blocking usage of gradio

nahuaque commented 4 months ago

Running into a very similar connection loss issue using Gradio on AWS SageMaker JupyterLabs for long-running processing.

It looks like downgrading Gradio to 3.50.2 helped; it seems to be working reliably now.

BeatWolf commented 3 months ago

Downgrading to 3.50.2 does not seem to solve the issue of not reconnecting.

edit: My bad, this actually solves it. Does this mean gradio does not work over unreliable internet connections in all recent versions?

nahuaque commented 3 months ago

Long-running processing (> 2 mins) is still a bit flaky on 3.50.2, but it works most of the time, whereas on 4.29 it never works; the connection is always lost.

tyxsspa commented 2 days ago

Hi, any updates? I hit this problem too when upgrading to 4.xx; it turns out there is always lag or a lost connection.