[Bug?]: Seeing 500s due to ERR_STREAM_PREMATURE_CLOSE after upgrade to 7.x

shivghai commented 7 months ago

What's not working?

Never really saw ERR_STREAM_PREMATURE_CLOSE 500s in our logs ever before. After the upgrade to 7.x, on our deployed app, I see this issue constantly:

Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at Readable.onclose (node:internal/streams/end-of-stream:154:30)
    at Readable.emit (node:events:518:28)
    at emitCloseNT (node:internal/streams/destroy:147:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:81:21)

Initially saw this issue (500s accompanied with this error) happen with our end to end tests, and then with some customers. Was eventually able to reproduce it, I kid you not, by just refreshing the page a few times. Can't repro this locally because presumably the stream interruption can't be reproduced without a certain amount of latency expected in an actual deployed scenario.

I don't think we should have to handle 500s due to this specific issue, though please advise if this is expected behavior. This seems to be a lower level issue being bubbled up as a 500.

How do we reproduce the bug?

Deploy app + refresh a page with some graphql queries (presumably) a bunch of times

What's your environment? (If it applies)

System:
    OS: macOS 13.6
    Shell: 5.9 - /bin/zsh
  Binaries:
    Node: 20.11.1 - /private/var/folders/8y/99byt5ms31d1z8jgjfmyvcg80000gn/T/xfs-f1930489/node
    Yarn: 3.2.1 - /private/var/folders/8y/99byt5ms31d1z8jgjfmyvcg80000gn/T/xfs-f1930489/yarn
  Databases:
    SQLite: 3.39.5 - /usr/bin/sqlite3
  Browsers:
    Chrome: 124.0.6367.62
    Safari: 16.6
  npmPackages:
    @redwoodjs/auth-auth0-setup: 7.4.2 => 7.4.2 
    @redwoodjs/cli-data-migrate: 7.4.2 => 7.4.2 
    @redwoodjs/core: 7.4.2 => 7.4.2

Are you interested in working on this?

[ ] I'm interested in working on this

Tobbe commented 7 months ago

😬 This is going to be a tough one to figure out. Since we can't reliably reproduce this we're going to have to guess based on code changes. So realistically the only way for us to fix this is if you (or someone else who comes along in the future) can help narrow this down to a specific minor release, or, even better, a specific canary release.

@shivghai Could you please start by saying exactly what version you upgraded from and what version you first saw this issue on?

shivghai commented 7 months ago

Upgraded from 6.6.2 to 7.4.2. FYI, I saw the topic/streaming label added - we're not using any Realtime features (just in case that was the assumption). Just to be clear, the 500s that we are seeing from the server are for GraphQL queries (presumably).

This is very reproducible for our deployed app (and I imagine it will be for others too). Just have to refresh any page with >1 queries very very quickly (do it about 20 times and maybe it's reproducible on your app?).

shivghai commented 7 months ago

I'll keep this issue open while I try to figure out which release it is related to. Shot in the dark would be it's because of Node 20

Tobbe commented 7 months ago

Upgraded from 6.6.2 to 7.4.2. FYI, I saw the topic/streaming label added - we're not using any Realtime features (just in case that was the assumption). Just to be clear, the 500s that we are seeing from the server are for GraphQL queries (presumably).

Thanks for the clarification! I'm deep into React Server Components at the moment, so when I saw ERR_STREAM... my mind immediately jumped to React Streaming. As in streaming the frontend from server to client. But that was clearly not what you were talking about 😅

dthyresson commented 5 months ago

Hi @shivghai - just following up on the status and if we can close this issue or if there is something more we can investigate.

shivghai commented 5 months ago

I haven't been able to investigate much. We're still seeing this though, just filtered this out of our monitoring. I imagine if you're actually running a production app you'd see this very very consistently

dthyresson commented 5 months ago

We're still seeing this though

The only thing I can think of is perhaps there is a connection reset -- either on server not accepting connections (or maxing out) or perhaps the database resetting the connection due to also maxing out the pool size?

Do you see any evidence of database connection issues in logs?

tilmann commented 4 months ago

I am seeing the same errors: CleanShot 2024-07-18 at 12 43 11

hankedori commented 4 months ago

We are also seeing this in production after going to RW7. Similarly to @shivghai, we've been filtering these messages out, but it is very noisy (happens ~10K times per day)

After a little digging, here are some observations:

Happen sporadically when resolving any gql query (no pattern here)
Error is thrown after the request has been handled
Does not correlate with db (postgres) connection issues

callingmedic911 commented 4 months ago

Hey all, I'm collecting some more info about the deploy setup. What deploy provider you all are using? And if you are using something like docker-compose or other orchestration platforms?

hankedori commented 3 months ago

@callingmedic911

We deploy serverfully to AWS ECS (fargate). No docker compose, but we do use nixpacks to build before deploying.

Here's a sample APM trace when this occurs: Screenshot 2024-08-15 at 17 27 18

Purple box is a span from the browser session
POST /graphql is a span from fastify on the server side
query and nested turquoise are graphql spans resolving the query

Some observations:

Usually, you would expect the browser and fastify spans to span the entire trace and respond after graphql has finished resolving the query
Fastify is responding to the request prematurely (with 200 success) while the rest of the query resolves
Once the query finishes resolving, it seems to try to respond to the request which has already been responded to resulting in the premature close
One theory is that this could be tied to request cancellation on the client side which we haven't tested

callingmedic911 commented 3 months ago

@hankedori Thanks for digging in, your observations are pretty on point! @dthyresson and I paired earlier today and are on the same track.

This isn't causing an actual defect overall but I understand the log is noisy. I'll share more updates as I go.

hankedori commented 3 months ago

Great to hear -- your time and attention is much appreciated!

callingmedic911 commented 3 months ago

I was conflating a couple symptoms into one issue. 1) log line with 🌲 stream closed prematurely https://community.redwoodjs.com/t/stream-closed-prematurely/7314/3 This is intentional behaviour. https://github.com/fastify/fastify/pull/2190#discussion_r405683891

2) The one reported here:

Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
...

I haven't been able to replicate the exact log/error with. I'd appreciate if someone can share a repro.

shivghai commented 3 months ago

@callingmedic911 Deployment - ECS + Fargate. My repro is clicking around a page which fires off multiple queries very very quickly

redwoodjs / redwood