Open shivghai opened 7 months ago
😬 This is going to be a tough one to figure out. Since we can't reliably reproduce this we're going to have to guess based on code changes. So realistically the only way for us to fix this is if you (or someone else who comes along in the future) can help narrow this down to a specific minor release, or, even better, a specific canary release.
@shivghai Could you please start by saying exactly what version you upgraded from and what version you first saw this issue on?
Upgraded from 6.6.2 to 7.4.2. FYI, I saw the topic/streaming label added - we're not using any Realtime features (just in case that was the assumption). Just to be clear, the 500s that we are seeing from the server are for GraphQL queries (presumably).
This is very reproducible for our deployed app (and I imagine it will be for others too). Just have to refresh any page with >1 queries very very quickly (do it about 20 times and maybe it's reproducible on your app?).
I'll keep this issue open while I try to figure out which release it is related to. Shot in the dark would be it's because of Node 20
Upgraded from 6.6.2 to 7.4.2. FYI, I saw the topic/streaming label added - we're not using any Realtime features (just in case that was the assumption). Just to be clear, the 500s that we are seeing from the server are for GraphQL queries (presumably).
Thanks for the clarification! I'm deep into React Server Components at the moment, so when I saw ERR_STREAM... my mind immediately jumped to React Streaming. As in streaming the frontend from server to client. But that was clearly not what you were talking about 😅
Hi @shivghai - just following up on the status and if we can close this issue or if there is something more we can investigate.
I haven't been able to investigate much. We're still seeing this though, just filtered this out of our monitoring. I imagine if you're actually running a production app you'd see this very very consistently
We're still seeing this though
The only thing I can think of is perhaps there is a connection reset -- either on server not accepting connections (or maxing out) or perhaps the database resetting the connection due to also maxing out the pool size?
Do you see any evidence of database connection issues in logs?
I am seeing the same errors:
We are also seeing this in production after going to RW7. Similarly to @shivghai, we've been filtering these messages out, but it is very noisy (happens ~10K times per day)
After a little digging, here are some observations:
Hey all, I'm collecting some more info about the deploy setup. What deploy provider you all are using? And if you are using something like docker-compose or other orchestration platforms?
@callingmedic911
We deploy serverfully to AWS ECS (fargate). No docker compose, but we do use nixpacks to build before deploying.
Here's a sample APM trace when this occurs:
POST /graphql
is a span from fastify on the server sidequery
and nested turquoise are graphql spans resolving the querySome observations:
@hankedori Thanks for digging in, your observations are pretty on point! @dthyresson and I paired earlier today and are on the same track.
This isn't causing an actual defect overall but I understand the log is noisy. I'll share more updates as I go.
Great to hear -- your time and attention is much appreciated!
I was conflating a couple symptoms into one issue.
1) log line with 🌲 stream closed prematurely
https://community.redwoodjs.com/t/stream-closed-prematurely/7314/3
This is intentional behaviour. https://github.com/fastify/fastify/pull/2190#discussion_r405683891
2) The one reported here:
Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
...
I haven't been able to replicate the exact log/error with. I'd appreciate if someone can share a repro.
@callingmedic911 Deployment - ECS + Fargate. My repro is clicking around a page which fires off multiple queries very very quickly
What's not working?
Never really saw ERR_STREAM_PREMATURE_CLOSE 500s in our logs ever before. After the upgrade to 7.x, on our deployed app, I see this issue constantly:
Initially saw this issue (500s accompanied with this error) happen with our end to end tests, and then with some customers. Was eventually able to reproduce it, I kid you not, by just refreshing the page a few times. Can't repro this locally because presumably the stream interruption can't be reproduced without a certain amount of latency expected in an actual deployed scenario.
I don't think we should have to handle 500s due to this specific issue, though please advise if this is expected behavior. This seems to be a lower level issue being bubbled up as a 500.
How do we reproduce the bug?
Deploy app + refresh a page with some graphql queries (presumably) a bunch of times
What's your environment? (If it applies)
Are you interested in working on this?