grpc / grpc-web

grpc-web memory leak(?) on long-running streams #1154

Open · caesuric opened this issue 2 years ago

caesuric commented 2 years ago

grpc-web stores the entire contents of a stream in memory, which bogs the client down and causes performance issues over time, as well as using increasing amounts of memory. I have been able to work around this by cancelling and restarting requests after they exceed a certain size, but I was wondering whether this is a memory leak or working as intended.

If this is how it's supposed to work, what are the best practices for managing long-running streams? I haven't found any documentation on the subject.

Thank you!

huangshuwei commented 2 years ago

+1

thoralt commented 2 years ago

#571, #714, #920, #949, #952 and #1154 more or less describe the same problem. No progress since 2019; the developers are busy implementing new features. We were forced to switch to https://github.com/improbable-eng/grpc-web, which does keep its performance over long-running connections.

caesuric commented 2 years ago

> #571, #714, #920, #949, #952 and #1154 more or less describe the same problem. No progress since 2019; the developers are busy implementing new features. We were forced to switch to https://github.com/improbable-eng/grpc-web, which does keep its performance over long-running connections.

How does that compare? I have been going with the official implementation for maintainability reasons, but if bug fixes are not forthcoming, that may be a moot point.

My solution was to implement a StreamManager class that wraps a callback and checks on each message whether the stream has exceeded a certain size, resetting and restarting the connection if needed. Before that, my application would bog down slightly after about half an hour of continuous runtime and become extremely slow after an hour.
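
For readers hitting the same issue, a minimal sketch of this kind of wrapper could look like the following. This is not the actual code from the project above; the `startStream` factory, the `maxBytes` threshold, and the size estimate via `serializeBinary()` are illustrative assumptions.

```ts
// Hypothetical sketch of the workaround described above, not the real implementation.
import { ClientReadableStream } from 'grpc-web';

class StreamManager<T extends { serializeBinary(): Uint8Array }> {
  private stream: ClientReadableStream<T> | null = null;
  private bytesSeen = 0;

  constructor(
    // Factory that opens a fresh server-streaming call on the generated client.
    private readonly startStream: () => ClientReadableStream<T>,
    // Application callback invoked for every message.
    private readonly onMessage: (msg: T) => void,
    // Cancel and reopen the call once roughly this many bytes have arrived (assumed limit).
    private readonly maxBytes: number = 100 * 1024 * 1024,
  ) {}

  start(): void {
    this.bytesSeen = 0;
    this.stream = this.startStream();
    this.stream.on('data', (msg: T) => {
      this.onMessage(msg);
      // Approximate the accumulated size by re-serializing the message.
      this.bytesSeen += msg.serializeBinary().length;
      if (this.bytesSeen >= this.maxBytes) {
        this.restart();
      }
    });
    this.stream.on('error', (err) => {
      // Application-specific error handling / backoff would go here.
      console.error('stream error', err);
    });
  }

  private restart(): void {
    this.stream?.cancel();
    this.start();
  }

  stop(): void {
    this.stream?.cancel();
    this.stream = null;
  }
}
```

A usage example would pass a closure such as `() => client.subscribePositions(request, metadata)` as the factory, so each restart opens a brand-new underlying XHR and releases the buffered response of the previous one.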

thoralt commented 2 years ago

Your approach totally makes sense, and I also thought of doing it that way. I'm not sure about the risks of switching a project to improbable-eng/grpc-web. They call their code "production ready" and use it in their own products. Additionally, they have contributed to the grpc-web spec in the past.

We did a few performance benchmarks back then and the two implementations behaved quite similarly (apart from that bug with long-running streams). If your workaround is stable, I would keep it that way, although it feels odd.

sampajano commented 2 years ago

> #571, #714, #920, #949, #952 and #1154 more or less describe the same problem. No progress since 2019; the developers are busy implementing new features. We were forced to switch to https://github.com/improbable-eng/grpc-web, which does keep its performance over long-running connections.

Thanks for tracking down the history of this issue. The frustration is understood (and apologies for that!). There have indeed not been enough resources and attention on making improvements to the library. But regardless, I agree major pain points like this should be fixed.

I'll try to take a look at this issue in the near future and see what could be done.

And just out of curiosity, for people on this thread: what is the typical duration of the streams you'd refer to as "long-running"? And generally speaking, after how much time or data transferred do you typically start to see this causing real issues? (Update: I just saw that @caesuric mentioned streams of about half an hour to an hour, but I'm wondering if that's also when others start to observe this issue.) Thanks! :)

huangshuwei commented 2 years ago

I've switched to Microsoft SignalR.

sampajano commented 2 years ago

From chatting with the team, it sounds like the most likely reason the entire streaming body is retained is that we are still using XHR (which is known to keep the entire response body in memory).

Switching to fetch() (tracked in https://github.com/grpc/grpc-web/issues/1134) would likely address this issue (among other benefits).

I'll try to verify the theory and take a stab at switching us to fetch() in the near future. :)
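
For context, the difference can be sketched as follows: an XHR-based transport accumulates the whole response in `xhr.responseText`, while fetch() exposes the body as a ReadableStream that can be consumed chunk by chunk, so already-processed chunks can be garbage-collected. The snippet below only illustrates that browser API under those assumptions; it is not grpc-web internals, and the URL is a placeholder.

```ts
// Illustration of incremental body consumption with fetch(); not grpc-web code.
async function consumeStream(url: string): Promise<void> {
  const response = await fetch(url);
  if (!response.body) {
    throw new Error('Streaming response bodies are not supported here');
  }
  const reader = response.body.getReader();
  let totalBytes = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    totalBytes += value.byteLength;
    // Decode and dispatch the chunk here; once references to earlier chunks
    // are dropped, nothing forces the browser to keep them in memory.
  }
  console.log(`stream finished after ${totalBytes} bytes`);
}
```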

caesuric commented 2 years ago

> And just out of curiosity, for people on this thread: what is the typical duration of the streams you'd refer to as "long-running"? And generally speaking, after how much time or data transferred do you typically start to see this causing real issues? (Update: I just saw that @caesuric mentioned streams of about half an hour to an hour, but I'm wondering if that's also when others start to observe this issue.) Thanks! :)

We have a requirement to run our entire system for 30 days nonstop in a future test, and I saw this bug while investigating the scalability of the web client for that test. There is probably no need to have the web client run continuously for that entire time (since just refreshing the client fixed the issue), but it is worth noting the general time scale we are operating on. I believe the XHR calls started to slow down dramatically as the size of the stream approached a gigabyte or so. (We need to make our data transfers somewhat more byte-efficient, but it was obvious that just reducing the size a bit wasn't going to get us enough mileage in terms of time scalability.)

One other concern I have is whether this is also a problem in the other language-specific gRPC implementations we are using, which I haven't looked into yet. I am assuming it's not an issue there, since this is not how streams generally work.

thoralt commented 2 years ago

> And just out of curiosity, for people on this thread: what is the typical duration of the streams you'd refer to as "long-running"? And generally speaking, after how much time or data transferred do you typically start to see this causing real issues? (Update: I just saw that @caesuric mentioned streams of about half an hour to an hour, but I'm wondering if that's also when others start to observe this issue.) Thanks! :)

Our last project involved tracking a large number of physical objects (>10,000) in real time and displaying their positions in a browser. This created a steady stream of positional data with a bandwidth of >1 MB/s. After accumulating a few hundred megabytes over the course of minutes to hours, the browser became more and more unresponsive because its CPU usage increased dramatically.

sampajano commented 2 years ago

@caesuric @thoralt Thanks so much for the additional info! It's very helpful! :)


I've discussed the possibility of switching to using Fetch/streams with the team, and here are some brief takeaways:

- Fetch might have some feature gaps like cancellations, which could potentially be fixed but might still cause regressions.

We'll keep the conversation open and circle back as we progressively make improvements to our fetch library in Closure.

Hope that makes sense. :)

caesuric commented 2 years ago

@sampajano thanks for the updates. I appreciate the transparency.

avm99963 commented 2 years ago

I just saw this issue (I wish GitHub notified me when an issue I'm subscribed to was mentioned in another one). I also appreciate the transparency @sampajano!

sampajano commented 2 years ago

> I just saw this issue (I wish GitHub notified me when an issue I'm subscribed to was mentioned in another one). I also appreciate the transparency @sampajano!

Thanks for the support, @avm99963! 😃 I've added a comment to the issue you subscribed to, just in case others also missed this update. 😃

tommie commented 2 years ago

> Fetch might have some feature gaps like cancellations, which could potentially be fixed but might still cause regressions.

Cancellation specifically seems broadly supported: https://caniuse.com/mdn-api_fetch_init_signal_parameter
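
For reference, a minimal sketch of fetch()-level cancellation via AbortController (the URL is a placeholder, and the snippet is only an illustration of the browser API, not of grpc-web):

```ts
// Sketch of cancelling a streaming fetch() with an AbortSignal.
const controller = new AbortController();

fetch('https://example.invalid/stream', { signal: controller.signal })
  .then(async (response) => {
    const reader = response.body!.getReader();
    for (;;) {
      const { done } = await reader.read();
      if (done) break;
      // Process chunks here.
    }
  })
  .catch((err: unknown) => {
    if (err instanceof DOMException && err.name === 'AbortError') {
      console.log('stream cancelled by the client');
    }
  });

// Later, e.g. when the consumer no longer needs the stream:
controller.abort();
```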

undefined-user1 commented 2 months ago

Is this project abandoned? This seems like quite a sad thing to have to work around :/

sampajano commented 2 months ago

@undefined-user1 Thanks for your interest here. :)

No, the project is not abandoned. :)

We're working on a TypeScript migration; after that, better Fetch and service worker support will be high on the list.