openfoodfoundation / openfoodnetwork

Connect suppliers, distributors and consumers to trade local produce.
https://www.openfoodnetwork.org
GNU Affero General Public License v3.0

[Background reports] Infinite loader #11752

Closed RachL closed 1 month ago

RachL commented 11 months ago

Description

A test scenario which ends up with an infinite loader for the user.

Steps to Reproduce

  1. Go to https://coopcircuits.fr/admin/reports/packing/customer as hub 922 or super admin with hub 922 as filter
  2. Change the start date to January 1st 2023 (keep the end date as is)
  3. Ask for xlsx file
  4. See that you get an email with the file after about 3 minutes, while the report page still shows the loader

Severity

S3 as we get the email in the end

Your Environment

mkllnk commented 10 months ago

I just tested this and I did see the download link. I could also download the report (4.4 MB). And I got an email.

@RachL Can we close or can you reproduce?

RachL commented 9 months ago

FYI I'm waiting for big OC to be back online after holidays to do a final test. I'm getting various results depending on the time I'm testing this, so I want to see if it's linked to traffic on the instance or not. Given we haven't changed anything, that's the only reason I could think of.

mkllnk commented 9 months ago

Okay, @RachL, maybe we treat this as a bug now and close the epic of background reports?

RachL commented 9 months ago

@mkllnk FYI I've reproduced it. Can it be linked to the number of reports run in parallel/at the same time?

mkllnk commented 8 months ago

I've reproduced it.

Great, how?

Can it be linked to the number of reports run in parallel/at the same time?

Are you running multiple reports in the same tab? I would need to look into that. Hm, no, you can't be running another report because you are waiting for the first one in that tab. But maybe you are running another one in another tab. It should work though. Let me know how I can reproduce it and then I may find the culprit.

RachL commented 8 months ago

@mkllnk same scenario that is in the issue. Only one tab opened. However this scenario passes if I try during low-traffic hours (evening, for example). So maybe when you try in your timezone it will always work?

I guess we can't fix this one at the moment and we need more occurrences before we can do something.

mkllnk commented 8 months ago

Interesting, I just tried to load that report and I didn't even see the loading screen. The console gave me a hint though:

The connection to wss://coopcircuits.fr/cable was interrupted while the page was loading.

The site was generally very slow.

mkllnk commented 8 months ago

Nothing in Bugsnag. The report format doesn't matter, by the way.

mkllnk commented 8 months ago

I also didn't need the big date range. The default of one month still triggered the problem.

Firefox used to have a bug related to this error message, and I tried the suggested workaround of setting the following Firefox option to false, but it didn't help:

network.http.http2.websockets

mkllnk commented 8 months ago

I haven't gotten a single report rendered in the browser yet. The connection is slow but the server is not busy. It's only 5am in France now. UK staging works without any problem. AU production works without problem. There's something weird about the websocket connection.

mkllnk commented 8 months ago

fr_prod is the only server with increased puma workers and threads. I tried setting it back to our default but that didn't change anything.

mkllnk commented 8 months ago

I was wondering what's different about fr_prod. It runs Ubuntu 20 which is otherwise only used by India and New Zealand. On nz_prod I found intermittent problems as well.

When I load the page, there's one cable connection initiated straight away. Another cable connection request is sent several seconds later. If I try to render the report before the second request, nothing happens. But if I do it after the second request, the loading indicator comes up and then the report.

I still need to find out how to debug websockets properly.

mkllnk commented 8 months ago

Turns out that Firefox doesn't have the messages tab in my dev tools. :shrug: Chromium has it though. I got the loading message and plenty of ping messages but when the report finished, it didn't get sent to the browser even though the scoped channel was connected properly.

mkllnk commented 8 months ago

There's a weird thing where the browser is subscribing twice to the channels, and in a different order.

uk-staging (success): image

fr-prod (fail): image

mkllnk commented 8 months ago

I have to stop here for now. Ideas for investigation:

dacook commented 8 months ago

@rioug found this: WebSocket error occurred: Broken pipe in the logs : https://discuss.rubyonrails.org/t/getting-a-broken-pipe-error-with-websockets/80841 Points toward an issue with our redis config, also mentioned in the stackoverflow link above. I reckon we should start with that.

Also my investigations of the logs on au_staging for the bulk products screen: StimulusReflex issues (WebSocket broken pipe)

rioug commented 8 months ago

This has an example of the redis config to potentially fix the issue: https://stackoverflow.com/questions/76255963/rails-actioncable-sporadically-failing-to-subscribe

rioug commented 8 months ago

I had a look at the solution offered in the link above and tried increasing the number of Redis connections as explained, but it broke:

New unhandled error in staging from OpenFoodNetwork Aus in gems/redis-client-0.20.0/lib/redis_client/config.rb:21 (details)
Unhandled error
ArgumentError: unknown keyword: :pool_size
Location
gems/redis-client-0.20.0/lib/redis_client/config.rb:21 - initialize

I had a look at the various gems we use: redis-client does support a connection pool via the connection_pool gem, see https://github.com/redis-rb/redis-client?tab=readme-ov-file#usage. But the 'redis' gem doesn't seem to be using it. Instead they recommend the following: https://github.com/redis/redis-rb?tab=readme-ov-file#connection-pooling-and-thread-safety
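
For reference, the approach recommended there looks roughly like this (a sketch from memory of the redis-rb README, not our code):

require "connection_pool"
require "redis"

# A pool of plain Redis clients; callers check a connection out per operation.
redis = ConnectionPool.new(size: 5, timeout: 5) { Redis.new }

redis.with do |conn|
  conn.set("mykey", "hello world")
end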

I tried to use the redis-client connection pool directly via redis but it broke. So I ended up trying this patch, adding a ConnectionPool::Wrapper when initialising the redis client:

# @client = initialize_client(@options)
@client = ConnectionPool::Wrapper.new(size: 50, timeout: 3) { initialize_client(@options) }

https://github.com/redis/redis-rb/blob/7cc45e5e3f33ece7e235434de5fbd24c9b9d3180/lib/redis.rb#L73 It didn't break! But it doesn't seem to fix our problem. I feel like I am chasing the wrong issue.

Further idea to try:

mkllnk commented 7 months ago

I've read in some threads that connecting to the same channel multiple times in one page causes issues. So one approach we could try would be to create more channels for different uses. Not as DRY, but maybe a bit clearer in the communication.
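
A rough sketch of what a dedicated channel could look like (the channel name and the session_id identifier are assumptions, not existing OFN code):

# app/channels/reports_channel.rb
class ReportsChannel < ApplicationCable::Channel
  def subscribed
    # Scope the stream to the current session so report updates only reach
    # the browser that requested them, separately from the session channel.
    stream_from "reports:#{connection.session_id}"
  end
end

The spinner and the finished report would then be broadcast on that stream instead of sharing the session channel with everything else.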

dacook commented 7 months ago

Sorry I didn't get very far with this. I found there are two parts to cable_ready, and we had inconsistent versions. So I tried to upgrade both to the latest version, but had problems. So instead I will downgrade one to ensure they're both on the same (older) version, and will see if it makes any difference.

Regarding different channels: I didn't read enough docs to set this up yet, but would try that next.

dacook commented 7 months ago

Next steps:

RachL commented 7 months ago

@dacook FYI https://community.openfoodnetwork.org/t/backoffice-ui-product-list-uplift-rollout/2921/28?u=rachel

kirstenalarsen commented 7 months ago

I've upgraded this to S1 because we can't release BUU or the new design without resolving it. There is a lot of time, effort and money sitting behind getting this out and this needs to be solved :)

dacook commented 7 months ago

Hi @Matt-Yorkley, we are having trouble with StimulusReflex which we can't get to the bottom of, and were wondering if you'd be able to take a look? At the top of this issue is a replication case which I believe is still current (and to which you should have access). The same issue has also been occurring on the new bulk products screen (/admin/products with feature toggle admin_style_v3 enabled), although I can't provide a replication case at the moment.

You can see some of the things we've investigated in comments above. We found that the websocket connection was breaking ("WebSocket error occurred: Broken pipe"), but don't know why or what we should do to fix it. Can you see any issues with our configuration or use of SR?

Matt-Yorkley commented 7 months ago

I could give you my current assessment, but it's exactly the same as my previous assessment from a year ago and I don't think anyone liked it the first time :sweat_smile: :point_right: https://openfoodnetwork.slack.com/archives/G22SQCKJA/p1682607389089729?thread_ts=1682588714.127649&cid=G22SQCKJA

I'll repost it here for reference (from Slack):

I'm taking a look at the issues and the code but I think my assessment is that pushing reports generation into a background job will not ultimately solve the performance problems, if you're getting server-melting levels of RAM and CPU usage

...

So the thing with the reports code is that it ends up having to deal with astronomically large datasets. Like, if you load a report that involves all line items for all orders in the range and you set the range to 6 months, you might be dealing with 30,000 results, which is totally bananas. As it turns out, the kind of things you would normally do with sets of objects in normal Rails code when dealing with 20 or so objects are not really adequate at that scale and can easily blow up, essentially because the effects of any little things you do can be multiplied by 30,000 times. Things like iterating over each object in the set and calling a method on it or calculating a running total for each item in an order. I devised a novel way of handling those large datasets which was a bit weird and quite nuanced and intentionally did not handle data in the normal way you would do things in a normal bit of Rails code. Uh... so the problem is that I didn't leave an instruction manual and the reports code that's been written since then kind of handles things in the way you would write normal Rails code, which doesn't work at that scale and effectively is not too dissimilar to the old reports code. To be fair, there are genuinely hard problems involved in making the reports do what they need to without blowing up when they hit astronomically large datasets. Shifting that workload to a background job will not make the current RAM and CPU usage vanish, it will just move it around a bit. Sorry to be the bearer of bad news, I assume that is absolutely not what anybody wants to hear

...

@maikel yeah I agree it could have other benefits, like improving UX and allowing more scope for puma timeouts to be reduced. but if we're shifting heavy workloads out of puma and into Sidekiq, there's also a chance it could create other problems that we've never encountered before, like... what happens if Sidekiq crashes or gets too clogged up on huge jobs? potentially emails will not be sent? etc? We don't know until we try it, but I suspect there could be a kind of out-of-the-frying-pan-into-the-fire type situation...

...

like one of the main things that Sidekiq excels at is processing large volumes of relatively small jobs with incredible speed. Will we need to implement a Sidekiq timeout mechanism to stop it getting overloaded? I don't know


Sooo... aside from the recent issues with mismatched gem/npm versions (which it looks like you've fixed now) I think the apparent issues with ActionCable etc are symptoms and not causes; the real problem is the same issue reports have always had, which is that the performance is so bad it makes the servers melt when big reports are being generated or multiple users are generating reports. My guess is that ActionCable (and probably other parts of the app) are just having a bad time at the point the server is getting overloaded, and that's manifesting in weird ways which are now less obvious because reports have been kicked into the background (swept under the rug to some extent). There's a clue here in Rachel's comment in relation to French prod:

However this scenario passes if I'm trying during low hours (evening for example)

:point_up: :point_up: :point_up:

Redis connection pooling might potentially alleviate this a little bit in the short term (I think Rails uses it by default with a pool of 5, but we could bump it to something like 15) but I don't think it will solve the fundamental issues. I think the latest version of the redis-client gem also includes the connection_pool gem by default, but the public interface for configuring it has changed a bit (so check the latest docs).

The solution is to do some really gnarly work on reports performance and fundamentally change the way reports are generated, but it's not a simple job. It would look a bit like this: https://github.com/openfoodfoundation/openfoodnetwork/pull/8145

Just a side note in relation to this comment from @mkllnk:

Hm, no, you can't be running another report because you are waiting for the first one in that tab.

There's no waiting any more! Multiple report generation jobs can now be enqueued at the same time and even by the same user in the same tab. When you hit the button, the job is added to the queue. Since Sidekiq has 5 workers and will hungrily grab anything that's in the queue, in theory if a user tries to generate a large report and it's taking a long time and they hit the button again (maybe multiple times), then Sidekiq can be generating up to 5 massive reports simultaneously in five separate threads, where even generating one might have caused issues. There's nothing stopping or limiting that, and there's no mechanism to cancel previously queued jobs if a new one is created by the same user. This is what I was talking about when I said "I suspect there could be a kind of out-of-the-frying-pan-into-the-fire type situation". In theory you could implement some basic mechanisms there for cancelling enqueued jobs if the same user tries to generate the same report before the previous one has finished or something, but that's not happening currently.
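
For what it's worth, a minimal sketch of such a guard could look like this (all names are hypothetical, and it assumes Rails.cache is backed by a store that honours unless_exist):

require "digest"

# Hypothetical guard: skip enqueuing if an identical report job was queued recently.
def enqueue_report_job(user, report_class, params)
  key = "report-job:#{user.id}:#{report_class}:#{Digest::MD5.hexdigest(params.to_json)}"

  # write returns false when the key already exists, so only the first click enqueues.
  if Rails.cache.write(key, true, unless_exist: true, expires_in: 5.minutes)
    ReportJob.perform_later(user, report_class, params)
  end
end

Something like the sidekiq-unique-jobs gem could achieve a similar effect at the queue level, but either way it needs deliberate design rather than falling out for free.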

mkllnk commented 7 months ago

@Matt-Yorkley, we probably didn't give you the full picture. I've reproduced this problem locally in dev with small reports. A simple POST is triggering the report generation and, even if we ignore the background job, sometimes even the simple cable_ready event to display the spinner doesn't get through to the browser. There's some problem with the subscription to channels. And we were hoping that you may have encountered something like that before. Searching online brings up a few threads about people having problems when a page subscribes multiple times to the same channel, which we seem to do. In theory that's fine but sometimes it fails. It seems to be a race condition. The only advice we found was to create more channels and not subscribe to the session channel twice.

Edit: and we face the same problem with the products screen which has nothing to do with background jobs.

Matt-Yorkley commented 7 months ago

I had a quick look and spotted this too: for each row in a report, the value of Spree::Config[:currency] gets checked (each time) which causes a read from the cache (Redis). So if there are thousands of rows in a report, there will be thousands of sequential cache reads triggered all at once. It looks like this:

Screenshot from 2024-03-22 01-52-55

That might be partly what's clogging up Redis, which could potentially also cause issues with Sidekiq and/or ActionCable (websocket subscriptions are handled by Redis's pub/sub layer). The fun part is: the value of that preference is just a string like AUD, and it's available in an ENV variable so it doesn't even need to be read from Redis in the first place via preferences (you can just do ENV["CURRENCY"] instead of Spree::Config[:currency]).
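
A minimal sketch of that idea (the helper name is made up; it assumes CURRENCY is set in the environment as described):

# Hypothetical helper: resolve the currency once per process instead of
# hitting Redis-backed preferences for every report row.
module CurrencyLookup
  def self.currency
    @currency ||= ENV.fetch("CURRENCY") { Spree::Config[:currency] }
  end
end

The trade-off of per-process memoisation is that a changed preference only shows up after a restart, which should be fine for something as static as the currency.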

Matt-Yorkley commented 7 months ago

@mkllnk in the screenshot you posted above there are duplicate subscription requests from the client side, but the server only sends back one corresponding subscription confirmation in each case. I don't think that results in a duplicate (only one request is confirmed per subscription)? I can't replicate it on current master either.

Can you still reproduce it and if so do you have some steps?

Matt-Yorkley commented 7 months ago

It looks like the products index page (as well as the shopfront and other pages) also fire off a massive array of repeated cache reads via Spree::Config[:currency] and other things like Spree::Config[:available_units], at the point just after the page has loaded in the DOM (which is also when ActionCable is trying to access Redis). I would look at reducing/removing those and maybe look at Redis connection pooling and see if that helps. I can't see anything wrong with the rest of the configuration.

mkllnk commented 7 months ago

Can you still reproduce it and if so do you have some steps?

A quick local test didn't reproduce it but the original steps:

Gaetan tried to adjust the redis connections but that didn't help. I can observe four cable connections in my browser. One receives the loading spinner. Another later one is still stuck in a loop trying to connect.

There's definitely something strange happening here. While one cable connection is open, another one is opened and at some point the old one is closed. It looks like this:

Screenshot from 2024-03-22 16-39-23

Maybe there are times when the server is busy due to the report and the connection is closed because the ping is not delivered in time? And if there's a gap in the connection then we may miss events, like the one saying the report is ready?

Matt-Yorkley commented 7 months ago

Okay, I can see a bit more after trying it in both French prod and in dev. Here are my notes:

When I load the customer packing page, the app is generating an entire report even when the form hasn't been submitted yet and then basically discarding it. It's not even shown on the page. It shouldn't be doing that. The page takes ~35 seconds to load on French prod as admin (without even submitting the form). During this time ActionCable is trying to establish a websocket and maybe fails and retries to reconnect once or twice, then it succeeds, but the client side struggles to get subscription confirmations (what you see in that screenshot you posted is the client-side component of ActionCable aggressively trying to establish its subscriptions but getting no answer from the server. It's actually "normal behaviour" from the point of view of the little javascript class that's responsible for establishing subscriptions; the abnormal bit is that it's receiving no responses for about 30 seconds when the page loads). That failure to get any response from the server is still potentially ongoing when I submit the form.

The underlying causes are what I've described above: reports performance issues. In this case it's triggering an incredibly expensive report generation for no reason, which is hammering the CPU and memory, and it's also hammering Redis with hundreds if not thousands of superfluous requests at the same time (see above), so ActionCable is struggling to connect or send/receive any messages for a brief period. As a side note, if you wait for the page to properly finish and settle down before submitting the form, it works. The problem is worse (more visible) on French prod because it's production-scale data, and the problem is less bad during off-peak hours because the server has more resources available. Before pushing the report generation "into the background" this was essentially happening in Puma in the main thread instead, and blocking the page render, hence the problem presented differently.

Side note: a similar thing is happening on the v3 version of the products admin page.

It's like the reveal at the end of every Scooby Doo episode where it turns out the ghost was actually the old janitor...

ofn-reports

Matt-Yorkley commented 7 months ago

To clarify a bit more: some of the reports pages are generating a whole report when the page loads just to get the column headers so it can list them in the dropdown in the form that lets you choose report columns to show or hide, which is insane. That's causing a fairly short-lived but substantial spike in resource use exactly at the point ActionCable is trying to connect.

So firstly it needs to stop doing that just to display the form, and secondly the reports generally need to stop melting the server and hammering Redis in the way they currently do. I've had a quick look and I estimate reports can be at least 10 times and in some cases 20-25 times faster if they're correctly using the QueryInterface and if some of the performance regressions that have accumulated in the last few years are carefully removed.

And I think that would also solve various other issues like this one: https://github.com/openfoodfoundation/openfoodnetwork/issues/11992#issuecomment-2014230531 (it's Scooby Doo reveal time again, the monster is reports)

Matt-Yorkley commented 7 months ago

If you haven't grokked what I mean about Redis getting hammered via things like Spree::Config[:currency], I can give an example: imagine we're generating a report and there are 5000 line items in the date range. When we generate each line, if it shows a currency symbol by calling a helper method which calls Spree::Config[:currency], that's +5000 calls to Redis. If it shows an internationally formatted string it might call Spree::Config[:currency_thousands_separator] for another +5000. Then maybe it shows something related to unit weight, which calls WeightsAndMeasures, which loads the Spree::Config[:available_units] value, for an additional +5000 requests. ActionCable is trying to get one or two small messages through to Redis at this point, while it's under an avalanche of other requests.

Under normal circumstances, there might be say 20-30 requests to Redis when loading a page, and it's fine. Redis is fast. It's a miracle of software engineering. But when we're getting into the tens-of-thousands range for a single action from a user then it's not going to be pretty. Redis is still a database at the end of the day.

Soooo... if we memoize some of those little values for relatively-static config options that can be called many times in a report (maybe along the lines of what you've done recently in https://github.com/openfoodfoundation/openfoodnetwork/pull/12004) we could be making zero additional requests to Redis in the above scenario, instead of 15 thousand. It would also help in other parts of the app (serializing products, etc) where lots of objects can repeatedly fetch the same config value.

Those big scary numbers (15k requests) represent a relatively small report: if we consider a shop with 100 orders per week and 10 line items per order, 5000 line items is only 5 weeks of data. A large hub loading a year's worth of data might be more like 260,000 line items. This is my point about seemingly innocuous things that would be totally fine in a normal bit of Rails code being potentially catastrophic in reports code, due to the difference in scale of the data being handled. It's a different ballgame; there are big consequences to small actions in a way that's not intuitive. Reports code requires a totally different approach than regular code, or it blows up.

Matt-Yorkley commented 7 months ago

So to summarise, there are multiple overlapping problems, which are:

  1. Some reports pages are generating a full report just to display the columns in a dropdown in the form. That needs to be fixed, and it should be easy.
  2. Some of the values that get called via Spree::Config are getting called repeatedly for no reason and they should be memoized differently (in app memory instead of the cache). That's really easy to fix.
  3. Some sort of mechanism around the enqueuing of report jobs would be good, so that if the same user tries to generate a report with the same params within a certain period we check if the same job has already been placed or if it's already running, and don't add a duplicate to the queue but instead show the existing report or wait for the current one to finish. That needs some finesse, but it's not too hard.
  4. Performance in reports is bad and causes various problems. That can also be fixed, but it's not so easy.

I think if you fix those two easy problems first there's a good chance that this issue will be mostly resolved, or at least not an S1 any more.

rioug commented 7 months ago

Redis connection pooling might potentially alleviate this a little bit in the short term (I think Rails uses it by default with a pool of 5, but we could bump it to something like 15) but I don't think it will solve the fundamental issues. I think the latest version of the redis-client gem also includes the connection_pool gem by default, but the public interface for configuring it has changed a bit (so check the latest docs).

I looked into this, and it's true that the redis-client gem includes connection_pool, but I couldn't see any way to use/configure it. I tried patching some code but it did not seem to work. Maybe something to revisit if fixing the calls to Spree::Config doesn't solve the problem.

dacook commented 7 months ago

Wow, thanks for taking such a deep dive Matt, and providing a new perspective. This is really illuminating. I find it concerning that heavy resource usage can have indirect impacts like that, but it seems to make sense.

Thanks for fixing the first two points, they look good to me. Can you confirm the PRs are ready for review?

rioug commented 7 months ago

So I went and checked New Relic and we have monitoring for Redis. You can find it under Infrastructure -> Third party services -> Redis. However it's not configured properly: it's using port 6379 when we use port 6380 for our Rails app. I fixed it on au_staging (see https://newrelic.com/blog/how-to-relic/how-to-monitor-redis-performance). I can now see a spike when loading the product page, and I also observed the client side spamming for connections. It points to the same problem Matt is describing above, i.e. Redis gets overloaded on page load.

It would also explain the Broken pipe errors in the logs, from a comment above:

@rioug found this: WebSocket error occurred: Broken pipe in the logs : https://discuss.rubyonrails.org/t/getting-a-broken-pipe-error-with-websockets/80841 Points toward an issue with our redis config, also mentioned in the stackoverflow link above. I reckon we should start with that.

This gives some data to check if Matt's PR fixes the issue.

Additional TODO:

dacook commented 6 months ago

Hi @filipefurtad0, when the problem occurs, this error appears in the Rails server logs: E, [2024-02-06 04:44:06 #8553] ERROR -- : WebSocket error occurred: Broken pipe

In the browser, I think no error appears in the dev tools. The browser is simply waiting for the server to respond, yet it never does.

So when testing, if you see the above error, you'll know straight away that it failed.

filipefurtad0 commented 6 months ago

E, [2024-02-06 04:44:06 #8553] ERROR -- : WebSocket error occurred: Broken pipe

Perfect - thanks @dacook ! I'll look for that one :pray:

filipefurtad0 commented 6 months ago

Hey @dacook ,

I could not really spot the error, although I've tried to rock the server: several parallel tabs open, long reports as superadmin... I did get to see the infinite loader if I run the report before those PRs. The report seems to render (and the email also arrives :muscle:) after merging the PRs, which could indicate the issue is fixed.

Still, I've added the prod-test label, so we keep this one in mind, after deploying the patch release.

dacook commented 6 months ago

The related PR has been deployed for this, I will test shortly.

dacook commented 6 months ago

I've tested on au_prod and fr_prod, and unfortunately couldn't see any existing problems, or conclusive improvements for the bulk loading of records. (This might be partly because I forgot to test before deploying 🤦). However, it's good to see the packing report screen loads much quicker now, and I've found a potential lead for improving the BUU products screen loading!

To fix:

dacook commented 6 months ago

Hokay, replicated on fr_prod (v4.4.36-modified, with latest Redis fix) this morning while I was looking for something else. I think this is consistent with what we were seeing before, but I feel like I have a new perspective.

  1. Page loads, opens up a websocket and spams the server trying to subscribe to both SessionChannel and StimulusReflex::Channel for 11 seconds until 09:37:55 Screenshot 2024-03-28 at 9 57 33 am
  2. After a few seconds break, a new websocket is opened. SR subscribes successfully, and requests reflex Products#fetch. A second later, the server stops sending back a ping... Screenshot 2024-03-28 at 9 44 13 am
  3. After a 10 second break, a third websocket is opened. It achieves subscription, but knows nothing about what requests were sent already, so does nothing. Screenshot 2024-03-28 at 9 44 28 am

The "Broken pipe" error is occurring regularly (several times in the last hour, I can see shoppers are currently using the site). Interestingly it wasn't logged at the same time as this example. The next one was over 5 mins later.. maybe the request took that long.. (but I wasn't able to interpret the logs to prove it) E, [2024-03-27 22:45:34 #2192118] ERROR -- : WebSocket error occurred: Broken pipe

There was a spike in CPU usage, and a minor spike in Redis usage at that time, quite likely due to the Products#fetch. As I mentioned, there are other shoppers using the system at the same time. Screenshot 2024-03-28 at 9 48 25 am Screenshot 2024-03-28 at 9 49 38 am

So, while the server was busy, it stopped sending a ping; is that normal? And after 10 seconds, the client decided to re-establish the connection. That's probably because the WebSocket and ActionCable layers are not aware of the application, which was waiting for a response. In fact, I don't think even StimulusReflex has a concept of waiting for a response. If that's the case, maybe we should reconsider what we use it for.

mkllnk commented 6 months ago

while the server was busy, it stopped sending a ping, is that normal?

I don't know. It's very disappointing. The server has two workers (processes) with three threads each. It should be able to serve 6 requests in parallel. We have 8 CPUs, some of which may also be busy with the database or Redis, but I would have expected the server to process a simple ping within a second. Can we extend the timeout?

mariocarabotta commented 6 months ago

From now onwards this issue is going to relate to background reports only. @dacook will create a separate issue for BUU.

mkllnk commented 6 months ago

Summary of identified problems

Large on-screen payloads don't reach the browser

We found that generating a large report to show on the screen results in the infinite loader. The HTML can be several megabytes in size and is sent from the report job but then disappears somewhere. We don't know what the limit is and which layer drops it (cable_ready, ActionCable or Redis?). But a 4MB report is dropped. Requesting the same report in a download format sends only the link to the report to the browser and that works.

The easiest workaround would be to check the size of the HTML and only link to the blob, as with downloadable reports. The quickest solution is to just display a link like with the other reports. Next, we could render the HTML in an iframe or load the content with JS, as in element.innerHTML = await (await fetch(url)).text() (not tested).
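
A sketch of the size check, with hypothetical method names and an arbitrary threshold to illustrate the idea rather than actual job code:

MAX_INLINE_REPORT_HTML = 1.megabyte # threshold is a guess and would need tuning

def deliver_report(html, blob_url)
  if html.bytesize > MAX_INLINE_REPORT_HTML
    # Too large to push through Redis pub/sub and ActionCable reliably:
    # send the browser a link to the stored blob instead.
    broadcast_report_link(blob_url)
  else
    broadcast_report_html(html)
  end
end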

A busy server results in unstable web sockets

When the server is very busy, for example compiling a big report, it stops sending pings to the browser. The client has a timeout of 10 seconds after which the connection is closed and it tries to open a new one. During this time, messages are lost and don't reach the browser. ActionCable is a simple broadcasting service and discards messages when nobody is listening. The client timeout can be configured:

ActionCable.ConnectionMonitor.staleThreshold = 10;

We could also increase the number of threads handling web sockets:

config.action_cable.worker_pool_size = 4

We also need to increase the number of database connections for this one.
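
Roughly, the two settings would need to move together; a sketch, assuming each ActionCable worker may hold a database connection:

# config/environments/production.rb (sketch)
config.action_cable.worker_pool_size = 4

# The Active Record pool in config/database.yml would then need to cover
# Puma threads plus the cable workers, e.g.:
#   pool: <%= ENV.fetch("RAILS_MAX_THREADS", 3).to_i + 4 %>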

dacook commented 6 months ago

Great summary, thanks Maikel. Just one comment:

The client timeout can be configured:

True, but that may only put off the problem until later, when it might occur less predictably and be harder to track. Also worth noting that's an unsupported config, as noted on that post:

Please note that this approach is generally a really bad idea: messing with internal variables is one of the direct ways to bugs and issues.

mkllnk commented 5 months ago

On the problem of report size on screen:

So finding the limitation and solving it may be tricky. The limitation may be in the browser, or in ActionCable or in cable_ready. Instead, it would be much easier to offer a download link.

RachL commented 4 months ago

Copying a comment from Bethan on Slack:

3-4 complaints each month that reports haven't 'loaded' when they've actually been emailed - this inevitably results in lots of emails though as they try and regenerate the report multiple times (one customer yesterday had 7 'report ready' emails all in the same minute!). It might be helpful to have a 'your report is being generated and will be emailed to you shortly' message in the generated report section when the email process is initiated, to minimise users clicking 'go' lots...

mkllnk commented 4 months ago

Cool, I have a fix for one part of the problem. I'll create a PR next week to support displaying big reports on the screen. Then we just have the unreliable websocket connection to think about...