openfoodfoundation / openfoodnetwork

Connect suppliers, distributors and consumers to trade local produce.
https://www.openfoodnetwork.org
GNU Affero General Public License v3.0

Slowness on OFN Canada #6802

Closed: sauloperez closed this issue 3 years ago

sauloperez commented 3 years ago

Description

This stems from https://openfoodnetwork.slack.com/archives/CDLKH9MM0/p1612209641022100.

Although we doubled CA production's RAM from 4 GB to 8 GB on January 25th, we keep getting reports from users complaining about the slowness of the app, specifically around payments. The delays (15s or more) make the payments screen practically unusable.

Digging a bit more, I noticed this is also affecting UK and FR, so it's an issue with the app itself, possibly aggravated by what's being changed in https://github.com/openfoodfoundation/ofn-install/pull/700?

Investigating the traces' flame graphs should shed some light on this; see https://app.datadoghq.com/apm/trace/2402541302811980505?spanID=3158627724379369164&env=production&sort=time&colorBy=service&spanViewType=metadata&graphType=flamegraph&shouldShowLegend=true for example. It looks like an N+1, but it seems we just instantiate Spree::Calculator objects endlessly :thinking:?

Expected Behavior

Most of these requests should have a latency on the order of milliseconds.

Actual Behaviour

Sometimes the p50 of Spree::Admin::PaymentsController jumps up to ~21s. The p95 goes to nearly 40s :scream:

UK's and FR's worst latencies go up to 22s and 10s respectively, so CA roughly doubles UK's worst case and quadruples FR's.

Steps to Reproduce

TBD

Animated Gif/Screenshot

The problem seems to be isolated and scoped to payments.

Screenshot from 2021-02-03 12-58-38 These latencies are a complete anomaly.

Screenshot from 2021-02-03 12-59-39 And affecting all requests in equal measure. The issue doesn't seem to be related to the size of the hub's data.

Screenshot from 2021-02-03 13-03-08 Response times don't even follow a known distribution.

Workaround

None

Severity

This would be an S1 because it's affecting payments, but it doesn't happen all the time and may not affect all kinds of payments in the platform, so we can leave it as an S2.

Matt-Yorkley commented 3 years ago

I made a PR around this recently and it was merged: https://github.com/openfoodfoundation/openfoodnetwork/pull/6701

Maybe it's a separate issue? :thinking:

sauloperez commented 3 years ago

Further investigation.

This is affecting all actions, whether read or write, so it could be the initial data fetching they all require.

Also, in all the flame graphs I checked, the longest span is an activerecord.instantiation of Spree::PaymentMethod, with N calls to fetch individual calculators.

Screenshot from 2021-02-03 13-53-09

Datadog reports 553 instantiated Spree::PaymentMethod objects and there are 725 in total in CA's DB, so this could really be the result of:

https://github.com/openfoodfoundation/openfoodnetwork/blob/79f98bdfa7ee4da79c4cf92231791f2512286a9d/app/controllers/spree/admin/payments_controller.rb#L113-L128
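
For illustration only, here's a hedged sketch of the kind of pattern that would explain the trace: every Spree::PaymentMethod is instantiated and each one lazily loads its calculator with a separate query. The scope name and the eager-loading fix below are assumptions, not the actual OFN code:

```ruby
# Hypothetical illustration of the suspected N+1 (not the actual OFN code).
# Every payment method instantiated here lazily loads its calculator, so
# 553 methods means hundreds of extra queries and object instantiations.
payment_methods = Spree::PaymentMethod.all
payment_methods.each do |payment_method|
  payment_method.calculator # one extra query per payment method
end

# Hedged fix: eager-load calculators and scope the query to the methods the
# order's distributor actually offers (`for_distributor` is a made-up scope,
# and `order` stands in for the order being paid).
payment_methods = Spree::PaymentMethod
                    .includes(:calculator)
                    .for_distributor(order.distributor)
```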

Matt-Yorkley commented 3 years ago

Ah, I think that flame graph is from before the release that contained #6701 was deployed...

sauloperez commented 3 years ago

yes, could be. Let me find a more recent one.

Matt-Yorkley commented 3 years ago

A couple of weeks ago it was taking up to 53 seconds, now it's between 0.1 and 6 seconds.

Matt-Yorkley commented 3 years ago

Early January: Screenshot from 2021-02-03 13-04-31

This week: Screenshot from 2021-02-03 13-04-40

P90 moved from 21 seconds to 0.8 seconds :tada:

sauloperez commented 3 years ago

I couldn't find any slow trace related to this controller. The top 5 slowest requests are:

Screenshot from 2021-02-03 14-02-48

I could only guess that payments during checkout are what we're talking about, but we need @tschumilas to answer the following questions before we move this forward:

tschumilas commented 3 years ago

The requests we've received are from hubs. 3 hubs have said it's slow to process payments. But I wonder if that is misleading. I suspect that processing payments on orders is the primary action these hubs are doing. They are not likely doing other admin tasks like creating order cycles or payment/delivery methods, since they are all set up already. Yesterday I was doing a demo for a farm, and I was showing all the set-up processes. This was a newly created enterprise with only 1 product, no other associated permissions, and no orders. I was counting 10-15 seconds every time I switched pages - from dashboard to payment methods, back to dashboard, from dashboard to orders... I wasn't processing payments at all. At the time I thought it might just be my internet - but my icon showed good signal strength. (Sorry if this is all irrelevant.)

I do not think this is affecting automated payments. It is affecting manual payments that the hubs need to make on the orders page. For example, I set up a subscription with a Stripe payment yesterday, and it seemed to process fine. We are still using Stripe Connect. (So far my attempts with Stripe SCA have not been successful here - but I think that is a separate issue - banks here may not be accepting Stripe SCA - I'm still testing that.)

If it's helpful - the one big hub that experiences this processes quite a lot of payments at their pickup: Baileys Local Foods - UID 80. They are a legacy hub here - and very supportive, so I can ask them to monitor specific things if that is helpful. Their current order cycle will close on Thursday, Feb 4 (this week). Their pickup will be Monday, Feb 8, starting around 3:00 EST. They adjust all orders and process all payments manually while customers wait at pickup. They have reported the slowness at the last 2 pickups.

tschumilas commented 3 years ago

Oh - forgot to say - we have received no comments about slowness at the cart/checkout stages. I have placed test orders in several places, and that seems fine to me. No slowness. No hanging.

RachL commented 3 years ago

@tschumilas to understand a bit more what "processing payments" means for the 3 hubs: are they adding adjustments and then validating payments, or are they really just editing / adding / deleting products and then capturing payments?

tschumilas commented 3 years ago

I am not completely sure what steps we call adjustments. They will be: adding/deleting to an order, then making a payment (payments - new payment), then validating that new payment. That's where they say it hangs. BUT - I have just done that with a few test stores and it is not slow. Indeed, at the moment, nothing is slow. But I promise - yesterday at 3:00-4:00 EST it was impossibly slow to do anything in admin. All the hubs that have reported slowness making payments reported it around the same time too. So whatever it is, it seems transient. I'm fine if we just get these hubs to document their experience more clearly - and I can do that too - before we invest more time in this.

sauloperez commented 3 years ago

It seems to me, then, that we might be facing two separate, though surely related, things.

Yesterday I was doing a demo for a farm, and I was showing all the set-up processes. This was a newly created enterprise with only 1 product, no other associated permissions, and no orders. I was counting 10-15 seconds every time I switched pages - from dashboard to payment methods, back to dashboard, from dashboard to orders... I wasn't processing payments at all

and

Then making a payment (payments - new payment), then validating that new payment. That's where they say it hangs.

Did the hub you mention in https://github.com/openfoodfoundation/openfoodnetwork/issues/6802#issuecomment-772624877 (is it Baileys Local Foods - UID 80?) complain about the slowness of all /admin pages, or just about saving a new payment, @tschumilas? Let's see if we can clearly separate the two symptoms and find their root causes.

tschumilas commented 3 years ago

So Maryrose (who manages hub #80 - Bailey's Local Foods) says, "Slowness was everywhere Monday - moving between menus, loading orders, order cycles... This seemed different - the previous week it was just slow when I tried to make a manual payment on an order. But payments got quicker after last week's upgrade." So my take is the upgrade might have fixed the slowness when making a manual payment on an order, and maybe you are right @sauloperez and this is different from the more general admin slowness. She is referring to Monday this week. My demo that was slow was Tuesday PM. Today I am not experiencing slowness though.

sauloperez commented 3 years ago

Thanks to the new monitoring alert "Service web-server has a high p90 latency on env:production" that I configured last week, we got notified of a situation similar to the one that made us create this issue. A traffic spike, or simply a combination of requests to our worst-performing endpoints, led to awfully large request queueing, which we should consider downtime. I'm leaving the traces I collected here for further investigation; I haven't done any yet.

Monitoring event: https://app.datadoghq.com/monitors/30064970?from_ts=1613231040794&to_ts=1613234202771&eval_ts=1613234202771

Traces:

Matt-Yorkley commented 3 years ago

Alright, so if we summarise the problem in the hope of moving towards some kind of actionable outcome: this is a general performance issue where the server can slow down temporarily under certain conditions, like high traffic and users hitting our worst-performing pages and endpoints. I don't think we have a solid idea about which exact parts of the app cause(d) this; it's probably a combination of various parts, and it's probably not specific to Canada. Is that a fair assessment?

So then the question is: what do we want to do about it? Prioritise a bit of performance blitz?

If we go by the top 3 in terms of P50, we get: Screenshot from 2021-02-15 09-57-52

Or if we go top 3 for total time, we get: Screenshot from 2021-02-15 09-57-56

Maybe we could take a rummage around and see if there are any quick wins to be had in the above places...?

sauloperez commented 3 years ago

I don't think we have a solid idea about which exact parts of the app cause(d) this; it's probably a combination of various parts, and it's probably not specific to Canada.

I think the best we can do now is to explain that in the delivery train meeting today, plus spend 30 min digging a bit more to find out which endpoints were hit during that "downtime" and see if we can correlate anything. As a deliverable, I'd enable Nginx log management for CA in Datadog to help with that. I believe it's currently off.

lauriewayne commented 3 years ago

Possibly related: on the US server openfoodnetwork.net I am noticing a lag of 10-15 seconds when going to /admin from the main screen. This is new for us (possibly a coincidence that we just moved to DO, where CAN is). Still exploring, but wanted to note this new-to-us experience.

kirstenalarsen commented 3 years ago

Is that as super-admin, @lauriewayne? If it is, it's a product of you now having more enterprises, which starts to slow down the main super-admin page (the listing of enterprises). It has never been prioritised because it only affects super-admins and we don't rate :) What we all do is simply never use that page: log in as admin and then go straight to one of the other pages, /orders or /reports. I have a direct link saved that takes me into an individual shop's settings, from which I can easily switch to any other shop: https://openfoodnetwork.org.au/admin/enterprises/100-mile-foodie/edit#/primary_details

lauriewayne commented 3 years ago

@kirstenalarsen yes, it is as super-admin - this is a (perceived) change from just a few days ago vs today - I don't remember ever seeing the "about:blank" intermediary page before the last day or so.

RachL commented 3 years ago

From delivery train today:

  1. Possible way forward: increase the number of unicorn workers and make it configurable in ofn-install. Something to try out; maybe at least bump it to 3 workers like AU? (See the sketch after this list.)

  2. Add more logs and monitoring to understand where the problem comes from

  3. Bring the result of the spike to product curation so we can decide what to do next
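
As a rough reference for point 1 above, this is what bumping the worker count might look like in a standard config/unicorn.rb; the actual file, paths and defaults templated by ofn-install may differ, and the env var name here is an assumption:

```ruby
# config/unicorn.rb (sketch only; ofn-install manages the real template)
worker_processes Integer(ENV.fetch("UNICORN_WORKERS", 3)) # e.g. 3 like AU

timeout 30
preload_app true

before_fork do |_server, _worker|
  # Close the master's DB connection so each forked worker opens its own.
  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end

after_fork do |_server, _worker|
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end
```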

sauloperez commented 3 years ago

We need a few more days to compare metrics, but I'm not seeing any spike in https://app.datadoghq.com/apm/service/web-server/http_server.queue?end=1613581051593&env=production&fullscreen_end_ts=1613581064106&fullscreen_paused=false&fullscreen_section=overview&fullscreen_start_ts=1613408264106&paused=false&start=1612976251593. I'm hoping that increasing the number of workers was enough :crossed_fingers: :crossed_fingers: :crossed_fingers:

andrewpbrett commented 3 years ago

I moved this to Dev Test Ready, the test being to keep an eye on it for a few more days and, if we don't see any more spikes, close the issue. Does that sound right?

sauloperez commented 3 years ago

Yep, I'd wait a couple more days so we cover a whole week. We applied the change late on February 16th, so let's see if the web-server service (Nginx's request queuing) stays under control instead of spiking like it used to: https://app.datadoghq.com/apm/service/web-server/http_server.queue?end=1613981589041&env=production&fullscreen_end_ts=1613981240000&fullscreen_paused=true&fullscreen_section=overview&fullscreen_start_ts=1612771640000&paused=false&start=1612771989041
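
For context on where the web-server / http_server.queue numbers come from: with the Datadog Ruby tracer, request queue time is computed from a timestamp header that Nginx sets before handing the request to Unicorn. A minimal sketch, assuming the ddtrace 0.x Rack integration; the exact OFN/ofn-install configuration may differ:

```ruby
# config/initializers/datadog.rb (sketch, ddtrace 0.x style)
require "ddtrace"

Datadog.configure do |c|
  c.use :rails, service_name: "openfoodnetwork"
  # request_queuing reads the X-Request-Start header set by Nginx and reports
  # the time spent queueing as `http_server.queue` spans on the `web-server`
  # service (the metric linked above).
  c.use :rack, request_queuing: true, web_service_name: "web-server"
end

# Nginx side (not Ruby, shown here as a comment): the vhost needs something like
#   proxy_set_header X-Request-Start "t=${msec}";
# so the tracer can measure how long requests wait for a free Unicorn worker.
```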

tschumilas commented 3 years ago

FYI - today (Feb 22) is a Monday, and Mondays, when a couple of larger hubs close and do order management, have been provocative times on OFN-CAN. So we can check around 2:00-5:00 EST today... I'm so excited that you guys are on this!!! Big thanks

sauloperez commented 3 years ago

Cool! We'll keep an eye on it. The measures we took should surely have an impact, but they won't magically solve the reports' issues.

sauloperez commented 3 years ago

Confirmed. We considerably mitigated the request queueing, although we still have slow requests due to running out of workers. Overall, the situation seems better.

Screenshot from 2021-03-01 11-12-33

:point_up: the bar roughly indicates the date of the change

We didn't add the Nginx log monitoring we mentioned in:

  2. Add more logs and monitoring to understand where the problem comes from

But I think the service traces provided enough.