Closed sauloperez closed 3 years ago
I made a PR around this recently, it was merged: https://github.com/openfoodfoundation/openfoodnetwork/pull/6701
Maybe it's a separate issue? :thinking:
Further investigation.
This is affecting all actions, regardless of whether they are read- or write-type actions, so it could be the initial data fetching that they all require.
Also, in all flame graphs I checked, the longest span is an `activerecord.instantiation` of `Spree::PaymentMethod`, with N calls to fetch individual calculators. Datadog reports 553 instantiated `Spree::PaymentMethod` objects, and there are 725 in total in CA's DB, so this could really be the result of loading nearly the whole table.
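To illustrate why those N calculator fetches matter, here is a plain-Ruby simulation of the suspected N+1 pattern (model names are borrowed for illustration only; this is not the app's code, and the usual ActiveRecord fix would be eager loading, e.g. `includes(:calculator)`):

```ruby
# Simulation only: count "queries" under the N+1 pattern vs a preload.
PaymentMethod = Struct.new(:id, :calculator_id)
Calculator    = Struct.new(:id)

calculators = (1..3).map { |i| Calculator.new(i) }
methods     = (1..553).map { |i| PaymentMethod.new(i, (i % 3) + 1) }

# N+1 pattern: one lookup per payment method.
n_plus_one_queries = 0
methods.each do |pm|
  n_plus_one_queries += 1
  calculators.find { |c| c.id == pm.calculator_id }
end

# Preloaded pattern: one batched lookup, then in-memory access.
preloaded_queries = 1
by_id = calculators.to_h { |c| [c.id, c] }
methods.each { |pm| by_id.fetch(pm.calculator_id) }

puts n_plus_one_queries # 553
puts preloaded_queries  # 1
```

With 553 instantiated payment methods, the per-record pattern issues 553 lookups where a single batched one would do, which matches the shape of the flame graph.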
Ah, I think that flame graph is from before the release that contained #6701 was deployed...
yes, could be. Let me find a more recent one.
A couple of weeks ago it was taking up to 53 seconds, now it's between 0.1 and 6 seconds.
Early January:
This week:
P90 moved from 21 seconds to 0.8 seconds :tada:
I couldn't find any slow trace related to this controller. The top 5 slowest requests are:
I can only guess that payments during checkout are what we're talking about, but we need @tschumilas to answer the following questions before taking this forward:
The requests we've received are from hubs. 3 hubs have said it's slow to process payments. But I wonder if that is misleading. I suspect that processing payments on orders is the primary action these hubs are doing. They are not likely doing other admin tasks like creating order cycles or payment/delivery methods, since they are all set up already. Yesterday I was doing a demo for a farm, and I was showing all the setup processes. This was a newly created enterprise with only 1 product, no other associated permissions, and no orders. I was counting 10 - 15 seconds every time I switched pages - from dashboard to payment methods, back to dashboard, from dashboard to orders... I wasn't processing payments at all. At the time I thought it might just be my internet - but my icon showed good signal strength. (Sorry if this is all irrelevant.)
I do not think this is affecting automated payments. It is affecting manual payments that the hubs need to make on the orders page. I set up a subscription for example with a stripe payment yesterday, and it seemed to process fine. We are still using Stripe Connect. (So far my attempts with Stripe SCA have not been successful here - but I think that is a separate issue - banks here may not be accepting Stripe SCA - I'm still testing that.)
If its helpful - the one big hub that experiences this processes quite a lot of payments at their pickup. Baileys Local Foods - UID 80. They are a legacy hub here - and very supportive, so I can ask them to monitor specific things if that is helpful. Their current order cycle will close on Thursday Feb 4 (this week). Their pickup will be Monday Feb 8 - starting 3:00 ish EST. They adjust all orders and process all payments manually while customers wait at pickup. The last 2 pickups they have reported the slowness.
Oh - forgot to say - we have received no comments about slowness at the cart/checkout stages. I have placed test orders in several places, and that seems fine to me. No slowness. No hanging.
@tschumilas, to understand a bit better what "processing payments" means for the 3 hubs: are they adding adjustments and then validating payments, or are they really just editing / adding / deleting products and then capturing payments?
I am not completely sure which steps we call adjustments. They will be: adding/deleting items on an order, then making a payment (payments - new payment), then validating that new payment. That's where they say it hangs. BUT - I have just done that with a few test stores and it is not slow. Indeed, at the moment, nothing is slow. But I promise - yesterday at 3:00 - 4:00 EST it was impossibly slow to do anything in admin. All the hubs that have reported slowness making payments reported it around the same time too. So - whatever it is, it seems transient. I'm fine if we just get these hubs to document their experience more clearly, and I can do that too - before we invest more time in this.
It seems to me, then, that we might be facing two separate things, although they are surely related:
Yesterday I was doing a demo for a farm, and I was showing all the setup processes. This was a newly created enterprise with only 1 product, no other associated permissions, and no orders. I was counting 10 - 15 seconds every time I switched pages - from dashboard to payment methods, back to dashboard, from dashboard to orders... I wasn't processing payments at all
and
Then making a payment (payments - new payment), then validating that new payment. That's where they say it hangs.
Did the hub you mention in https://github.com/openfoodfoundation/openfoodnetwork/issues/6802#issuecomment-772624877 (is it Baileys Local Foods - UID 80?) complain about the slowness of all /admin pages, or just about saving a new payment, @tschumilas? Let's see if we manage to clearly separate the two symptoms and find their root cause.
So Maryrose (who manages hub #80 - Bailey's Local Foods) says, "Slowness was everywhere Monday - moving between menus, loading orders, order cycles... This seemed different - the previous week it was just slow when I tried to make a manual payment on an order. But payments got quicker after last week's upgrade." So my take is the upgrade might have fixed the slowness when making a manual payment on an order, and maybe you are right @sauloperez that this is different from the more general admin slowness. She is referring to Monday this week. My demo that was slow was Tuesday pm. Today I am not experiencing slowness, though.
Thanks to the new monitoring alert "Service web-server has a high p90 latency on env:production" that I configured last week, we got notified of a situation similar to the one that made us create this issue. A traffic spike, or simply a combination of requests to our worst-performing endpoints, led to awfully large request queueing, which we should consider downtime. I leave here the traces I collected for further investigation; I haven't done any yet.
Monitoring event: https://app.datadoghq.com/monitors/30064970?from_ts=1613231040794&to_ts=1613234202771&eval_ts=1613234202771
Traces:
Alright, let's summarise the problem in the hope of moving towards some kind of actionable outcome: this is a general performance issue where the server can slow down temporarily under certain conditions, like high traffic and users hitting our worst-performing pages and endpoints. I don't think we have a solid idea about which exact parts of the app cause(d) this; it's probably a combination of various parts, and it's probably not specific to Canada. Is that a fair assessment?
So then the question is: what do we want to do about it? Prioritise a bit of performance blitz?
If we go by the top 3 in terms of P50, we get:
Or if we go top 3 for total time, we get:
Maybe we could take a rummage around and see if there are any quick wins to be had in the above places...?
I don't think we have a solid idea about which exact parts of the app cause(d) this; it's probably a combination of various parts, and it's probably not specific to Canada.
I think the best we can do now is to explain that in the delivery train meeting today, plus spend 30 min digging a bit more to learn which endpoints were hit during that "downtime" and see if we can correlate something. As for a deliverable, I'd enable Nginx log management for CA in Datadog to help with that. I believe it's currently off.
Possibly related: On the US server openfoodnetwork.net I am noticing a lag of 10-15 seconds when going to /admin from the main screen. This is new for us (possible coincidence that we just moved to DO where CAN is). Still exploring, but wanted to note this new-to-us experience.
Is that as super-admin, @lauriewayne? If it is, it's a product of you now having more enterprises, which slows down the main super-admin page (the listing of enterprises). It has never been prioritised as it only affects super-admin and we don't rate :) What we all do is simply never use that page: log in as admin and then go straight to one of the other pages, /orders or /reports. I have a direct link saved that takes me into an individual shop's settings, from which I can easily switch to any other shop: https://openfoodnetwork.org.au/admin/enterprises/100-mile-foodie/edit#/primary_details
@kirstenalarsen yes it is as super-admin - this is a (perceived) change from just a few days ago vs today - I don't remember seeing the "about:blank" intermediary page ever before the last day or so.
From delivery train today:
- Possible way forward: increase the number of Unicorn workers and make it a config in ofn-install. Something to try out; maybe at least bump it to 3 workers like AU?
- Add more logs and monitoring to understand where the problem comes from.
- Bring the result of the spike to product curation so we can decide what to do next.
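The worker bump from the first point could look like the sketch below in a Unicorn config file. This is only an illustration of the idea; the env var name and the other settings are assumptions, not what ofn-install actually uses:

```ruby
# config/unicorn.rb (sketch) - make the worker count configurable,
# defaulting to 3 workers as on the AU server.
worker_processes Integer(ENV.fetch("UNICORN_WORKERS", "3"))
preload_app true   # load the app once in the master, fork workers (copy-on-write)
timeout 60         # assumed request timeout; check the deployed value
```

More workers means more concurrent requests served before Nginx starts queueing, at the cost of roughly one app's worth of RAM per worker, which is why the earlier RAM bump matters here.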
We need a few more days to compare metrics, but I'm not seeing any spike in https://app.datadoghq.com/apm/service/web-server/http_server.queue?end=1613581051593&env=production&fullscreen_end_ts=1613581064106&fullscreen_paused=false&fullscreen_section=overview&fullscreen_start_ts=1613408264106&paused=false&start=1612976251593
I'm hoping that increasing the number of workers was enough :crossed_fingers: :crossed_fingers: :crossed_fingers:
I moved to Dev Test Ready, the test being to keep an eye on it for a few more days, and if we don't see any more spikes, close the issue. Does that sound right?
Yep, I'd wait a couple more days so we cover a whole week. We applied the change late on February 16th, so let's see if the web-server service (Nginx's request queuing) stays under control and doesn't spike like it used to: https://app.datadoghq.com/apm/service/web-server/http_server.queue?end=1613981589041&env=production&fullscreen_end_ts=1613981240000&fullscreen_paused=true&fullscreen_section=overview&fullscreen_start_ts=1612771640000&paused=false&start=1612771989041
FYI - today (Feb 22) is a Monday - and Mondays, when a couple of larger hubs close and do order management, have been provocative times on OFN-CAN. So we can check around 2:00 - 5:00 EST today... I'm so excited that you guys are on this!!! Big thanks
Cool! We'll keep an eye on it. The measures we took should surely have an impact, but this won't magically solve the reports' issues.
Confirmed. We considerably mitigated the request queueing, although we still have slow requests due to running out of workers. Overall, the situation seems better.
:point_up: the bar roughly indicates the date of the change
We didn't add the Nginx logs monitoring we meant in:
- Add more logs and monitoring to understand where the problem comes from
But I think service traces provided enough.
Description
This stems from https://openfoodnetwork.slack.com/archives/CDLKH9MM0/p1612209641022100.
Although we doubled CA production's RAM from 4 GB to 8 GB on January 25th, we keep getting reports of users complaining about the slowness of the app, specifically around payments. These delays (15 s or more) make the payments screen impossible to use.
Digging a bit more, I noticed this is also affecting UK and FR, so it's an issue with the app, likely aggravated by what's being changed in https://github.com/openfoodfoundation/ofn-install/pull/700?
Investigating the traces' flame graphs should shed some light on this; see for example https://app.datadoghq.com/apm/trace/2402541302811980505?spanID=3158627724379369164&env=production&sort=time&colorBy=service&spanViewType=metadata&graphType=flamegraph&shouldShowLegend=true. It looks like an N+1, but it feels like we just endlessly instantiate `Spree::Calculator` objects :thinking:

Expected Behavior
Most of these requests should have a latency on the order of milliseconds.
Actual Behaviour
Sometimes the p50 of `Spree::Admin::PaymentsController` jumps up to ~21s, and the p95 goes to nearly 40s :scream:

UK's and FR's largest latencies go up to 22s and 10s respectively, so CA more than doubles this.
Steps to Reproduce
TBD
Animated Gif/Screenshot
The problem seems to be isolated and scoped to payments.
These latencies are a complete anomaly.
And it's affecting all requests in equal measure. The issue doesn't seem to be related to the size of the hub's data.
Response times don't even follow a known distribution.
Workaround
None
Severity
This would be an S1 because it's affecting payments, but it doesn't happen all the time and may not affect all kinds of payments on the platform, so we can leave it as an S2.