openfoodfoundation / openfoodnetwork

Connect suppliers, distributors and consumers to trade local produce.
https://www.openfoodnetwork.org
GNU Affero General Public License v3.0
1.12k stars 723 forks source link

Cannot search and select products to add to Subscriptions #12714

Closed BethanOFN closed 2 months ago

BethanOFN commented 3 months ago

Description

User flagged that whilst trying to create a new subscription, the required products were not appearing in the dropdown when searched and the screen stalled on an infinite loader. As such, they could not create the subscription. When testing this I replicated it on my end both with a new subscription and trying to add new items to an existing subscription.

Expected Behavior

When a product name is searched for, all products containing that name should appear in the dropdown to allow selection for the subscription

Actual Behaviour

When creating a new subscription or editing an existing subscription, no products are returned when a search term is entered. This means that subscriptions cannot be used as the dropdown only displays 'searched for' products, not all products.

Steps to Reproduce

  1. Go to Subscriptions
  2. Create a new subscription/edit an existing subscription
  3. Search for an existing product
  4. Observe infinite loader

Animated Gif/Screenshot

See 500 error in GIF below when searching product: Subscriptions not finding product

Workaround

No workaround - I have tried as superadmin and also from my very small test shop (as identifying user is a large hub) and nothing has worked

Severity

bug-s2

Your Environment

Possible Fix

EdwardLi-coder commented 3 months ago

Hi @BethanOFN . I want to resolve this error,Can you provide more info about this?Like the response?

chahmedejaz commented 3 months ago

Hi @EdwardLi-coder - Thanks for your interest in this issue. I don't think the network response would be of much help here as it's a server side error. It would be great if we could reproduce this in local environment and then analyze the error on the server side. Please let me know if you have any question. Thanks.

EdwardLi-coder commented 3 months ago

Hi @EdwardLi-coder - Thanks for your interest in this issue. I don't think the network response would be of much help here as it's a server side error. It would be great if we could reproduce this in local environment and then analyze the error on the server side. Please let me know if you have any question. Thanks.

Due to the 500 error response, I believe this might contain information about it. I attempted to reproduce this error on my local computer but was unable to do so.

dacook commented 3 months ago

We track all application errors with Bugsnag, so I checked there and found that it only happened during one time:

Image (https://app.bugsnag.com/yaycode/openfoodnetwork-uk/errors?filters[event.since]=30d&filters[search]=search.json&sort=last_seen)

So I think it's safe to say

So I think we can close this issue

BethanOFN commented 3 months ago

Still seeing infinite loading and failure to return products, so I'm reopening - no longer a 500 error, now it's no connection to server:

Screenshot 2024-08-13 at 12 28 27
chahmedejaz commented 3 months ago

Still seeing infinite loading and failure to return products, so I'm reopening - no longer a 500 error, now it's no connection to server:

Screenshot 2024-08-13 at 12 28 27

Hi @BethanOFN - Just to be sure it's not browser related, can you please validate the same case in Chrome as well? Thanks.

BethanOFN commented 3 months ago

@chahmedejaz sure 👍 on chrome:

Screenshot 2024-08-13 at 13 50 45
chahmedejaz commented 3 months ago

Thanks for the confirmation, Bethan. Looking at the errors, seems like they are on the server level and not on the code level. Any of @openfoodfoundation/core-devs may look into it in this case. Thanks.

By the way, UK staging is also giving this error and it's not even opening the site.

dacook commented 2 months ago

I'll look into this and see what I can find.

By the way, UK staging is also giving this error and it's not even opening the site.

Yes, I am working on that and broke it.. 😞

dacook commented 2 months ago

I found a correlation between using this subscriptions search, and an increased load on the server, which led to the 504 timeout errors. Here are some examples:

ofn-admin@production18:/var/log/nginx$ grep -P -o '13/Aug/2024:11:.*search.json.*?" '  access.log
13/Aug/2024:11:17:28 +0000] "GET /admin/variants/search.json?q=kefir&distributor_id=160&eligible_for_subscriptions=true&include_out_of_stock=&_=1723547826470 HTTP/2.0"
...
13/Aug/2024:11:22:27 +0000] "GET /admin/variants/search.json?q=kefir&distributor_id=160&eligible_for_subscriptions=true&include_out_of_stock=&_=1723548065530 HTTP/2.0"
...
13/Aug/2024:11:29:02 +0000] "GET /admin/variants/search.json?q=goats+milk+kefir&distributor_id=160&eligible_for_subscriptions=true&include_out_of_stock=&_=1723548065535 HTTP/2.0"
... (total 20 over a period of 20 minutes)
ofn-admin@production18:/var/log/nginx$ grep  '2024/08/13 11:'  error.log
2024/08/13 11:22:27 [error] 1636#1636: *501596 upstream prematurely closed connection while reading response header from upstream, client: 92.14.44.0, server: openfoodnetwork.org.uk, request: "GET /admin/variants/search.json?q=kefir&distributor_id=160&eligible_for_subscriptions=true&include_out_of_stock=&_=1723548065530 HTTP/2.0"
...
2024/08/13 11:29:02 [error] 1634#1634: *501844 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 92.14.44.0, server: openfoodnetwork.org.uk, request: "GET /admin/variants/search.json?q=goats+milk+kefir&distributor_id=160&eligible_for_subscriptions=true&include_out_of_stock=&_=1723548065535 HTTP/2.0"

Screenshot 2024-08-14 at 11 33 05 am (ref)(time is reported in different timezone, just look at the minutes)

Given the sustained high CPU usage, I wondered if it was due to a report. But I could find no reports were generated at that time, and no long-running jobs in sidekiq.

I also noticed calls to /api/v0/order_cycles/*/products.json at this time, but it seems normal. There were 313 requests made over that hour.

So I'm guessing that search.json is a very intensive process. It's possible that it takes so long that people retry, which can exacerbate the problem. I had a look at the code, it is surprisingly convoluted. I'm unclear on exactly how it works, but it's possible the query is not scoped to the distributor, and therefore it could be processing large numbers of variant records unnecessarily.

dacook commented 2 months ago

Ok, yep it's the search process. I can see it timing out after the allotted 2 minutes, each time it is called. For example: (thanks @rioug for adding the request ID to the logs, this was really helpful!)

ofn-admin@production18:/home/openfoodnetwork/apps/openfoodnetwork/current/log$ egrep "(Completed 504 Gateway Timeout)|(search.json)" production.log

I, [2024-08-13 11:17:20 #7294]  INFO -- : [148e245e-bfaf-441f-8959-a1b088166ace] Started GET "/admin/variants/search.json?q=kefir&distributor_id=160&eligible_for_subscriptions=true&include_out_of_stock=&_=1723547826470" for 92.14.44.0 at 2024-08-13 11:17:20 +0000
...
I, [2024-08-13 11:19:20 #7294]  INFO -- : [148e245e-bfaf-441f-8959-a1b088166ace] Completed 504 Gateway Timeout in 119995ms (Views: 0.6ms | ActiveRecord: 116259.6ms | Allocations: 10756109)

I think the first thing to do is to try optimising it.

dacook commented 2 months ago

FYI @BethanOFN I believe this is performing so poorly due to the immense number of Enterprise Permissions granted to the hub (Tamar Food Valley). I don't know if it's possible, but if there's any old ones that you can remove, it might help.

dacook commented 2 months ago

Unfortunately I'm having trouble making any meaningful optimisations at this stage.

I've tried this search.json command on my computer, using a copy of the same database, and although it's not fast, it's performing relatively well. (compared to loading any page as a super admin which is excruciatingly slow... I think due to the large number of hubs and order cycles on the system)

So I'm actually not sure now why this would be causing the issue. 🤔

dacook commented 2 months ago

I think I will look again at what changes in Product Refactor could have caused it. Hmm or if it's worth increasing server resources.

BethanOFN commented 2 months ago

FYI @BethanOFN I believe this is performing so poorly due to the immense number of Enterprise Permissions granted to the hub (Tamar Food Valley). I don't know if it's possible, but if there's any old ones that you can remove, it might help.

Thanks @dacook, yes they're the biggest hub on UK Prod! I'll pull up a query for any outdated permissions and see what I can do

dacook commented 2 months ago

I tried to replicate the issue on uk_prod (as superuser), but wasn't able to.

Increasing server CPU or memory would probably help a bit, but as I can't replicate on prod, I can't really say for sure.

dacook commented 2 months ago

@BethanOFN sorry for the delay on this one. I'm just wondering if this is still an issue for you?

BethanOFN commented 2 months ago

@dacook sorry been on leave! I ran a query on enterprise permissions and sent it to the user with suggested deletions for inactive enterprises - deleting a lot of them resolved the issue for the user, so this can be closed I'd say given no one else has reported it and they are the biggest user on UK instance

dacook commented 2 months ago

Hope you had a good break! No worries, glad it's no longer an issue.