openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

Proposal: serve API product GET requests from an async server #10732

Open raphael0202 opened 3 months ago

raphael0202 commented 3 months ago

The performance issues we're currently experiencing led us to analyze which requests take the most processing time on the Apache server: https://docs.google.com/document/d/13rYXR0TxR2hUc0XEKzKcBT6ndcd5_L3yeP_L6UjZwzs/edit. The analysis revealed that facet-related queries were the most costly.

We only have 50 Apache workers, so when most of them are busy waiting for MongoDB or off-query, we can't respond to basic GET /api/v*/products/{code} queries, which only require one disk access (to fetch the sto file) and a bit of RAM to get the translations. These requests account for 15% of all requests handled by Product Opener. This route is the API endpoint most used by our own mobile app and by reusers.

My proposal would be to add a new asynchronous service (written in Python with FastAPI, for example) to handle read-only GET /api/v*/products/{code} requests.
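To make the idea concrete, here is a minimal sketch of why an event loop keeps product reads responsive while other requests wait on a slow backend. It is not the proposed implementation: it uses only the standard library's asyncio rather than FastAPI, it assumes products are stored as one JSON file per barcode (the real files are in sto format), and `get_product`, `slow_facet_query`, the file layout, the response shape and the barcode are all hypothetical.

```python
import asyncio
import json
import tempfile
from pathlib import Path


def read_product_file(path: Path) -> dict:
    """Blocking disk read; run in a worker thread so the event loop stays free."""
    return json.loads(path.read_text(encoding="utf-8"))


async def get_product(data_dir: Path, code: str) -> dict:
    """Hypothetical handler for GET /api/v*/products/{code}."""
    path = data_dir / f"{code}.json"  # assumed layout: one JSON file per barcode
    if not path.is_file():
        return {"code": code, "status": 0}  # illustrative "not found" shape
    product = await asyncio.to_thread(read_product_file, path)
    return {"code": code, "status": 1, "product": product}


async def slow_facet_query(delay: float) -> str:
    """Stand-in for a request stuck waiting on MongoDB or off-query."""
    await asyncio.sleep(delay)
    return "facet result"


async def demo() -> list[str]:
    """Start a slow query first, then a product read; record completion order."""
    finished: list[str] = []

    async def track(name: str, coro) -> None:
        await coro
        finished.append(name)

    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        # Arbitrary example barcode and payload, written only for this demo.
        (data_dir / "1234567890123.json").write_text(
            json.dumps({"product_name": "Example product"}), encoding="utf-8"
        )
        await asyncio.gather(
            track("facet", slow_facet_query(0.5)),
            track("product", get_product(data_dir, "1234567890123")),
        )
    return finished


if __name__ == "__main__":
    print(asyncio.run(demo()))  # the product read finishes before the slow query
```

The slow query is started first, yet the product read completes without waiting for it, because a pending await does not occupy a worker the way a blocked Apache process does.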

Having a distinct service that takes care of read-only API queries would make sure that our own app (or third-party apps) won't fail even if ProductOpener does. Asynchronicity means that:

The addition of knowledge panels could also be migrated to this new service later.

I think it's a better alternative to #8934, which, while faster (served directly by nginx), is more disk-hungry, won't be available for all products and doesn't play nicely with the translation of taxonomized fields.

This could also be a first step towards tackling #5170. Write queries are not very common (0.25% of the queries handled by Product Opener), and most of the complexity of the codebase comes from the data processing and score computation associated with write queries.

That's why I think it's better to keep POST queries out of the scope of this proposal for now.

Limits

This service wouldn't cover the 53% of queries that are product HTML pages. Serving these pages through this async service would be much more difficult, as it would mean migrating all the HTML logic there.

john-gom commented 3 months ago

We could potentially start storing the full Product JSON in Postgres. I did a POC on this a while ago (https://github.com/openfoodfacts/openfoodfacts-server/issues/8620). The main issue is the additional database space, but if that is OK, then having the data in a relational database would make it much easier to use languages other than Perl.

CharlesNepote commented 3 months ago

Good idea! Could we try to compare other solutions? Note I don't have a clear opinion on what's the best solution. I have tried to be objective for both solutions, but maybe I don't have sufficient knowledge to do so.

Don't hesitate to edit this table.

| Criterion | Static JSON + nginx | Async server | Comments |
| --- | --- | --- | --- |
| RAM | Winner? (but by what order of magnitude?) | - | nginx is known to be very efficient on that side, but FastAPI + PostgreSQL seems to consume little RAM for folksonomy engine (with very low traffic, that said) |
| Disk usage | 300k products x 100 KB? = 30 GB | Winner | The difference is not so big; does it really matter? All this data could be in the nginx cache |
| Performance | Clear winner (x100?) | - | Isn't it the main issue we're facing? |
| Product scope | 300k products | All products | 300k products represent 75% of all requests; probably more than 1 million products are never requested through the API |
| Functional scope | What about translations? | Clear winner | This needs to be evaluated. I don't understand the impacts. This might be the clear or even mandatory bonus for the async server. How big would a JSON with all the translations be? |
| Implementation | A few days? | ? | |
| Complexity | Winner: no new services | Needs coding or deploying (and maintaining) a new server | |
| Maintenance | ? | ? | Any idea? Not sure, but intuitively, maintaining a new server is more costly |
| Scalability | Better/easier scalability thanks to nginx | Scalability needs more code | I would say JSON + nginx is a clear winner, but this needs to be confirmed. E.g. couldn't JSON files be stored on another server, like images? |
| Resilience | Better/easier fallbacks thanks to nginx | Resilience needs more code | Same as above |
| Sustainability | A bit more technical debt in Perl | More technical debt, but in a more widespread language | |
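
As a sanity check on the disk usage row, the arithmetic can be spelled out. This is only a back-of-the-envelope sketch: the 300k product count and the 100 KB average size are the guesses from the table, not measured values, and `static_json_footprint_gb` is a name made up for this example.

```python
def static_json_footprint_gb(products: int, avg_size_kb: float) -> float:
    """Total size in decimal gigabytes (1 GB = 1,000,000 KB)."""
    return products * avg_size_kb / 1_000_000


# 300,000 products at roughly 100 KB each (both figures are estimates).
print(static_json_footprint_gb(300_000, 100))  # 30.0
```

With those inputs the static files would take about 30 GB, so the estimate is in gigabytes rather than megabytes.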