openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
GNU Affero General Public License v3.0
632 stars 370 forks source link

Consider moving to a relational data model, like Postgres #8620

Open john-gom opened 1 year ago

john-gom commented 1 year ago

Problem

Currently the OFF data is in a lot of different places (taxonomy files, MongoDB, STO files) which makes it difficult to perform queries across the data sets.

Aggregated queries against MongoDB are also very slow and the author feels these would be considerably faster against a relational model

Proposed solution

Move to a relational model

### Tasks
- [x] Create a data model and import taxonomies
- [x] Import prodcts as JSONB initially
- [x] Create GraphQL API for queries (Postgraphile)
- [x] Create a partial relational model for Products
- [ ] Consider a relational model for external data (AGRIBALYSE and CIQUAL) and import data
- [ ] Create relationships between taxonomies, products and external data
- [x] Implement full text search
- [x] Test some example aggregations for performance
- [ ] Clean up taxonomy id normalisation
- [ ] Ability to re-export as taxonomy files
- [ ] Validate existing taxonomies
- [ ] Develop a migration plan

Part of

john-gom commented 1 year ago

POC repo is https://github.com/john-gom/openfoodfacts-data

Using NestJS as a general framework with Mikro-ORM for data modelling / migrations and Postgraphile for GraphQL support

github-actions[bot] commented 9 months ago

This issue has been open 90 days with no activity. Can you give it a little love by linking it to a parent issue, adding relevant labels and projets, creating a mockup if applicable, adding code pointers from https://github.com/openfoodfacts/openfoodfacts-server/blob/main/.github/labeler.yml, giving it a priority, editing the original issue to have a more comprehensive description… Thank you very much for your contribution to 🍊 Open Food Facts

john-gom commented 7 months ago

I've created a script to load all product "sto" files into Postgres. Branch is issues/8620-a.

john-gom commented 7 months ago

Some products contain \u0000 in the data which is not compatible with postgres. SQL to fix was:

update revision set data = replace(data::text,'\u0000','')::json  
where code in ('04810513','3770001905075','4779030380333','4840237001946','6909995101119','7501058623256','7702024040040','7798305866775','7801620005191','7895000467013','7898142862043','8015057004453','8412600017975','9300617296614','9557789820127')
and data::text like '%\\u0000%';
john-gom commented 7 months ago

An example database has been uploaded here: https://static.openfoodfacts.org/data/pg/products.dmp

This can be restored using pg_restore

damil commented 5 months ago

That's a very interesting proposal.

Although I still haven't understood how OFF is organized, I definitely have the feeling that a regular relational model could bring many benefits like :

From what I heard during the march 2024 hackaton, there are some recursive relations within the data -- but that's not a hindrance : most database management systems support Common Table Expressions, which is the SQL way for expressing queries on recursive data.

So thanks for your work, I'm eager to look at your postgres data.