Is there a way to reduce the size of the database ?

openfoodfacts / openfoodfacts-query

This extracts key product data from MongoDB into a Postgres database to speed up OFF

GNU Affero General Public License v3.0

0 stars 0 forks source link

Is there a way to reduce the size of the database ? #39

Closed teolemon closed 1 week ago

teolemon commented 3 months ago

Question

Is there a way to reduce the size of the database
It's around 200GB currently, and we're at 95%+ disk usage

cquest commented 1 month ago

On off1, the current PG db size is 89GB uncompressed, 28GB compressed by ZFS.

Half of the DB storage comes from indices (44GB).

Storage space can be saved by:

replacing uuid (16 bytes) by postgresl sequences (8 bytes)
in the product_xxx_tag tables, tag values are stored as plain text, they could be stored in a separate table containing a list of tags, with N-N relation with the product table. Each row could be 24 bytes only

I did a quick test on the product_state_tag table (the largest one):

table data is reduced from 4.5GB to 2.3GB (almost 50% saved)
indices are down from 7.5GB to 3GB (60% less)

cquest commented 1 month ago

uuid (16 bytes) vs PG sequences (8 bytes) make indices larger by 1GB

Regarding query execution time:

select value, count(*) from product_states_tag group by 1; takes 15.8s

Its equivalent with values stored in a separate table takes 12.5s

select value, nb from (select tid, count(*) as nb from test_state_tag group by 1) t join test_state_tags v on (id=tid);

john-gom commented 1 month ago

Thanks @cquest. I am currently looking into loading taxonomies into off-query to allow us to join to these to get translations, which would avoid the need for off-server to translate the tags after performing the query.

The main issue with this, however, is that many products contain tags that don't have corresponding taxonomy entries, so we would need to create missing tags on-the-fly. Not difficult, but could impact loading speed.

Another thing I have considered is moving the JSON data into a separate "staging" table separate from the relational model. That way we would be able to periodically delete this data. I'm also wondering whether having the large JSON column in a separate table might improve query performance on the relational fields

john-gom commented 1 month ago

Sorry, 2 more questions:

When you say to use "sequences" are you thinking of bigserial?
Do you think storing the values in separate tables will be still worth doing for tables with large numbers of distinct values, such as contributors or codes? It might be easier to code from a consistency perspective, but I don't want to make things worse for these scenarios

cquest commented 1 month ago

For the SERIAL sequences are 4 bytes, which should be enough (billions).

Regarding large numbers of tags like editors, some cleaning maybe be use on the yuka "editors" to merge them if this is not an issue for the facets.

john-gom commented 1 month ago

I think after this I will move the tags that have related taxonomies to fetch values from a separate table (which will ultimately become the full taxonomy table). I may leave the values that are more ad-hoc (like contributors) as they are.