openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

Allow to restore MongoDB dump without uncompressing it #7962

Open CharlesNepote opened 1 year ago

CharlesNepote commented 1 year ago

The MongoDB dump weighs 39 GB+, while the compressed file weighs 6 GB+.

Currently, the compressed file has to be uncompressed before it can be restored. Thus, a simple restoration of the Open Food Facts DB needs (6 + 39) + 39 = 84 GB of disk space.
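The disk budget can be checked with a few lines of shell arithmetic. This is only a sketch: the sizes are the rounded figures quoted above, the variable names are made up for illustration, and reading the second 39 GB as the size of the restored database is an assumption.

```shell
# Rounded sizes in GB, taken from the figures quoted above.
archive_gb=6   # compressed dump file
dump_gb=39     # uncompressed dump
db_gb=39       # assumption: the restored database is about as large as the dump

# Current workflow: archive and uncompressed dump on disk, plus the restored DB.
current_gb=$(( (archive_gb + dump_gb) + db_gb ))

# If the archive is restored directly, the uncompressed dump is never written.
direct_gb=$(( archive_gb + db_gb ))

echo "current: ${current_gb} GB, direct restore: ${direct_gb} GB"
```

Under that assumption, skipping the uncompress step would save the full size of the uncompressed dump.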

But MongoDB can dump and restore compressed files without uncompressing them, thanks to the --gzip and --archive options.

mongodump --gzip --archive="mongodump-test-db.gz" --db=off --collection products
mongorestore --gzip --archive="mongodump-test-db.gz"
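As a small safety net around the restore command, one could check that the file really is gzip-compressed before passing it to mongorestore with --gzip. The sketch below is not part of the proposal; the function names `is_gzip` and `restore_archive` are hypothetical, and the check only looks at the two gzip magic bytes rather than validating the whole file.

```shell
# Return 0 if the file starts with the gzip magic bytes (1f 8b).
is_gzip() {
  [ "$(od -An -tx1 -N2 "$1" 2>/dev/null | tr -d '[:space:]')" = "1f8b" ]
}

# Restore an archive produced by `mongodump --gzip --archive=...`,
# refusing files that are missing or not gzip-compressed.
restore_archive() {
  archive="$1"
  if ! is_gzip "$archive"; then
    echo "not a gzip archive: $archive" >&2
    return 1
  fi
  mongorestore --gzip --archive="$archive"
}
```

Usage would be `restore_archive mongodump-test-db.gz`. The magic-byte test is cheap because it reads two bytes, whereas a full `gzip -t` would have to read the whole 6 GB file.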

We might use it as:

We just have to change one line in mongodb_dump.sh.

Here are some benchmarks.

  1. Current situation.
    
    $ time mongodump --collection products --db off
    16:18:50.823+0000   writing off.products to dump/off/products.bson
    16:21:12.286+0000   done dumping off.products (2742914 documents)
    real    2m21.575s
    user    0m30.970s
    sys 0m57.142s

    $ time tar cvfz mongodbdump.tar.gz dump
    dump/
    dump/off/
    dump/off/products.metadata.json
    dump/off/products.bson
    ^X
    real    20m16.142s
    user    18m57.279s
    sys     1m47.331s
    => ~23 minutes

    $ ls -la
    -rw-r--r-- 1 root root 6535232274 Jan 7 22:52 mongodbdump.tar.gz

    $ time mongorestore --drop ./dump
    2023-01-09T07:53:50.376+0000 preparing collections to restore from
    2023-01-09T07:53:50.376+0000 reading metadata for off.products from dump/off/products.metadata.json
    2023-01-09T07:53:50.379+0000 dropping collection off.products before restoring
    2023-01-09T07:53:50.464+0000 restoring off.products from dump/off/products.bson
    [...]
    2023-01-09T08:29:57.672+0000 2742914 document(s) restored successfully. 0 document(s) failed to restore.

    real    36m7.309s
    user    2m53.516s
    sys     1m41.033s


  2. Using --gzip and --archive.

    $ mongodump --collection products --db off --gzip --archive="mongodump-test-db.gz"
    15:37:19 writing off.products to archive 'mongodump-test-db'
    16:00:18 done dumping off.products (2742914 documents)
    => ~23 minutes

    $ ls -la
    -rw-r--r-- 1 root root 6484659291 Jan 7 16:00 mongodump-test-db.gz

    $ time mongorestore --drop --collection products --db off --gzip --archive="mongodump-test-db.gz"
    2023-01-08T15:48:27.545+0000 The --db and --collection flags are deprecated for this use-case; please use --nsInclude instead, i.e. with --nsInclude=${DATABASE}.${COLLECTION}
    [...]
    2023-01-08T16:25:18.788+0000 2742914 document(s) restored successfully. 0 document(s) failed to restore.

    real    36m51.258s
    user    7m57.690s
    sys     1m1.246s
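To compare the two runs more easily, the `real` times can be converted to seconds and combined. This is just a sketch for reading the figures above: `to_seconds` is a hypothetical helper, and the durations are copied from the two benchmark runs.

```shell
# Convert a `time`-style duration such as "36m7.309s" to seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

dump_s=$(to_seconds "2m21.575s")              # mongodump to ./dump
tar_s=$(to_seconds "20m16.142s")              # tar cvfz of ./dump
restore_dir_s=$(to_seconds "36m7.309s")       # mongorestore from ./dump
restore_gz_s=$(to_seconds "36m51.258s")       # mongorestore --gzip --archive

# Total time of dump + compression in the current workflow, in minutes.
awk -v d="$dump_s" -v t="$tar_s" \
  'BEGIN { printf "dump + tar: %.1f min\n", (d + t) / 60 }'

# Extra restore time when reading the gzip archive directly, in seconds.
awk -v a="$restore_dir_s" -v b="$restore_gz_s" \
  'BEGIN { printf "restore overhead with --gzip: %.0f s\n", b - a }'
```

The first figure should match the "~23 minutes" noted for the current workflow, which is also roughly what the timestamps of the --gzip --archive dump show.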



@stephanegigandet, @alexgarel, @hangy, @syl10100, @cquest ?

raphael0202 commented 1 year ago

Since @CharlesNepote asked me about it: this has no impact on Robotoff, which directly uses the MongoDB database and the daily JSONL export.

alexgarel commented 1 year ago

There is an impact on the import_prod_data target in the Makefile (used in the daily action to update the staging MongoDB).

Also, we must announce the change on the /data page.

Otherwise, it's really cool to have it :-)

CharlesNepote commented 1 year ago

I propose to make it in several steps:

CharlesNepote commented 1 year ago

All fine!

$ time wget https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz
--2023-01-27 16:59:39--  https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz
Resolving static.openfoodfacts.org (static.openfoodfacts.org)... 213.36.253.206
Connecting to static.openfoodfacts.org (static.openfoodfacts.org)|213.36.253.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6435921379 (6.0G) [application/octet-stream]
Saving to: 'openfoodfacts-mongodbdump.gz'

openfoodfacts-mongodbdump.gz                    100%[=====================================================================================================>]   5.99G  79.0MB/s    in 1m 40s  

2023-01-27 17:01:18 (61.5 MB/s) - 'openfoodfacts-mongodbdump.gz' saved [6435921379/6435921379]

real    1m39.910s

$ wget https://static.openfoodfacts.org/data/gz-sha256sum

$ time sha256sum --check gz-sha256sum
openfoodfacts-mongodbdump.gz: OK

real    0m29.388s

$ time mongorestore --drop --gzip --archive="openfoodfacts-mongodbdump.gz"
2023-01-27T17:43:08.031+0000    preparing collections to restore from
2023-01-27T17:43:08.067+0000    reading metadata for off.products from archive 'openfoodfacts-mongodbdump.gz'
2023-01-27T17:43:08.069+0000    dropping collection off.products before restoring
2023-01-27T17:43:08.153+0000    restoring off.products from archive 'openfoodfacts-mongodbdump.gz'
2023-01-27T17:43:10.982+0000    off.products  162MB
[...]
2023-01-27T18:23:08.270+0000    2784589 document(s) restored successfully. 0 document(s) failed to restore.

real    40m0.322s
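The three steps of this session (download, checksum verification, restore) could be chained in a small script so that the restore only starts if the checksum matches. This is a sketch, not the project's actual tooling: the URLs and file names are the ones used above, while the `run` and `restore_off_dump` helpers and the `DRY_RUN` switch are hypothetical.

```shell
BASE_URL="https://static.openfoodfacts.org/data"
ARCHIVE="openfoodfacts-mongodbdump.gz"
CHECKSUM="gz-sha256sum"

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Download the archive and its checksum file, verify, then restore.
# Each step only runs if the previous one succeeded, so a corrupted
# download never reaches mongorestore.
restore_off_dump() {
  run wget -q "$BASE_URL/$ARCHIVE" -O "$ARCHIVE" &&
    run wget -q "$BASE_URL/$CHECKSUM" -O "$CHECKSUM" &&
    run sha256sum --check "$CHECKSUM" &&
    run mongorestore --drop --gzip --archive="$ARCHIVE"
}
```

Running `DRY_RUN=1; restore_off_dump` prints the four commands without touching the network or the database, which is a cheap way to review them first.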

CharlesNepote commented 1 year ago

I have run further tests: it works perfectly well.

We need to keep an eye on #8050 to make sure it does not affect the overall export workflow too much.

CharlesNepote commented 1 year ago

@alexgarel: you mentioned:

"There is an impact on the import_prod_data target in the Makefile (used in the daily action to update the staging MongoDB)."

Could you do it? I'm not very comfortable with Makefiles... It seems that the GitHub action starts at 00:00, so it uses the archive from the previous day and you don't need to worry about the creation time of the file (see #8050).

When it's done, I suggest we test it for a few days before going further.

github-actions[bot] commented 11 months ago

This issue has been open 90 days with no activity. Can you give it a little love by linking it to a parent issue, adding relevant labels and projects, creating a mockup if applicable, adding code pointers from https://github.com/openfoodfacts/openfoodfacts-server/blob/main/.github/labeler.yml, giving it a priority, editing the original issue to have a more comprehensive description… Thank you very much for your contribution to 🍊 Open Food Facts

stephanegigandet commented 8 months ago

@CharlesNepote @alexgarel I think it's time to remove the old mongodb dump: https://github.com/openfoodfacts/openfoodfacts-server/pull/9946