CharlesNepote opened this issue 1 year ago
As @CharlesNepote asked me about it: it has no impact on Robotoff, which directly uses the MongoDB database plus the daily JSONL export.
There is an impact on the import_prod_data target in the Makefile (used in the daily action to update the staging MongoDB).
Also we must advertise the change on the /data page.
Otherwise, it's really cool to have it :-)
I propose to proceed in several steps:
All fine!
$ time wget https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz
--2023-01-27 16:59:39-- https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz
Resolving static.openfoodfacts.org (static.openfoodfacts.org)... 213.36.253.206
Connecting to static.openfoodfacts.org (static.openfoodfacts.org)|213.36.253.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6435921379 (6.0G) [application/octet-stream]
Saving to: 'openfoodfacts-mongodbdump.gz'
openfoodfacts-mongodbdump.gz 100%[=====================================================================================================>] 5.99G 79.0MB/s in 1m 40s
2023-01-27 17:01:18 (61.5 MB/s) - 'openfoodfacts-mongodbdump.gz' saved [6435921379/6435921379]
real 1m39.910s
$ wget https://static.openfoodfacts.org/data/gz-sha256sum
$ time sha256sum --check gz-sha256sum
openfoodfacts-mongodbdump.gz: OK
real 0m29.388s
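As an aside, the producer side of this check is symmetric: the gz-sha256sum manifest is generated with the same sha256sum tool that verifies it. A minimal sketch using a small stand-in file (illustrative only, not the real 6 GB dump):

```shell
# Illustrative stand-in for the real dump file.
printf 'dummy dump data\n' > sample-dump.gz

# Producer side: write the manifest that the dump script would publish.
sha256sum sample-dump.gz > gz-sha256sum

# Consumer side: verify, exactly as in the transcript above.
sha256sum --check gz-sha256sum
```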
$ time mongorestore --drop --gzip --archive="openfoodfacts-mongodbdump.gz"
2023-01-27T17:43:08.031+0000 preparing collections to restore from
2023-01-27T17:43:08.067+0000 reading metadata for off.products from archive 'openfoodfacts-mongodbdump.gz'
2023-01-27T17:43:08.069+0000 dropping collection off.products before restoring
2023-01-27T17:43:08.153+0000 restoring off.products from archive 'openfoodfacts-mongodbdump.gz'
2023-01-27T17:43:10.982+0000 off.products 162MB
[...]
2023-01-27T18:23:08.270+0000 2784589 document(s) restored successfully. 0 document(s) failed to restore.
real 40m0.322s
I have made other tests: it's working perfectly well.
We have to pay attention to #8050, to make sure it does not impact the global export workflow too much.
@alexgarel: you mentioned:
There is an impact on the import_prod_data target in the Makefile (used in the daily action to update the staging MongoDB).
Could you do it? I'm not very comfortable with makefiles... It seems the GitHub action starts at 00:00, so it uses the archive from the previous day and you don't need to worry about the file's creation time (see #8050).
When it's done, I suggest we test it during a few days before going further.
This issue has been open 90 days with no activity. Can you give it a little love by linking it to a parent issue, adding relevant labels and projects, creating a mockup if applicable, adding code pointers from https://github.com/openfoodfacts/openfoodfacts-server/blob/main/.github/labeler.yml, giving it a priority, editing the original issue to have a more comprehensive description… Thank you very much for your contribution to 🍊 Open Food Facts
@CharlesNepote @alexgarel I think it's time to remove the old mongodb dump: https://github.com/openfoodfacts/openfoodfacts-server/pull/9946
The MongoDB dump weighs 39 GB+, while the compressed file weighs 6 GB+.
Currently, it is necessary to uncompress the compressed file before restoring it. Thus a simple restoration of the Open Food Facts DB needs (6 + 39) + 39 = 84 GB of disk space.
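Spelling out the disk-space arithmetic (sizes in GB, rounded as in this thread):

```shell
# Approximate sizes in GB, taken from the measurements above.
compressed=6      # gzipped dump as downloaded
uncompressed=39   # extracted dump/ directory
restored=39       # data re-ingested into MongoDB

# Current workflow: download + extracted dump + restored DB coexist on disk.
echo "current workflow: $(( (compressed + uncompressed) + restored )) GB"

# With --gzip --archive, the extraction step (and its 39 GB) disappears.
echo "archive workflow: $(( compressed + restored )) GB"
```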
But MongoDB can dump and restore compressed files without uncompressing them, thanks to the --gzip and --archive arguments.
We might use it as:
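A sketch of the two commands (shape inferred from the benchmarks below; database off, collection products):

```shell
# Dump straight into a gzipped archive, no intermediate dump/ directory:
mongodump --db off --collection products --gzip --archive="openfoodfacts-mongodbdump.gz"

# Restore directly from the compressed archive, no prior gunzip:
mongorestore --drop --gzip --archive="openfoodfacts-mongodbdump.gz"
```

Both commands need a running MongoDB instance, so treat this as a recipe rather than something to copy-paste blindly.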
We just have to change one line in mongodb_dump.sh.
Here are some benchmarks.
$ time tar cvfz mongodbdump.tar.gz dump
dump/
dump/off/
dump/off/products.metadata.json
dump/off/products.bson

real 20m16.142s
user 18m57.279s
sys 1m47.331s
=> ~23 minutes
$ ls -la
-rw-r--r-- 1 root root 6535232274 Jan 7 22:52 mongodbdump.tar.gz
$ time mongorestore --drop ./dump
2023-01-09T07:53:50.376+0000 preparing collections to restore from
2023-01-09T07:53:50.376+0000 reading metadata for off.products from dump/off/products.metadata.json
2023-01-09T07:53:50.379+0000 dropping collection off.products before restoring
2023-01-09T07:53:50.464+0000 restoring off.products from dump/off/products.bson
[...]
2023-01-09T08:29:57.672+0000 2742914 document(s) restored successfully. 0 document(s) failed to restore.

real 36m7.309s
user 2m53.516s
sys 1m41.033s
$ mongodump --collection products --db off --gzip --archive="mongodump-test-db.gz"
15:37:19 writing off.products to archive 'mongodump-test-db'
16:00:18 done dumping off.products (2742914 documents)
=> ~23 minutes

$ ls -la
-rw-r--r-- 1 root root 6484659291 Jan 7 16:00 mongodump-test-db.gz
$ time mongorestore --drop --collection products --db off --gzip --archive="mongodump-test-db.gz"
2023-01-08T15:48:27.545+0000 The --db and --collection flags are deprecated for this use-case; please use --nsInclude instead, i.e. with --nsInclude=${DATABASE}.${COLLECTION}
[...]
2023-01-08T16:25:18.788+0000 2742914 document(s) restored successfully. 0 document(s) failed to restore.

real 36m51.258s
user 7m57.690s
sys 1m1.246s