openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0
656 stars 386 forks source link

CSV/JSONL/RDF export controlled #3857

Closed CharlesNepote closed 3 years ago

CharlesNepote commented 4 years ago

Some people are yelling that the CSV/JSONL export is sometimes not complete. Indeed, it was 200 MB two days ago, 3 GB yesterday, and 140 MB today (2020-07-16). Even if we find a bug, there WILL BE other bugs, so we shall discuss how to produce a good export at least once a week. Here is my proposal (I have tried to make it simple):

  1. export to en.openfoodfacts.org.products.YYYY-MM-DD.csv
  2. control number of CSV lines is not different by 1% than the number of products
  3. if there's no difference: move the file to en.openfoodfacts.org.products.csv
  4. if there're some differences: delete the produced file and keep the last one, and send an alert to #infrastructure slack channel ; if we have alerts two or three successive days, we will be able to see if it's important to fix it
  5. the script should return a "0" value when finishing, and it should be launched by another script/system (Minion?) to catch that it ends correctly
CharlesNepote commented 4 years ago

Related script is: https://github.com/openfoodfacts/openfoodfacts-server/blob/master/scripts/export_database.pl

stephanegigandet commented 4 years ago

Still an issue on most days... :-|

/srv/off/html/data# wc -l *csv 25940 be-fr.openfoodfacts.org.products.20190520.csv 179064 en.openfoodfacts.org.products.csv 638284 en.openfoodfacts.org.products.nutriscore.csv 2 en.openfoodfacts.org.products.test.csv 1454379 fr.openfoodfacts.org.products.csv 2 fr.openfoodfacts.org.products.test.csv

svensven commented 4 years ago

does the process die with any error message?

svensven commented 4 years ago

Looking at the timestamps of the various things gen_feeds_daily_off.sh generates, assuming that is the script being used. Tonight the CSV export seems to have died about 40 mins in. And because the french exports haven't been modified for 6 days, it's died during the english phase every night, but at different times, because the size has varied.

If it's a MongoDB problem, I was wondering if either of these might help:

However, I was wondering if it was a RAM problem instead, because perl's keeping many GB of output text in memory somehow while writing the files, and getting OOM-killed.

Tonight's gen_feeds_daily_off.sh outputs:

HEAD https://cestemballepresdechezvous.fr/ | grep Last-Modified

Last-Modified: Sat, 03 Oct 2020 00:43:07 GMT

_then ./remove_emptyproducts.pl happens

for ANEXT in csv rdf csv.gz rdf.gz ; do for ALANG in en fr; do AURL="https://static.openfoodfacts.org/data/${ALANG}.openfoodfacts.org.products.${ANEXT}" ; echo $AURL ; HEAD "$AURL" | grep -P '^(Last-Modified|Content-Length)'; echo ; done ; done

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv Content-Length: 1083205324 Last-Modified: Sat, 03 Oct 2020 01:22:38 GMT

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.rdf Content-Length: 1858356145 Last-Modified: Sat, 03 Oct 2020 01:22:38 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.csv Content-Length: 3130709124 Last-Modified: Sun, 27 Sep 2020 02:28:07 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf Content-Length: 3695268947 Last-Modified: Sun, 27 Sep 2020 02:28:23 GMT

_then ./mongodbdump.sh happens (assuming that's a successful run of the full db, it only takes 7 mins! wow.)

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.rdf.gz Content-Length: 103215778 Last-Modified: Sat, 03 Oct 2020 02:35:45 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf.gz Content-Length: 245756355 Last-Modified: Sat, 03 Oct 2020 02:36:40 GMT

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv.gz Content-Length: 99864432 Last-Modified: Sat, 03 Oct 2020 02:36:56 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.csv.gz Content-Length: 303748197 Last-Modified: Sat, 03 Oct 2020 02:38:17 GMT

hangy commented 4 years ago

I'm wondering if setting the $cursor to immortal would help. The export process rubs for a while, and it could simply be a case of MongoDB killing the cursor.

The suggestion of not querying the database multiple times should help in any case. However, I don't think it should be a memory issue, because neither the file handles nor the MongoDB cursor should hold that much data in RAM.

svensven commented 4 years ago

an immortal cursor doesn't stop update_all_products.pl dying with a mongodb error, but this doesn't run as long as that does, so worth a try.

svensven commented 4 years ago

@stephanegigandet is there any log/output giving a hint why export_database.pl dies?

svensven commented 4 years ago

timings of a successful run, for reference:

HEAD https://cestemballepresdechezvous.fr/ | grep Last-Modified

Last-Modified: Sun, 04 Oct 2020 00:50:22 GMT

_then ./remove_emptyproducts.pl happens

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv Content-Length: 3106532546 Last-Modified: Sun, 04 Oct 2020 01:34:41 GMT

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.rdf Content-Length: 3702870862 Last-Modified: Sun, 04 Oct 2020 01:34:53 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.csv Content-Length: 3140404238 Last-Modified: Sun, 04 Oct 2020 02:02:08 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf Content-Length: 3702876306 Last-Modified: Sun, 04 Oct 2020 02:02:22 GMT

_then ./mongodbdump.sh happens (a lot longer this time)

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.rdf.gz Content-Length: 245874134 Last-Modified: Sun, 04 Oct 2020 02:51:04 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf.gz Content-Length: 245863838 Last-Modified: Sun, 04 Oct 2020 02:53:28 GMT

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv.gz Content-Length: 301999126 Last-Modified: Sun, 04 Oct 2020 02:54:22 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.csv.gz Content-Length: 304706946 Last-Modified: Sun, 04 Oct 2020 02:57:30 GMT

svensven commented 4 years ago

The not overwriting previous CSV/RDFs seems to be working now, at least. Sunday's results:

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv Content-Length: 3121990548 Last-Modified: Sat, 10 Oct 2020 00:57:11 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.csv Content-Length: 3156356021 Last-Modified: Sat, 10 Oct 2020 01:21:43 GMT

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.rdf Content-Length: 3709872777 Last-Modified: Sat, 10 Oct 2020 00:57:23 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf Content-Length: 3709898808 Last-Modified: Sat, 10 Oct 2020 01:21:55 GMT

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv.gz Content-Length: 303729838 Last-Modified: Sun, 11 Oct 2020 01:33:10 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.csv.gz Content-Length: 306551204 Last-Modified: Sun, 11 Oct 2020 01:35:13 GMT

https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.rdf.gz Content-Length: 246660636 Last-Modified: Sun, 11 Oct 2020 01:29:58 GMT

https://static.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf.gz Content-Length: 246650961 Last-Modified: Sun, 11 Oct 2020 01:32:18 GMT

CharlesNepote commented 3 years ago

A user is complaining by email that the export has only 1486048 products. I have verified and the last good export was on October 13th (!). It's been more than one month that the export script is failing...

charles@off1:/srv/off/html/data$ ls -l
-rw-r--r--   1 off      off            64924 Nov 17 02:18 en.openfoodfacts.org.products.bad-chars.log
-rw-r--r--   1 off      off       3129850399 Oct 13 04:09 en.openfoodfacts.org.products.csv
-rw-r--r--   1 off      root       304599475 Nov 17 03:58 en.openfoodfacts.org.products.csv.gz
-rw-r--r--   1 off      off         63871059 Nov 17 02:18 en.openfoodfacts.org.products.csv.temp

charles@off1:/srv/off/html/data$ zcat en.openfoodfacts.org.products.csv.gz | wc -l
1486048
charles@off1:/srv/off/html/data$ cat en.openfoodfacts.org.products.csv | wc -l
1486048

Should we produce the export based on openfoodfacts-products.jsonl.gz (which works), with jq tool for example? $ zcat openfoodfacts-products.jsonl.gz | jq -r '[.code,.product_name] | @csv' # output CSV data containing code,product_name The issue is that the resulting CSV will probably be different from the one produced by the script. I can try to make benchmarks.

stephanegigandet commented 3 years ago

Here is the error:


405000 
406000 
<h1>Software error:</h1>
<pre>MongoDB::DatabaseError: operation exceeded time limit</pre>
<p>
For help, please send mail to this site's webmaster, giving this error message 
and the time and date of the error.

</p>
[Wed Nov 18 11:08:21 2020] export_database.pl: MongoDB::DatabaseError: operation exceeded time limit

real    8m18.530s
user    7m5.624s
sys 0m19.180s
stephanegigandet commented 3 years ago

Trying to set a 10 times greater timeout:

my $cursor = get_products_collection(300000)->query({'code' => { "\$ne" => "" }}, {'empty' => { "\$ne" => 1 }})->fields($fields_ref)->sort({code=>1});

stephanegigandet commented 3 years ago

It worked. :)

CharlesNepote commented 3 years ago

It didn't :( Export failed again during this night. Is there another greedy process during the night, which could explain that?

-rw-r--r--   1 off      off       3252029595 Nov 18 11:45 en.openfoodfacts.org.products.csv
-rw-r--r--   1 off      root       315823506 Nov 19 03:22 en.openfoodfacts.org.products.csv.gz
-rw-r--r--   1 off      off         17197389 Nov 19 02:18 en.openfoodfacts.org.products.csv.temp
-rw-r--r--   1 off      off       3292396606 Nov 18 12:13 fr.openfoodfacts.org.products.csv
-rw-r--r--   1 off      root       319050624 Nov 19 03:25 fr.openfoodfacts.org.products.csv.gz

Weirdly, fr.openfoodfacts.org.products.csv.temp has not been produced.

svensven commented 3 years ago

Weirdly, fr.openfoodfacts.org.products.csv.temp has not been produced.

It does english csv+rdf, then fr csv+rdf. so it presumably died during english. foreach my $l ("en", "fr") {

CharlesNepote commented 3 years ago

On 2021-02-01, the export was not working since 2020-12-16.

charles@off1:~$ ls -l /srv/off/html/data/en.o*
[...]
-rw-r--r-- 1 off off  3336085268 Dec 16 13:27 /srv/off/html/data/en.openfoodfacts.org.products.csv
-rw-r--r-- 1 off root  322410524 Jan 31 04:22 /srv/off/html/data/en.openfoodfacts.org.products.csv.gz
-rw-r--r-- 1 off off    61032089 Feb  1 02:23 /srv/off/html/data/en.openfoodfacts.org.products.csv.temp

Today, 2021-02-10, after stopping all Robotoff processes 2 days ago, the export seems to work again:

charles@off1:~$ ls -l /srv/off/html/data/en.o*
[...]
-rw-r--r-- 1 off off  3552589570 Feb 10 03:56 /srv/off/html/data/en.openfoodfacts.org.products.csv
-rw-r--r-- 1 off root  339120827 Feb  9 09:51 /srv/off/html/data/en.openfoodfacts.org.products.csv.gz
CharlesNepote commented 3 years ago

Since we have migrated Robotoff from off2 to another machine, and since we also have migrated MongoDB database on NVMe disks, export is not an issue anymore.