Generate on-demand miniaturized dump of the database and images for an offline mode

teolemon commented 2 years ago

Who for

Mobile devs

What

### Tasks
- [ ] Create an API to generate an on-demand miniaturized dump of the database for an offline mode
  - [ ] Generate an on-demand miniaturized dump of the database for an offline mode
  - [ ] Allow slicing the on-demand miniaturized dump by country, popularity, stores, language, and required fields
- [ ] Create an API to generate an on-demand Zip of the smallest-resolution front images, based on the same slicing as the offline dump
  - [ ] Zip the smallest-resolution front images, based on the same slicing

Why

There's no network in many shops and supermarkets, including the one the permanent team went to during Off au Vert 2024.
To replace the custom slicing that we currently do (one size fits all for the world, with Nutri-Score, NOVA, Eco-Score, product name and brand) with something smaller, while improving the user experience thanks to the reclaimed storage (for instance by adding full knowledge panels, or a miniature of the product).

Part of
https://github.com/openfoodfacts/smooth-app/issues/2444
6988

stephanegigandet commented 2 years ago

Generating that kind of data for 1000 or more products cannot be on demand, but we can pre-compute it for all countries x their official languages instead.

In fact, we already have an on-demand option: the search API (https://fr.openfoodfacts.org/api/v2/search?fields=code,product_name&page_size=1000). But that query takes 10 seconds of heavy load on the server, so we could run it when users specifically ask for it in a menu, but certainly not at app initialization, where a pre-computed dump would make much more sense.

Knowledge panels are too big for an offline dump.

stephanegigandet commented 2 years ago

So, to experiment on mobile: just use the search API. Once we're happy with it, we can generate 10k dumps for all countries, with results in exactly the same format as the search API.
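For anyone experimenting along these lines, here is a minimal sketch of building paged search URLs from the query quoted above. The host, path, `fields` and `page_size` come from that URL; the `page` parameter and the `search_url` helper name are assumptions made for illustration, not something confirmed in this thread.

```python
from urllib.parse import urlencode

# Host and path taken from the search URL quoted earlier in the thread
BASE_URL = "https://fr.openfoodfacts.org/api/v2/search"


def search_url(fields, page_size=1000, page=1, base_url=BASE_URL):
    """Build a search URL that returns only the requested fields (hypothetical helper)."""
    params = {
        "fields": ",".join(fields),
        "page_size": page_size,
        "page": page,  # assumption: result pages are selected with `page`
    }
    # safe="," keeps the comma-separated field list unescaped, as in the quoted URL
    return f"{base_url}?{urlencode(params, safe=',')}"


first_page = search_url(["code", "product_name"])
second_page = search_url(["code", "product_name"], page=2)
print(first_page)
```

Given the server load mentioned above, such a request would presumably be triggered only when the user explicitly asks for it, not at app start.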

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity.

teolemon commented 2 years ago

Probably solved by a proper implementation of the Search V2 API in the Dart package.

https://github.com/openfoodfacts/openfoodfacts-dart/issues/515

CharlesNepote commented 1 year ago

If images don't matter (as is often the case), CSV exports are very easy to do with Mirabelle (the limit is 1,000,000 lines, but we can change it); see below.

I understand the use cases that need images, but is it reasonable to add dozens of MB on a smartphone just to make the search nicer?

Export specific data in CSV with Mirabelle and SQL

1 -- Build your query (or ask someone to build it for you). E.g. all German products that have been scanned at least once.

-- Products from Germany that have been scanned at least one time
select code, product_name from [all]
where countries_en like "%germany%" and unique_scans_n is not null
order by unique_scans_n desc
-- the limit here displays 20 results; remove it or comment it with "--" when you build your CSV export
limit 20

https://mirabelle.openfoodfacts.org/products?sql=--+Products+from+Germany+that+have+been+scanned+at+least+one+time%0D%0Aselect+code%2C+product_name+from+%5Ball%5D%0D%0Awhere+countries_en+like+%22%25germany%25%22+and+unique_scans_n+is+not+null%0D%0Aorder+by+unique_scans_n+desc%0D%0A--+the+limit+here+displays+20+results%3B+remove+it+or+comment+it+with+%22--%22+when+you+build+your+CSV+export%0D%0Alimit+20

2 -- Copy the "CSV" link on the result page.

3 -- If necessary, edit the link to remove the "limit+20" part and get all the products. E.g. (don't click this link unless you want to download 90,000+ products): https://mirabelle.openfoodfacts.org/products.csv?sql=--+Products+from+Germany+that+have+been+scanned+at+least+one+time%0D%0Aselect+code%2C+product_name+from+%5Ball%5D%0D%0Awhere+countries_en+like+%22%25germany%25%22+and+unique_scans_n+is+not+null%0D%0Aorder+by+unique_scans_n+desc%0D%0A--+the+limit+here+displays+20+results%3B+remove+it+or+comment+it+with+%22--%22+when+you+build+your+CSV+export%0D%0A&_size=max

Now you can use this link to download the CSV with your favourite tool (wget, curl, web browser, etc.).
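The same steps can be scripted. Below is a minimal sketch that builds the CSV link from a SQL query and parses the downloaded text. The `products.csv` path and the `sql` and `_size=max` parameters are copied from the link above; the helper names and the sample CSV row are made up for illustration, and the actual download is left as a comment so the sketch runs offline.

```python
import csv
import io
from urllib.parse import urlencode

# Endpoint taken from the export link above
MIRABELLE_CSV = "https://mirabelle.openfoodfacts.org/products.csv"


def csv_export_url(sql):
    """Return the CSV export link for a SQL query, without the 20-row limit."""
    return f"{MIRABELLE_CSV}?{urlencode({'sql': sql, '_size': 'max'})}"


def parse_csv(text):
    """Parse CSV text into dicts; values stay strings, so barcodes keep leading zeros."""
    return list(csv.DictReader(io.StringIO(text)))


query = (
    "select code, product_name from [all]\n"
    'where countries_en like "%germany%" and unique_scans_n is not null\n'
    "order by unique_scans_n desc"
)
url = csv_export_url(query)
# To download for real: urllib.request.urlopen(url).read().decode("utf-8")

sample = "code,product_name\n0123456789012,Example product\n"  # made-up row
rows = parse_csv(sample)
```

Keeping barcodes as strings matters here, since converting them to integers would silently drop leading zeros.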

teolemon commented 1 year ago

@CharlesNepote @stephanegigandet we need the Knowledge Panel data, so Mirabelle is not really an option.
@stephanegigandet just told me that the API, whether by batches of 100 or 1000, is not an option either, since it will crash the server.
Current PR by @monsieurtanuki: https://github.com/openfoodfacts/smooth-app/pull/4131

monsieurtanuki commented 1 year ago

@teolemon Not sure what you're worrying about, as we already download the top 1k products in just one shot in Smoothie (without KP). If we can manage to extract just the barcodes of the top 10k products, we can loop the product download on a selection of 1k barcodes each time.

stephanegigandet commented 1 year ago

@monsieurtanuki Which fields do you need in the dump? Attributes but not knowledge panels? Would this be only to show the scan card, with the product page opened through a live query?

monsieurtanuki commented 1 year ago

@stephanegigandet I just need the barcodes, sorted by descending popularity. From the barcodes I can get everything else in subsequent queries.

Because I split the server requests into smaller queries (e.g. with page numbers), the download is robust and fast. Compare that with a hypothetical single 10k query: it is (perhaps) demanding for the server, it requires that the connection does not fail, it prevents the app from doing any other background task in the meantime, and it requires downloading a huge amount of data at once and decoding the JSON in one shot. In this specific case I also split the work into two phases: 1- get the top barcodes, and 2- get the related products.

The point being in the end to download the top 10k products (many fields but not the KP), in the background.
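To make the two-phase idea concrete, here is a rough Python sketch (not the Smoothie implementation, which is in Dart). It assumes phase 1 has already produced a list of barcodes sorted by popularity, and it takes the network call as a `fetch_batch` argument, since the exact endpoint for fetching products by barcode is not specified in this thread. `chunked`, `download_products` and `fake_fetch` are hypothetical names.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_products(barcodes, fetch_batch, batch_size=1000):
    """Phase 2: fetch products batch by batch, keeping failed batches for a retry."""
    products, failed = {}, []
    for batch in chunked(barcodes, batch_size):
        try:
            for product in fetch_batch(batch):
                products[product["code"]] = product
        except OSError:
            # A network error (urllib's URLError is an OSError) only loses this batch
            failed.append(batch)
    return products, failed


def fake_fetch(batch):
    """Stand-in for a real server query, so the sketch runs offline."""
    return [{"code": code, "product_name": f"Product {code}"} for code in batch]


# Hypothetical phase 1 result: 2,500 barcodes, most popular first
top_barcodes = [f"{n:013d}" for n in range(2500)]
batch_sizes = [len(batch) for batch in chunked(top_barcodes, 1000)]
products, failed = download_products(top_barcodes, fake_fetch)
```

A failed batch costs at most one retry of 1k products, whereas a failed single 10k query would have to start over.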

alexgarel commented 1 year ago

@monsieurtanuki we are not sure the server will survive search queries from a lot of users (at this time MongoDB is still on a small server). As this is really something that is shared between users, it would seem more logical to generate an archive that you can download. If you describe what you need, it may be easy to write a small script that generates the archive (which could be updated every week).
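As a starting point for discussion, such a script could look roughly like the sketch below. Everything about the output is an assumption: the gzipped JSON Lines format, the file name, the field list and the `write_archive` / `read_archive` helpers are placeholders, and the input is a small in-memory sample rather than a real database query. Only the `unique_scans_n` popularity field and the 10k size come from earlier comments.

```python
import gzip
import json
import os
import tempfile


def write_archive(products, path, fields, limit=10_000):
    """Write the `limit` most-scanned products to a gzipped JSON Lines file."""
    top = sorted(
        products,
        key=lambda p: p.get("unique_scans_n") or 0,  # missing or null counts as 0
        reverse=True,
    )[:limit]
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for product in top:
            row = {field: product.get(field) for field in fields}
            out.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(top)


def read_archive(path):
    """Read the archive back, one product per line."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        return [json.loads(line) for line in src]


# Made-up sample data standing in for a per-country database query
sample = [
    {"code": "1", "product_name": "A", "unique_scans_n": 5},
    {"code": "2", "product_name": "B", "unique_scans_n": 50},
    {"code": "3", "product_name": "C", "unique_scans_n": None},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "de.jsonl.gz")  # hypothetical file name
    count = write_archive(sample, path, ["code", "product_name"], limit=2)
    restored = read_archive(path)
```

A weekly scheduled run of something like this, once per country, would keep the load on the server independent of the number of app users.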

openfoodfacts / openfoodfacts-server

Generate on-demand miniaturized dump of the database and images for an offline mode #6328
