openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

Generate on-demand miniaturized dump of the database and images for an offline mode #6328

Open teolemon opened 2 years ago

teolemon commented 2 years ago

### Who for

### Tasks
- [ ] Create an API to generate on-demand miniaturized dump of the database for an offline mode
  - [ ] Generate on-demand miniaturized dump of the database for an offline mode
  - [ ] Allow slicing on-demand miniaturized dump of the database by country, popularity, stores, language, required fields
- [ ] Create an API to generate an on-demand Zip of the smallest resolution of front images, based on the same slicing as for an offline mode
  - [ ] Zip the smallest resolution of front images, based on the same slicing

### Why

stephanegigandet commented 2 years ago

Generating that kind of data for 1,000 or more products cannot be done on demand, but we can pre-compute it for all countries x their official languages instead.

In fact we already have an on-demand version: the search API: https://fr.openfoodfacts.org/api/v2/search?fields=code,product_name&page_size=1000 (but that takes 10 seconds of heavy load on the server, so we could trigger it when users specifically ask for it in a menu, but certainly not at app initialization, where a pre-computed dump would make much more sense)
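
For reference, a minimal Python sketch of that one-shot search call (the endpoint and parameters come straight from the URL above; the `products` key reflects the usual v2 response shape, so treat that as an assumption to verify):

```python
import requests

# One-shot slice of 1000 products, same query as the link above
resp = requests.get(
    "https://fr.openfoodfacts.org/api/v2/search",
    params={"fields": "code,product_name", "page_size": 1000},
    timeout=60,
)
resp.raise_for_status()
# v2 search responses carry the results in a "products" list
products = resp.json()["products"]  # [{"code": ..., "product_name": ...}, ...]
print(len(products), "products fetched")
```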

Knowledge panels are too big for an offline dump.

stephanegigandet commented 2 years ago

So, to experiment on mobile: just use the search API. Once we're happy with it, we can generate 10k dumps for all countries, with results in exactly the same format as the search API.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity.

teolemon commented 2 years ago

Probably solved by a proper implementation of the Search V2 API in the Dart package

CharlesNepote commented 1 year ago

If images don't matter (as is often the case), CSV exports are very easy to do with Mirabelle (the limit is 1,000,000 lines but we can change it); see below.

I understand the use cases that need images, but is it reasonable to add dozens of MB on a smartphone just to make the search nicer?

### Export specific data in CSV with Mirabelle and SQL

1 -- Build your query (or ask someone to build it for you). E.g. all German products that have been scanned at least once.

```sql
-- Products from Germany that have been scanned at least one time
select code, product_name from [all]
where countries_en like "%germany%" and unique_scans_n is not null
order by unique_scans_n desc
-- the limit here displays 20 results; remove it or comment it out with "--" when you build your CSV export
limit 20
```

https://mirabelle.openfoodfacts.org/products?sql=--+Products+from+Germany+that+have+been+scanned+at+least+one+time%0D%0Aselect+code%2C+product_name+from+%5Ball%5D%0D%0Awhere+countries_en+like+%22%25germany%25%22+and+unique_scans_n+is+not+null%0D%0Aorder+by+unique_scans_n+desc%0D%0A--+the+limit+here+displays+20+results%3B+remove+it+or+comment+it+with+%22--%22+when+you+build+your+CSV+export%0D%0Alimit+20

2 -- Copy "CSV" link on the result page.

3 -- If necessary, edit the link to remove the "limit+20" clause to get all the products. E.g. (don't click this link if you don't want to get 90,000+ products) https://mirabelle.openfoodfacts.org/products.csv?sql=--+Products+from+Germany+that+have+been+scanned+at+least+one+time%0D%0Aselect+code%2C+product_name+from+%5Ball%5D%0D%0Awhere+countries_en+like+%22%25germany%25%22+and+unique_scans_n+is+not+null%0D%0Aorder+by+unique_scans_n+desc%0D%0A--+the+limit+here+displays+20+results%3B+remove+it+or+comment+it+with+%22--%22+when+you+build+your+CSV+export%0D%0A&_size=max

Now you can use this link to download the CSV with your favourite tool (wget, curl, web browser, etc.).
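
If you prefer a script over wget/curl, here is a minimal Python sketch (the URL placeholder stands for the CSV link you copied at step 2; the output filename is just an example):

```python
import requests

# Paste the edited "CSV" link here (limit removed, &_size=max kept)
CSV_URL = "https://mirabelle.openfoodfacts.org/products.csv?sql=..."

# Stream the response to disk so a large export doesn't sit in memory
with requests.get(CSV_URL, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    with open("germany_products.csv", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 16):
            f.write(chunk)
```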

teolemon commented 1 year ago
monsieurtanuki commented 1 year ago

@teolemon Not sure what you're worrying about, as we already download the top 1k products in just one shot in Smoothie (without knowledge panels). If we can manage to extract just the barcodes of the top 10k products, we can loop the product download over a selection of 1k barcodes at a time.

stephanegigandet commented 1 year ago

@monsieurtanuki Which fields do you need in the dump? Attributes but not knowledge panels? Would this be only to show the scan card, with opening the product page going through a live query?

monsieurtanuki commented 1 year ago

@stephanegigandet I just need the barcodes, sorted by descending popularity. From the barcodes I can get everything in subsequent queries.

As I split the server queries into smaller ones (e.g. with page numbers), I am robust and fast, compared to a hypothetical 10k query that is (perhaps) demanding for the server, requires the connection not to fail at any point, prevents the app from doing any other background task meanwhile, and requires downloading a huge amount of data at once and de-JSONing it in one shot. In that specific case I also split the work in two phases: 1- get the top barcodes, and 2- get the related products.

The point being in the end to download the top 10k products (many fields but not the KP), in the background.
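
To make the two phases concrete, here is a rough Python sketch (the `sort_by=popularity` value and the comma-separated `code` filter are assumptions about the v2 search API, to verify against the docs before relying on them):

```python
import requests

BASE = "https://world.openfoodfacts.org/api/v2/search"

def top_barcodes(n=10_000, page_size=1_000):
    # Phase 1: page through barcodes only, sorted by popularity
    codes = []
    for page in range(1, n // page_size + 1):
        resp = requests.get(BASE, params={
            "fields": "code",
            "sort_by": "popularity",  # assumed sort key, to verify
            "page_size": page_size,
            "page": page,
        }, timeout=60)
        resp.raise_for_status()
        codes += [p["code"] for p in resp.json()["products"]]
    return codes

def fetch_batch(codes, fields="code,product_name,brands"):
    # Phase 2: fetch full records for one batch of barcodes;
    # assumes v2 search accepts a comma-separated barcode list in "code"
    resp = requests.get(BASE, params={
        "code": ",".join(codes),
        "fields": fields,
        "page_size": len(codes),
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["products"]

# Download 10k products in 1k-barcode batches, so each request stays
# small and a failed one can simply be retried
all_codes = top_barcodes()
products = []
for i in range(0, len(all_codes), 1_000):
    products += fetch_batch(all_codes[i:i + 1_000])
```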

alexgarel commented 1 year ago

@monsieurtanuki we are not sure the server will survive search queries from a lot of users (at this time MongoDB is still on a small server). As this is really something that is shared between users, it would seem more logical to generate an archive that you can download. If you describe what you need, maybe it's easy to code it as a small script to generate an archive (that could be updated every week).
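
For illustration, such a script could look roughly like this in Python, assuming the v2 search API and a `countries_tags_en` filter (nothing here is an existing OFF script, and the filter name is an assumption; the weekly refresh would just be a cron job serving the file statically):

```python
import gzip
import json
import requests

def build_archive(country="germany", pages=10, page_size=1_000,
                  out_path="top_products_germany.json.gz"):
    # Page through the search API once, server-side or from a batch host
    products = []
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://world.openfoodfacts.org/api/v2/search",
            params={
                "countries_tags_en": country,  # assumed filter name
                "fields": "code,product_name",
                "page_size": page_size,
                "page": page,
            },
            timeout=120,
        )
        resp.raise_for_status()
        products += resp.json()["products"]
    # Write one gzipped JSON file, mirroring the search API's
    # "products" shape as suggested earlier in this thread
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        json.dump({"products": products}, f)

if __name__ == "__main__":
    build_archive()  # run weekly via cron, serve the file statically
```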