Open teolemon opened 3 years ago
@teolemon I'm interested!
👋 @monsieurtanuki The full minimized DB is located at https://world.openfoodfacts.org/data/offline/en.openfoodfacts.org.products.small.csv.zip The minimal implementation we've done so far on the classic iPhone app is to unzip it, load it into the database, and show the values stored, and update them as we receive a value from the live API. We re-download the file monthly and update the local db.
And that's it. Smoothie is a little more complex with customized rankings and all, but I think we should start with something simple we can iterate on.
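The unzip / load / refresh cycle described above could be sketched roughly like this (a minimal sketch, assuming a local SQLite cache; the table and column handling here is hypothetical, only the tab-separated `code` / `product_name` columns of the dump are assumed):

```python
import csv
import io
import sqlite3
import zipfile


def load_offline_db(zip_bytes: bytes, conn: sqlite3.Connection) -> int:
    """Unzip the monthly CSV dump and (re)load it into a local table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products"
        " (code TEXT PRIMARY KEY, product_name TEXT)"
    )
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        name = zf.namelist()[0]
        with zf.open(name) as f:
            reader = csv.DictReader(
                io.TextIOWrapper(f, encoding="utf-8"), delimiter="\t"
            )
            rows = [(r["code"], r.get("product_name", "")) for r in reader]
    conn.executemany("INSERT OR REPLACE INTO products VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def update_from_live_api(conn: sqlite3.Connection, code: str, name: str) -> None:
    """Overwrite the cached row whenever the live API returns a fresher value."""
    conn.execute("INSERT OR REPLACE INTO products VALUES (?, ?)", (code, name))
    conn.commit()
```

The monthly re-download then just calls `load_offline_db` again; `INSERT OR REPLACE` keeps the refresh idempotent.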
If you'd like to have an idea of additional complexities and sophistications, there's a (somewhat confusing) Google Doc: https://docs.google.com/document/d/1URdZL2pxIP-lCkM9ZzUVMnoSr2o-d3741TyWimzrJI8/edit#
There's also the Swift code for the iOS implem if you'd like to have a look: https://github.com/openfoodfacts/openfoodfacts-ios/commit/20723dab571fd0c155197093190d293886f876bd
Also, wanna join the channel on Slack? https://slack.openfoodfacts.org
(for the moment I have no access to the google docs file)
The csv file is about 80 MB, and has 1,555,491 lines plus a header.
Some stats, with a Unix command like `cat en.openfoodfacts.org.products.small.csv | awk -F'\t' '{print $7}' | sort | uniq -c`:
- `ecoscore_grade` is never populated
- `nutrition_grade_fr` is never populated
- `nova_group` takes the values 1.0, 2.0, 3.0 and 4.0
That means that there are only 577,722 lines with an attached value.
In addition to that, I saw tons of products in non-latin alphabets.
For the moment I don't see the use case...
aha… ecoscore: normal: not live yet. nutrition_grade_fr >> that's a bug, congratulations for finding it 🎉 :-)
Non latin alphabets: that's because that's the whole db for all countries
I've made a fix server side @monsieurtanuki
> Non latin alphabets: that's because that's the whole db for all countries
That's my point about not understanding the use-case: what's the point of pre-downloading the data of all countries? (disclaimer: I have a smartphone with limited memory and disk space)
What I had in mind is: I'm in a supermarket and I want to check the different scores of - say - breakfast cereals, but I have a bad internet connection. Why would I think about downloading in advance data about foods sold in Russia or Japan? There's no "just in case" argumentation. And my personal eco behavior does not find it relevant either. Beyond the fact that most of the time (as mentioned in an early comment) there's no actual data extracted.
What may make sense is to select food categories and countries, and to focus on that. Maybe automatically: I scan a Kellogg's in France, therefore there's a good chance I'm interested both in breakfast cereals and products sold in France, and that's the range of foods we could cache and refresh periodically.
It was not a just in case choice, but a speed one:
I've shared the design document, where there's a discussion on how to implement properly all of that.
@monsieurtanuki @stephanegigandet says we should start using this country-specific route as you suggested: https://fr.openfoodfacts.org/api/v2/search?fields=code,product_name&page_size=1000 And he will ensure the transition to 10K products will be transparent, once there's something working
@teolemon As I don't understand the purpose I think you should find someone else for this issue (e.g. someone who understands it) - that's why I removed my assignment to this issue about a year ago. But I can still answer questions, write comments and make suggestions (for instance about alternate solutions).
Speed up, or in the worst case enable, scanning in supermarkets. @jasmeet0817 told me that he started using the app in real life after the update for panel expansion. The highlights of his experience were the phone overheating after scanning for a while, and difficulty scanning due to network issues.
About https://fr.openfoodfacts.org/api/v2/search?fields=code,product_name&page_size=1000:
- 1000 products (page size is 1000, page number is 1)
- the file size is 67850 bytes
- it's basically a list of {barcode, name}, e.g.
{"code":"5010477348357","product_name":"Country Crisp 4 noix"}
- in total there are 871104 products
- the total file / download/ database size would then be around 60Mb
- we need to be very lucky when downloading the data page by page given the download size and the number of iterations - probable side-effects for products once in page X+1 and later in page X
- we would probably need to use SQFlite again, because even in "lazy" mode `hive` takes time at init in proportion to the number of records, and would (to be double-checked) pre-load at least the keys (with EAN13 that means around 11 MB)

I still don't understand the use-case: downloading 60 MB of world data for a limited added value (barcode => name). And we would need to refresh it altogether time after time.
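The ~60 MB figure above comes from extrapolating the sample page; the back-of-the-envelope arithmetic can be written out as:

```python
def estimate_total_download(bytes_per_page: int,
                            products_per_page: int,
                            total_products: int) -> float:
    """Extrapolate the full download size (in MB, 1 MB = 10^6 bytes)
    from one sample page."""
    pages = total_products / products_per_page
    return pages * bytes_per_page / 1_000_000


# Figures quoted above: 67,850 bytes for 1,000 products, 871,104 products total.
size_mb = estimate_total_download(67_850, 1_000, 871_104)  # ~59 MB
```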
@teolemon what's the end goal? To pre-download just (barcodes => name) or the whole product. If it's just names, I agree with @monsieurtanuki. If it's the whole product it's going to take a lot of memory, and the database is always increasing. We would also need a syncing mechanism.
Yes I did have issues where I scanned products and then I was just waiting for it to load, but my phone was also very heated up so I'm not really sure if the root cause was the heating up or network issues. Even if it was network issues, I would try to do a thorough analysis before working on this.
Since this is a non-trivial task, I would first validate that we really need this: I would add some metrics on how often product fetch calls time out, and if that rate is higher than a permissible threshold I would go for this feature. If needed, I would also suggest compressing the response payload as much as possible.
@teolemon There are separate problems here:
cf. https://world.openfoodfacts.org/api/v0/product/093270067481501.json and its 19558 bytes ("all" fields here)
> Yes I did have issues where I scanned products and then I was just waiting for it to load
@teolemon @jasmeet0817 For the low connectivity use-case, I suggest that we add in dev mode a switch between the current set of extracted fields, and a minimum set of fields. And then we send @jasmeet0817 go shopping :) (bad luck, it's crowded on Saturdays) What do you think of that, at least for test purposes? Faster scan, faster carousel. And when we go to the product page, we download all the fields (to be coded in a second step, if relevant).
Anyway, it's a bit paradoxical to use scores for fast assessment of products ("it's A, it's D") and flood users with tons of the detailed data from which the scores were computed. In a "more about it..." button, fair enough. But when you're in a busy supermarket with low (or expensive) connectivity, the faster the better.
And more or less that's similar to the OP: a downgraded mode, and a full mode. The difference is that "my" downgraded mode does not handle offline queries and doesn't imply pre-loading tons of data. I can work on that downgraded mode (= limited fields) / full mode.
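The dev-mode switch could boil down to two field lists feeding the same search URL (a sketch: the minimal list below is an assumption for illustration, not the app's actual list):

```python
# Hypothetical minimal field set for fast scanning; the full set is the long
# list of extracted fields quoted elsewhere in this thread.
BASE = "https://fr.openfoodfacts.org/api/v2/search"

MINIMAL_FIELDS = [
    "code", "product_name", "brands",
    "nutrition_grade_fr", "ecoscore_grade", "image_front_small_url",
]


def search_url(fields, page_size=1000, base=BASE):
    """Build the v2 search URL for a given field set."""
    return f"{base}?page_size={page_size}&fields={','.join(fields)}"


url = search_url(MINIMAL_FIELDS)
```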
Following is the current list of product fields we extract:
PS: my understanding of offline scanning hasn't changed, as described in the following video :) https://www.youtube.com/watch?v=oyll1XxKh-M
@teolemon The list of fields I provided in the previous comment was the current one - I guess gluten is already there, probably in `ATTRIBUTE_GROUPS`.
I suggest that you define a list of product fields that you think we definitely need: with that we can estimate the volume of the full offline database for France.
The query for the 1000 first products with our current list of fields is: https://fr.openfoodfacts.org/api/v2/search?page_size=1000&fields=code,product_name,brands,nutrition_grade_fr,image_small_url,image_front_small_url,image_front_url,image_ingredients_url,image_nutrition_url,image_packaging_url,selected_images,quantity,serving_size,product_quantity,nutriments,additives_tags,nutrient_levels,nutriment_energy_unit,ingredients_analysis_tags,labels,labels_tags_,environment_impact_level_tags,categories_tags_,lang,attribute_groups,states_tags,ecoscore_grade,ecoscore_score,ecoscore_data
The resulting size is 16,840,056 bytes, for 1000 products. For 871104 products it means 14.5 Gb, with all the current fields.
Probably this is needed to display only the summary card?
We could remove images altogether. What would be the cost of having 1K to 10K of those? https://images.openfoodfacts.org/images/products/20301415/front_fr.37.100.jpg
Edit: 2.7 KB × (1K to 10K) is roughly up to 30 MB (possibly times 3 to 4 if we want to show which images are available)
There are some fields that we possibly don't need anymore thanks to knowledge panels and attributes
That means 7Gb for the 900K records.
Now that you mention the images, let me point that image data (png, jpg) are not even included, just the urls.
yup, I know, we could save space by removing image urls, or conversely we could decide to let the user even get images. But probably not for 900K records.
If we really want offline storage, then the best would be to store the bare minimum data for only the top X popular products in a country.
> If we really want offline storage, then the best would be to store the bare minimum data for only the top X popular products in a country.
@jasmeet0817 @teolemon Looks like a very good idea! We could even download everything for those top X popular products.
Additional data following...
Without image_front_small_url: 8,442,485 bytes per 1K products - we don't win anything https://fr.openfoodfacts.org/api/v2/search?page_size=1000&fields=code,product_name,brands,nutrition_grade_fr,quantity,lang,attribute_groups,ecoscore_grade
Without attribute_groups: 281,493 bytes per 1K products - we do win a lot: 30 times smaller! https://fr.openfoodfacts.org/api/v2/search?page_size=1000&fields=code,product_name,brands,nutrition_grade_fr,quantity,lang,ecoscore_grade,image_front_small_url
The thing is that `attribute_groups` are too fat and redundant.
For instance, the first attribute of the first product:
```jsonc
{
  "description": "", // we don't need
  "icon_url": "https://static.openfoodfacts.org/images/attributes/nutriscore-a.svg", // we can use a reference
  "name": "Nutri-Score", // we can use a reference
  "title": "Nutri-Score A", // we can use a reference
  "grade": "a", // mandatory
  "id": "nutriscore", // mandatory
  "match": 100, // mandatory
  "status": "known", // we can assume that the status is known if there's a match > 0
  "description_short": "Très bonne qualité nutritionnelle" // we probably can use a reference in most cases
}
```
I tried to "simplify" the attributes of the first product, and I compressed from 8635 to 2308 bytes (modulo the \n)
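A sketch of that simplification: keep only the fields annotated "mandatory" in the example above, on the assumption that the rest can be rebuilt from reference tables keyed on `(id, grade)`. The `simplify_attribute` helper is hypothetical.

```python
# Fields marked "mandatory" in the annotated attribute above; everything else
# (icon_url, name, title, status, descriptions) is assumed derivable from
# reference data keyed on (id, grade).
MANDATORY = ("id", "grade", "match")


def simplify_attribute(attribute: dict) -> dict:
    """Strip an attribute down to its non-derivable fields."""
    return {key: attribute[key] for key in MANDATORY if key in attribute}


full = {
    "description": "",
    "icon_url": "https://static.openfoodfacts.org/images/attributes/nutriscore-a.svg",
    "name": "Nutri-Score",
    "title": "Nutri-Score A",
    "grade": "a",
    "id": "nutriscore",
    "match": 100,
    "status": "known",
    "description_short": "Très bonne qualité nutritionnelle",
}
mini = simplify_attribute(full)
```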
I've just simplified again the attributes, this time "à la SQL", and it looks less poetic but much more compact (should take half the space - for attributes):
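The SQL-style version could look like one compact row per (product, attribute) pair. A sketch assuming SQLite; the table and column names are hypothetical (`match` is spelled `match_score` since MATCH is an SQL keyword):

```python
import sqlite3

# Hypothetical compact layout: one row per (product, attribute), with the
# display strings reconstructed from reference data keyed on (attribute_id, grade).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product_attribute (
        barcode      TEXT NOT NULL,
        attribute_id TEXT NOT NULL,
        grade        TEXT,
        match_score  INTEGER,
        PRIMARY KEY (barcode, attribute_id)
    )
    """
)
conn.execute(
    "INSERT INTO product_attribute VALUES (?, ?, ?, ?)",
    ("5010477348357", "nutriscore", "a", 100),
)
row = conn.execute(
    "SELECT grade, match_score FROM product_attribute"
    " WHERE barcode = ? AND attribute_id = ?",
    ("5010477348357", "nutriscore"),
).fetchone()
```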
As I said before, you don't change a SQL database structure like you do with JSON files, therefore we should be very careful with what we really want:
- `where` clauses: obviously on the barcode, but on the name? on the categories? on some attributes?

@teolemon @g123k continuing from #5392
I would compute the number of products (1M for France?) multiplied by the size of each product (in a mini version: barcode, name, main image), and then I'll realize that the server doesn't accept more than 10 queries a minute.
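Those two constraints (page size and rate limit) give the order of magnitude directly. A sketch, using the 1M-product and 10-queries-per-minute figures from the comment above:

```python
def full_sync_minutes(total_products: int,
                      page_size: int,
                      max_queries_per_minute: int) -> float:
    """How long a full page-by-page download takes under the server rate limit."""
    pages = -(-total_products // page_size)  # ceiling division
    return pages / max_queries_per_minute


# 1M products, 1,000 per page, 10 queries/minute: 1,000 pages, 100 minutes
minutes = full_sync_minutes(1_000_000, 1_000, 10)
```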
That would also mean a specific MiniProduct table.
Actually I don't know the use-case, or more precisely: how minified should the product version be?
I think we should drop images altogether, in favor of less size or more info (e.g. image status, attributes…). Core use cases are:
We should steer away from search queries, and generate a one-size-fits-all mini dump for each country with product_name, nutriscore, ecoscore, nova_group, and possibly: attributes, states. We could slice the mini-dump based on user prefs (and remove some attributes)
> scanning in a supermarket with no network and getting the score
That could even be the subtitle of a new app! I used an offline map app years ago and the first step was to select which countries to download.
> does the product exist, and potentially do we have photos for it
In both cases, if we want to put that inside Smoothie that should be in distinct pages, at least in a first step:
@teolemon you can emulate the no-network scan session after downloading a significant set of products (cf. dev mode / offline) and switching to flight mode. Immediate results. Then you can think about which data a user would really need in this or that use case.