openfoodfacts / openfoodfacts-python

🐍 Python package for Open Food Facts
https://openfoodfacts.github.io/openfoodfacts-python/

Document or Increase the maximum amount of results using facets #69

Open linogaliana opened 3 years ago

linogaliana commented 3 years ago

I have the impression that the maximum number of results returned when querying facets is 10,000. I don't see that mentioned in the README or in the docs. Is it possible to get more results? (Maybe I am doing something wrong.)

Otherwise, mentioning that in the README would be useful.

import openfoodfacts
import pandas as pd

brands = openfoodfacts.facets.get_brands()
brands = pd.json_normalize(brands)
brands.shape
# (10000, 5)
packagings = openfoodfacts.facets.get_packaging()
packagings = pd.json_normalize(packagings)
packagings.shape
# (10000, 6)

MahmoudHamdy02 commented 2 years ago

If I download the data directly from the original link (for example https://world.openfoodfacts.org/packaging.json) and open the saved .json file manually, it also shows 10000 rows, so I'm guessing this limit comes from the website API and not the Python module?
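
The manual check described above can be scripted. This is only a rough sketch: the helper names are made up for illustration, and it assumes the facet document has the shape {"count": ..., "tags": [...]}, which is not confirmed anywhere in this thread.

```python
import json
from typing import Any, Dict, Tuple
from urllib.request import urlopen


def fetch_facet_json(url: str, timeout: float = 30.0) -> Dict[str, Any]:
    """Download and parse a facet document such as .../packaging.json."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize_facet_payload(payload: Dict[str, Any]) -> Tuple[Any, int]:
    """Return (count reported by the server, number of entries returned).

    Assumption: the payload looks like {"count": ..., "tags": [...]}.
    Missing keys yield None and 0 respectively.
    """
    tags = payload.get("tags", [])
    return payload.get("count"), len(tags)
```

Calling summarize_facet_payload(fetch_facet_json("https://world.openfoodfacts.org/packaging.json")) would then show whether the server reports more entries than it actually returns. If it does, the truncation happens before the Python module ever sees the data.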

Ansh-Sarkar commented 2 years ago

Hi! I was just getting started with openfoodfacts when I came across this issue.

The actual problem here is:

The 10000 limit is caused by omitting the parameter sysparm_limit, whose default value is 10000. If you specify a higher value in the URL, you can get the desired number of records. https://community.servicenow.com/community?id=community_question&sys_id=ee160f61db1cdbc01dcaf3231f961911

Although we could certainly increase the limit on the number of records returned, doing so would almost certainly hurt performance and increase waiting times.

A suggested way to solve this issue would be to create a set of new functions that handle pagination. We could have two types of functions: get_all_<facet_name>() and get_page_<facet_name>().

The get_all_<facet_name>() function would internally call get_page_<facet_name>() repeatedly until every page has been fetched. Since this data can be large, we could create a FacetContainer class that stores all the fetched data and also provides convenient, efficient helper functions for manipulating and moving the data around.

Combined, these two suggestions, if implemented, should be able to solve the following issues:

  • [ ] Document or Increase the maximum amount of results using facets #69: by dividing the available data into pages and providing control over the number of records returned per page.
  • [ ] how to get categories by page #56: the second part of this feature involves using the FacetContainer class to implement functions that aid data manipulation and movement. This class could be used to apply more precise filters to the data stored inside it, making it a powerful tool for working with records.

I would love to work on and implement these features if they could help address the above-mentioned issues.
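
To make the proposal concrete, here is a minimal sketch of how the two pieces could fit together. Nothing in it exists in the package today: FacetContainer and get_all are only the names suggested above, while the fetch_page callable, the max_pages guard, and the "stop at the first empty page" rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List

Tag = Dict[str, object]
# Hypothetical signature: (facet_name, page_number) -> list of tags on that page.
PageFetcher = Callable[[str, int], List[Tag]]


@dataclass
class FacetContainer:
    """Hypothetical container holding every tag fetched for one facet."""

    facet_name: str
    tags: List[Tag] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.tags)

    def __iter__(self) -> Iterator[Tag]:
        return iter(self.tags)

    def filter(self, predicate: Callable[[Tag], bool]) -> "FacetContainer":
        """Return a new container with only the tags matching predicate."""
        return FacetContainer(self.facet_name, [t for t in self.tags if predicate(t)])


def get_all(facet_name: str, fetch_page: PageFetcher, max_pages: int = 1000) -> FacetContainer:
    """Call fetch_page for pages 1, 2, ... until a page comes back empty.

    Assumption: an empty page marks the end of the data. max_pages is a
    safety guard against looping forever if that assumption is wrong.
    """
    container = FacetContainer(facet_name)
    for page in range(1, max_pages + 1):
        tags = fetch_page(facet_name, page)
        if not tags:
            break
        container.tags.extend(tags)
    return container
```

Passing the page fetcher in as an argument keeps the looping logic independent of how a single page is actually retrieved, so it can be exercised with a fake fetcher before any network code is written.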

Ansh-Sarkar commented 2 years ago

@linogaliana @MahmoudHamdy02 @Anubhav-Bhargava kindly let me know your views on this approach.

linogaliana commented 2 years ago

Hi, thanks for the suggestion. Yes, I think it's a great idea!

This would not penalize people who do not need to retrieve a large number of items, but could help others retrieve more data.

Ansh-Sarkar commented 2 years ago

Awesome. Thank you for the review. I'll start working on this feature right away and open a PR once it's done.

Ansh-Sarkar commented 2 years ago

@linogaliana have you tried using this: https://world.openfoodfacts.org/?json=packaging&page=300

This seems to work and paginates the data properly. The problem seems to come from the get_<facet_name>() function calling the .json endpoint directly rather than passing the facet as a query argument, like ?json=packaging. The page argument can also be used to get a specific page. Although I believe this still calls for the implementation of the FacetContainer class, it seems that paging is already implemented by openfoodfacts-server and is working properly.

Kindly let me know if the above solution works for you. Thanks!
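
For anyone who wants to try this from Python, a small sketch follows. The helper names are hypothetical and not part of the package; the only thing taken from the observation above is the URL form with the json and page query arguments.

```python
import json
from typing import Any, Dict
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://world.openfoodfacts.org/"


def build_facet_page_url(facet_name: str, page: int, base_url: str = BASE_URL) -> str:
    """Build a URL of the form <base>?json=<facet>&page=<n>."""
    return base_url + "?" + urlencode({"json": facet_name, "page": page})


def fetch_facet_page(facet_name: str, page: int, timeout: float = 30.0) -> Dict[str, Any]:
    """Download one page of a facet and parse the JSON body.

    The structure of the returned document is not assumed here; inspect
    it before relying on any particular key.
    """
    with urlopen(build_facet_page_url(facet_name, page), timeout=timeout) as response:
        return json.load(response)
```

build_facet_page_url("packaging", 300) reproduces the link given above, so fetch_facet_page("packaging", 300) should request the same page.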

@alexgarel @teolemon I would be grateful if you could review this issue and the solution mentioned above. I'm open to suggestions regarding the implementation of the FacetContainer class. Would it be a good addition to the codebase? Thanks in advance!

alexgarel commented 2 years ago

Hi @Ansh-Sarkar, thank you for trying to contribute, and really sorry for the lag (I let this notification slip away…). Do not hesitate to come and ping us on Slack in the #python channel.

I'll comment on your ticket, but yes, I'm 100% in favour of a class.

Also you are encouraged to migrate as much as possible to the current API: https://openfoodfacts.github.io/api-documentation/

Warning: the search v2 documentation is here (for now): https://wiki.openfoodfacts.org/Open_Food_Facts_Search_API_Version_2

Ansh-Sarkar commented 2 years ago

@alexgarel not an issue at all. I had been waiting for a reply before starting work on this. I will join the Slack channel and start implementing this feature. I would be glad to contribute.

Also, yes, I'll check out the API documentation as well. Thanks!