linogaliana opened this issue 3 years ago
In the original link (for example https://world.openfoodfacts.org/packaging.json), if I save the data as a .json file and open it manually, it also shows 10000 rows, so I'm guessing this issue is related to the website API and not the Python module?
Hi ! Was just getting started with openfoodfacts when I came across this issue.

The actual problem here is:

The 10000 limit is being caused by omission of the parameter sysparm_limit, whose default value is 10000. If you specify a higher value in the URL then you can get the desired amount of records. https://community.servicenow.com/community?id=community_question&sys_id=ee160f61db1cdbc01dcaf3231f961911

Even though we can surely increase the limit on the number of records returned, it will almost certainly lead to a decrease in performance and increased waiting times.

A suggested way to solve this issue would be to create a set of new functions which handle pagination. We could have 2 different types of functions: `get_all_<facet_name>()` and `get_page_<facet_name>()`.

The `get_all_<facet_name>()` function would internally call the `get_page_<facet_name>()` function repeatedly until all the pages have been fetched one by one. Since this data can be large, we can create a `FacetContainer` which shall store the entire fetched data while also providing easy and efficient access to functions which can be helpful in manipulating and moving the data around.

Combined, these 2 suggestions, if implemented, should be able to solve the following issues:

- [ ] Document or Increase the maximum amount of results using facets #69 : by dividing the entire available data into pages and also providing control over the number of records which should be returned per page.
- [ ] how to get categories by page #56 : the second part of this feature implementation involves the use of the `FacetContainer` class to implement functions to aid in data manipulation and movement. This class can be used to add more precise filters to the data stored inside it, thereby acting as a powerful tool for working with records.
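A minimal sketch of the pagination idea above. The names `FacetContainer`, `get_all`, and the injected `fetch_page` callable are hypothetical (nothing here exists in the openfoodfacts package yet); a stubbed fetcher stands in for the real HTTP call so the looping logic is visible on its own:

```python
class FacetContainer:
    """Hypothetical container that accumulates facet records across pages
    and offers simple filtering over the stored records."""

    def __init__(self):
        self._records = []

    def extend(self, records):
        self._records.extend(records)

    def filter(self, predicate):
        # Return only the records matching the given predicate.
        return [r for r in self._records if predicate(r)]

    def __len__(self):
        return len(self._records)


def get_all(fetch_page):
    """Call fetch_page(page) repeatedly, starting at page 1, until an
    empty page signals that every record has been fetched."""
    container = FacetContainer()
    page = 1
    while True:
        records = fetch_page(page)
        if not records:
            break
        container.extend(records)
        page += 1
    return container


# Usage with a stubbed fetcher standing in for a real HTTP request:
pages = {1: [{"id": "en:glass"}, {"id": "en:plastic"}], 2: [{"id": "en:metal"}]}
all_records = get_all(lambda page: pages.get(page, []))
print(len(all_records))  # 3
```

In a real implementation `fetch_page` would be the per-facet `get_page_<facet_name>()` function, so each `get_all_<facet_name>()` can share this one loop.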
Would love to work on and implement these features if they could help address the above-mentioned issues.
@linogaliana @MahmoudHamdy02 @Anubhav-Bhargava kindly do let me know about your views on this approach.
Hi, thanks for the suggestion. Yes I think it's a great idea !
This would not penalize people that do not need to retrieve a large number of items but could help others to retrieve more data.
Awesome. Thank you for the review. I'll start working on this feature right away and open a PR once it's done.
@linogaliana have you tried using this : https://world.openfoodfacts.org/?json=packaging&page=300
This seems to be working and paginates the data properly. The problem seems to be due to the direct call to the `.json` endpoint in the `get_<facet_name>()` function, rather than passing the facet in as a query argument like `?json=packaging`. The `page` argument can then be used to get a specific page. Although I believe this still calls for the implementation of the `FacetContainer` class, it seems like the paging functionality is already implemented by the openfoodfacts-server and is working properly.

Kindly do let me know if the above solution works for you. Thanks !
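For reference, a minimal sketch of calling the paginated query form discussed above. Only the URL shape (`/?json=<facet>&page=<n>`) comes from the thread; the helper names are made up for illustration:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://world.openfoodfacts.org/"


def facet_page_url(facet, page):
    """Build the paginated query-string URL, e.g. /?json=packaging&page=300."""
    return BASE_URL + "?" + urlencode({"json": facet, "page": page})


def get_facet_page(facet, page):
    """Fetch one page of a facet and return the decoded JSON payload
    (network call; requires internet access)."""
    with urlopen(facet_page_url(facet, page), timeout=30) as resp:
        return json.load(resp)


print(facet_page_url("packaging", 300))
# https://world.openfoodfacts.org/?json=packaging&page=300
```

Building the URL in a separate helper keeps the request logic testable without hitting the server.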
@alexgarel @teolemon would be grateful if you could kindly review this issue and the above-mentioned solution. Open to suggestions regarding the implementation of the `FacetContainer` class. Would it be a good addition to the codebase? Thanks in advance !
Hi @Ansh-Sarkar thank you for trying to contribute, and really sorry for the lag (I let this notification slip away…). Do not hesitate to come and ping us on slack in the #python channel
I'll comment on your ticket, but yes, I'm 100% in favour of a class.
Also you are encouraged to migrate as much as possible to the current API: https://openfoodfacts.github.io/api-documentation/
Warning: the search v2 documentation is here (for now): https://wiki.openfoodfacts.org/Open_Food_Facts_Search_API_Version_2
@alexgarel not an issue at all. Had been waiting for a reply in order to commence work on this. Will surely join the slack channel and start implementing this feature. Would be glad to contribute.
Also yeah I'll check out the API documentation as well. Thanks!
I have the impression that the maximum number of results when requesting using facets is 10 000. I don't see that mentioned in the `README` or in the doc. Is it possible to get more results? (Maybe I am doing something wrong.) Otherwise, mentioning that in the `README` would be useful.
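A small sketch of how one could spot this truncation, assuming the facet response carries `count` and `tags` keys as the `packaging.json` payload appears to (the `is_truncated` helper is hypothetical):

```python
def is_truncated(payload, cap=10000):
    """Return True when the payload reports more total records ('count')
    than rows actually returned ('tags'), hinting at a server-side cap."""
    tags = payload.get("tags", [])
    return len(tags) >= cap and payload.get("count", 0) > len(tags)


# Example with a stubbed payload mimicking a capped facet response:
payload = {"count": 12000, "tags": [{"id": i} for i in range(10000)]}
print(is_truncated(payload))  # True
```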