jjjake / internetarchive

A Python and Command-Line Interface to Archive.org
GNU Affero General Public License v3.0

Severe issue in Search()._scrape() #610

Open · bumatic opened this issue 11 months ago

bumatic commented 11 months ago

internetarchive version: 3.5.0 (Python version and OS seem irrelevant to this issue)

When searching with a query that returns more than 10,000 items, e.g. mediatype:software, the following error is always raised by Search()._scrape():

# i counts the documents yielded by _scrape(); num_found is the total
# reported by the scrape API response.
if i != num_found:
    raise ReadTimeout('The server failed to return results in the'
                      f' allotted amount of time for {r.request.url}')
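
For reference, the error can be reproduced roughly as follows (a sketch; iterating a large result set is what drives _scrape() under the hood, and this particular query takes a long time to exhaust):

from internetarchive import search_items

# Any query with more than 10,000 results goes through the scrape API;
# per the behavior described below, the ReadTimeout is raised only after
# the final document has been yielded.
for result in search_items('mediatype:software'):
    pass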

While tracing the issue, I found that the request to the named r.request.url in fact completes correctly. Indeed, my results contained one more item than the API reported: for the query mediatype:software, i is 1043904 while num_found is 1043903.

I don't know why the API returns more results than it reports for this query, but raising a ReadTimeout based on the condition i != num_found is too restrictive, especially since self._handle_scrape_error(j) is invoked earlier and should already catch errors.

Nevertheless, I assume this condition was included for a reason that I cannot figure out right now, so I can only suggest rough ideas for resolving the issue. Two that come to mind: remove the if conditional altogether (and potentially enhance self._handle_scrape_error(j)), or weaken the condition to if i < num_found: (sketched below).
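
To illustrate the second idea, the check at the end of _scrape() would become something like this (a rough sketch, not a tested patch):

if i < num_found:
    raise ReadTimeout('The server failed to return results in the'
                      f' allotted amount of time for {r.request.url}')

This would still catch truncated result sets while tolerating the API returning one or more extra documents.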

P.S. I checked for duplicate issues but could not find any. A complete traceback can be provided; since I identified the problem, though, it seemed redundant. Let me know if I am wrong and you would like me to post it anyway.

jjjake commented 11 months ago

Thanks for the report @bumatic. This was added to deal with an issue on the archive.org side of things (a timeout happening on the backend, leading to the search API failing silently). The aggressive doc-count check is there to keep someone from thinking they dumped a full result set when in fact they haven't.

Let me look into this more and give it some thought. Thanks again for reporting, and sorry for the trouble.