Open monikamarr opened 5 months ago
If I am not wrong, the retry mechanism is shared among all the fetch_* functions (apart from fetch_openml).
We therefore don't have a test specifically for each function, but we test the functionality itself, at least from what I can see when we merged the PR: https://github.com/scikit-learn/scikit-learn/pull/28160/files#diff-e86a2571c78195bb3a3837fb36bdc1acfd7ab14908fcf86801121a07f78f1d3dR373-R393
The current test takes the corrupted-file case into account, but I think that we could test at the level of granularity of this helper function rather than at the level of the California housing function.
I don't think we need to improve the tests, since it wouldn't add safety for the end users. These datasets are also used only for educational purposes. So the data being corrupt leads to minimal risks.
However, I wouldn't mind adding a checksum check in our fetch methods to check against known checksums, and if it's not satisfied, either raise or retry.
@monikamarr would you like to open a PR for that?
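For illustration, the "check against a known checksum, then raise or retry" idea could be sketched as below. This is a minimal sketch and not scikit-learn's actual code: `sha256_of`, `fetch_verified`, and the `download` callable are hypothetical names, and the expected digest is assumed to be stored alongside the dataset URL.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=8192):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_verified(download, path, expected_sha256, n_retries=3):
    """Call `download(path)` until the file's checksum matches.

    `download` is a hypothetical callable that writes the file to `path`.
    Raises OSError once the initial attempt and `n_retries` retries all fail.
    """
    for _ in range(n_retries + 1):
        download(path)
        if sha256_of(path) == expected_sha256:
            return Path(path)
    raise OSError(
        f"{path} failed SHA-256 verification after {n_retries + 1} attempts"
    )
```

Whether a mismatch should trigger a retry or an immediate error is a design choice; setting `n_retries=0` in this sketch gives the "raise" behavior.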
@adrinjalali can I open a PR adding checksum checks for that? I just need to know where I can find the checksum hashes for the datasets.
@monikamarr would you like to open a PR for that?
Yes, absolutely!
This issue proposes enhancements to the testing suite for the California housing dataset in scikit-learn, aimed at increasing its robustness against data corruption and network issues.
The current testing suite does not fully address scenarios where the downloaded dataset may be corrupted or when network failures occur during the download process. Enhancing these tests will ensure the module behaves reliably under adverse conditions, maintaining data integrity, even when external factors like network issues or file corruption occur.
Suggested Tests:
Test for Retry Mechanism on Failed Download -- verify the dataset fetching retries the download the specified number of times before giving up in case of network errors.
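A test along these lines could look like the sketch below. It does not exercise scikit-learn's real helper: `fetch_with_retries` is a hypothetical stand-in for the shared retry logic, and the URL is a placeholder. The point is only to show how a mocked opener can simulate network failures without touching the network.

```python
import time
from unittest.mock import Mock
from urllib.error import URLError


def fetch_with_retries(opener, url, n_retries=3, delay=0.0):
    """Hypothetical stand-in for a shared retry helper."""
    for attempt in range(n_retries + 1):
        try:
            return opener(url)
        except URLError:
            # Re-raise once the initial attempt and all retries have failed.
            if attempt == n_retries:
                raise
            time.sleep(delay)


def test_retries_then_succeeds():
    # Two simulated network failures followed by a successful download.
    opener = Mock(side_effect=[URLError("down"), URLError("down"), b"payload"])
    result = fetch_with_retries(opener, "https://example.invalid/data", n_retries=2)
    assert result == b"payload"
    assert opener.call_count == 3


def test_gives_up_after_n_retries():
    # Every attempt fails, so the helper should stop after 1 + n_retries calls.
    opener = Mock(side_effect=URLError("down"))
    try:
        fetch_with_retries(opener, "https://example.invalid/data", n_retries=2)
    except URLError:
        pass
    else:
        raise AssertionError("expected URLError")
    assert opener.call_count == 3
```

Testing at the level of the helper, rather than one dataset function, keeps the tests independent of any particular dataset.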
(Tags: good first issue, help wanted, Easy).