scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License
60.08k stars 25.39k forks source link

Add More Robustness Tests to the California Housing Dataset #29094

Open monikamarr opened 5 months ago

monikamarr commented 5 months ago

This issue proposes enhancements to the testing suite for the California housing dataset in scikit-learn, aimed at increasing its robustness against data corruption and network issues.

The current testing suite does not fully address scenarios where the downloaded dataset may be corrupted or when network failures occur during the download process. Enhancing these tests will ensure the module behaves reliably under adverse conditions, maintaining data integrity, even when external factors like network issues or file corruption occur.

Suggested Tests:

  1. Test for Correct Handling of Corrupted Files -- ensure the data fetching process can handle corrupted files, either by retrying the download or raising an informative error.
    def test_fetch_corrupted_file(fetch_california_housing_fxt, mocker): 
        with mocker.patch('joblib.load', side_effect=OSError("Corrupted file")): 
                with pytest.raises(OSError, match="Corrupted file"): 
                        fetch_california_housing_fxt() 
  2. Test for Retry Mechanism on Failed Download -- verify the dataset fetching retries the download the specified number of times before giving up in case of network errors.

    def test_retry_on_failure(fetch_california_housing_fxt, mocker): 
        mocker.patch('sklearn.datasets._california_housing._fetch_remote', side_effect=Exception("Network error")) 
        with pytest.raises(Exception, match="Network error"): 
                fetch_california_housing_fxt(n_retries=2) 
        assert _fetch_remote.call_count == 2 

    (Tags: good first issue, help wanted, Easy).

glemaitre commented 5 months ago

If I am not wrong, the retry mechanism is shared among all the fetch_* functions (apart from fetch_openml.

We therefore don't have a test specifically for each function, but we test the functionality itself, at least from what I can see when we merged the PR: https://github.com/scikit-learn/scikit-learn/pull/28160/files#diff-e86a2571c78195bb3a3837fb36bdc1acfd7ab14908fcf86801121a07f78f1d3dR373-R393

The current the test take into account for the corrupted side but I think that we could test at the level of granularity of this helper function rather than the california housing function.

adrinjalali commented 5 months ago

I don't think we need to improve the tests, since it wouldn't add safety for the end users. These datasets are also used only for educational purposes. So the data being corrupt leads to minimal risks.

However, I wouldn't mind adding a checksum check in our fetch methods to check against known checksums, and if it's not satisfied, either raise or retry.

@monikamarr would you like to open a PR for that?

pranav-bot commented 5 months ago

@adrinjalali can I open a pr checksum checks for that, just need to know where can I find the checksum hashes for the datasets??

monikamarr commented 5 months ago

I don't think we need to improve the tests, since it wouldn't add safety for the end users. These datasets are also used only for educational purposes. So the data being corrupt leads to minimal risks.

However, I wouldn't mind adding a checksum check in our fetch methods to check against known checksums, and if it's not satisfied, either raise or retry.

@monikamarr would you like to open a PR for that?

Yes, absolutely!