microsoft / responsible-ai-toolbox-mitigations

Python library for implementing Responsible AI mitigations.
https://responsible-ai-toolbox-mitigations.readthedocs.io/en/latest/
MIT License

Missing example dataset #24

Closed morrissharp closed 2 years ago

morrissharp commented 2 years ago

The HR promotions dataset listed in Simple Example is not in the dataset directory.

https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/83c26a3fc61f15c609734126bd4e76e0d922fdbc/notebooks/dataprocessing/module_tests/model_test.ipynb?short_path=e02e26f#L306-L307

mrfmendonca commented 2 years ago

There were instructions on how to download it before… I think I removed them without realizing. Thanks for letting us know.

morrissharp commented 2 years ago

OK, I've taken a better look, and this is what I see:

morrissharp commented 2 years ago

I'm not sure if this is the best way to do things, but this function should work in each of the examples.

import os
import zipfile
import pathlib
from urllib.request import urlretrieve


def download_mitigations_datasets(download_dir: str = None,
                                  dataset_name: str = 'mitigations-datasets.2.22.2022',
                                  dataset_url: str = 'https://publictestdatasets.blob.core.windows.net/data/',
                                  exists_ok: bool = False):
    """Download the example RAI mitigations dataset."""

    # Walk up the directory tree to find an existing datasets/ directory.
    if not download_dir:
        cwd = pathlib.Path('.').resolve()
        while 'datasets' not in os.listdir(cwd):
            if cwd.parent == cwd:
                # Reached the filesystem root without finding it.
                raise FileNotFoundError("Could not locate a 'datasets' directory.")
            cwd = cwd.parent
        download_dir = os.path.join(cwd, 'datasets')

    zipfilename = dataset_name + '.zip'
    zip_path = os.path.join(download_dir, zipfilename)

    # Download and extract only if the zip file is not already present.
    # Note: the URL is built by string concatenation, not os.path.join,
    # which would produce backslashes on Windows.
    if not pathlib.Path(zip_path).exists():
        urlretrieve(dataset_url + zipfilename, zip_path)
        # Unzip the downloaded archive.
        with zipfile.ZipFile(zip_path, "r") as unzip:
            unzip.extractall(download_dir)
    elif not exists_ok:
        raise OSError('Dataset already exists. To redownload, delete the files first, '
                      'or set exists_ok=True to continue.')
    return os.path.join(download_dir, dataset_name)
mrfmendonca commented 2 years ago

I updated the notebooks that used one of the datasets in the datasets/ folder. Now, all of these notebooks call a function that downloads the dataset dynamically (but only if the dataset doesn't already exist). This also let me remove the datasets/ folder from the repo, since it is now created at execution time. Thanks for your input!
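The "download only if missing" guard described above can be sketched in isolation. This is a minimal illustration of the pattern, not the repo's actual helper: `fetch_if_missing` and the local demo `fetch` callable are hypothetical names used here for clarity.

```python
import os
import tempfile


def fetch_if_missing(path: str, fetch) -> str:
    """Run fetch(path) only when path does not already exist."""
    if not os.path.exists(path):
        fetch(path)
    return path


# Demo with a local "fetch" in place of a real network download.
calls = []

def demo_fetch(p):
    calls.append(p)          # record that a download happened
    open(p, 'w').close()     # create the file, like a real download would

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, 'dataset.zip')
    fetch_if_missing(target, demo_fetch)  # first call: file missing, fetches
    fetch_if_missing(target, demo_fetch)  # second call: file exists, skipped
    print(len(calls))  # → 1
```

The same idea is what lets the updated notebooks run repeatedly without re-downloading: the expensive fetch is gated on a cheap existence check.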