microsoft / responsible-ai-toolbox-mitigations

Python library for implementing Responsible AI mitigations.
https://responsible-ai-toolbox-mitigations.readthedocs.io/en/latest/
MIT License

Missing example dataset #24

Closed morrissharp closed 2 years ago

morrissharp commented 2 years ago

The HR promotions dataset listed in Simple Example is not in the dataset directory.

https://github.com/microsoft/responsible-ai-toolbox-mitigations/blob/83c26a3fc61f15c609734126bd4e76e0d922fdbc/notebooks/dataprocessing/module_tests/model_test.ipynb?short_path=e02e26f#L306-L307

mrfmendonca commented 2 years ago

There were instructions on how to download it before… I think I removed them without realizing. Thanks for letting us know.

morrissharp commented 2 years ago

OK, I've taken a better look, and this is what I see:

morrissharp commented 2 years ago

I'm not sure if this is the best way to do things, but this function should work in each of the examples.

import os
import zipfile
import pathlib
from urllib.request import urlretrieve


def download_mitigations_datasets(download_dir: str = None,
                                  dataset_name: str = 'mitigations-datasets.2.22.2022',
                                  dataset_url: str = 'https://publictestdatasets.blob.core.windows.net/data/',
                                  exists_ok: bool = False):
    """Download the example RAI mitigations dataset."""

    # Walk up the directory tree to find an existing datasets/ directory.
    if not download_dir:
        cwd = pathlib.Path('.').resolve()
        while 'datasets' not in os.listdir(cwd):
            if cwd.parent == cwd:
                # Reached the filesystem root without finding it.
                raise FileNotFoundError("Could not locate a 'datasets' directory.")
            cwd = cwd.parent
        download_dir = os.path.join(cwd, 'datasets')

    zipfilename = dataset_name + '.zip'
    zip_path = os.path.join(download_dir, zipfilename)

    # Download and extract only if the zip file is not already present.
    # Note: the URL is built by string concatenation, not os.path.join,
    # which would produce backslashes on Windows.
    if not pathlib.Path(zip_path).exists():
        urlretrieve(dataset_url + zipfilename, zip_path)
        # Unzip the downloaded archive.
        with zipfile.ZipFile(zip_path, "r") as unzip:
            unzip.extractall(download_dir)
    elif not exists_ok:
        raise OSError('Dataset already exists. To redownload, delete the files first, '
                      'or set exists_ok=True to continue.')
    return os.path.join(download_dir, dataset_name)
mrfmendonca commented 2 years ago

I updated the notebooks that used one of the datasets in the datasets/ folder. Now, all of these notebooks call a function that downloads the dataset dynamically (but only if the dataset doesn't already exist). This also let me remove the datasets/ folder from the repo, since it is now created at execution time. Thanks for your input!
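The "download only if missing" guard described above can be sketched in isolation. This is a minimal illustration of the pattern, not the repo's actual helper: `fetch_if_missing` and the local demo `fetch` callable are hypothetical names used here for clarity.

```python
import os
import tempfile


def fetch_if_missing(path: str, fetch) -> str:
    """Run fetch(path) only when path does not already exist."""
    if not os.path.exists(path):
        fetch(path)
    return path


# Demo with a local "fetch" in place of a real network download.
calls = []

def demo_fetch(p):
    calls.append(p)          # record that a download happened
    open(p, 'w').close()     # create the file, like a real download would

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, 'dataset.zip')
    fetch_if_missing(target, demo_fetch)  # first call: file missing, fetches
    fetch_if_missing(target, demo_fetch)  # second call: file exists, skipped
    print(len(calls))  # → 1
```

The same idea is what lets the updated notebooks run repeatedly without re-downloading: the expensive fetch is gated on a cheap existence check.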