nasaharvest / dora

Domain-agnostic Outlier Ranking Algorithms (DORA) - SMD cross-divisional use case demonstration of AI/ML
MIT License

Add ability to pass parameters to data loaders #40

Closed hannah-rae closed 3 years ago

stevenlujpl commented 3 years ago

Hi team (@hannah-rae, @bdubayah, @vinr515, @urebbapr, @wkiri, @emhuff),

I'd like your opinions on the following two approaches for passing parameters to data loaders. I prefer Approach 1 because it gives us the flexibility to load the data sets specified by data_to_fit and data_to_score with different settings. The downside is that we have to enter the same settings twice when we want to load data_to_fit and data_to_score the same way. So here is the question: can we foresee a use case in which we would want to load data_to_fit and data_to_score differently? If so, I think we should implement Approach 1. Otherwise, Approach 2 might be the cleaner solution. Please let me know what you think. Thanks.

Approach 1: modify both data_to_score and data_to_fit.

The data_to_score and data_to_fit options in the config files currently look like:

data_to_fit: '/PATH/TO/DIR/'
data_to_score: '/PATH/TO/DIR/'

To add the ability to pass parameters to data loaders, I changed them to dictionaries. Please see below:

data_to_fit: {
    path: '/PATH/TO/DIR/',
    params: {}
}
data_to_score: {
    path: '/PATH/TO/DIR/',
    params: {}
}

Whatever parameters we put in the params: {} dictionaries will be passed to the _load() functions in data loaders, which is similar to how Outlier Detection and Results Organization modules work.
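To make the trade-off in Approach 1 concrete, here is a minimal sketch of how per-dataset params could be unpacked and forwarded. The names load_dataset and fake_load and the normalize parameter are hypothetical, chosen only for illustration; they are not part of DORA.

```python
# Hypothetical sketch of Approach 1: each dataset entry carries its own params.
# `load_dataset`, `fake_load`, and `normalize` are illustrative names, not DORA APIs.

def fake_load(dir_path, **params):
    """Stand-in for a data loader's _load(); echoes back what it received."""
    return {"path": dir_path, "params": params}


def load_dataset(entry, load_fn):
    """Unpack one config entry of the form {'path': ..., 'params': {...}}."""
    params = entry.get("params") or {}
    return load_fn(entry["path"], **params)


# Parsed form of a config that loads the two data sets with different settings.
config = {
    "data_to_fit": {"path": "/PATH/TO/DIR/", "params": {"normalize": True}},
    "data_to_score": {"path": "/PATH/TO/DIR/", "params": {"normalize": False}},
}

fit_data = load_dataset(config["data_to_fit"], fake_load)
score_data = load_dataset(config["data_to_score"], fake_load)
```

With this layout the two data sets can diverge, but identical settings have to be written out twice, which is the downside mentioned above.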

Approach 2: only modify data_type.

Currently, we use data_type in the config file to specify the data loader we want to use, and it accepts only a string. I can change the string to a dictionary and then use that dictionary to define parameters. Once this approach is implemented, the config option should look like the example below (please note that I renamed data_type to data_loader and moved the type key-value pair inside the dictionary). Whatever parameters we put in the params: {} dictionary will be accessible in the _load() functions in data loaders.

data_loader: {
    type: 'image',
    params: {}
}

hannah-rae commented 3 years ago

@stevenlujpl I can't think of any scenario where we'd want to load data_to_fit and data_to_score differently since they need to have the same data type/dimensionality for the algorithms. I agree that Approach 2 is the cleaner solution.

stevenlujpl commented 3 years ago

@hannah-rae, thanks for the response. I will proceed with Approach 2 then.

wkiri commented 3 years ago

Sounds good to me!

stevenlujpl commented 3 years ago

The ability to pass parameters to data loaders has been implemented. If you are interested in the implementation details, please see this commit (https://github.com/nasaharvest/dora/commit/156a228f5ed8ff13ee5d3aaf128845d2b769ddcb). Please also check out the example config files in this directory to see how to configure the data loaders to pass parameters. Below is a summary of the things we need to be aware of:

  1. The string-type data_type field in the config file has been replaced with the dictionary-type data_loader field, which looks like the example below. The value of the name field should be the name under which the data loader is registered in the dora_data_loader.py script.
data_loader: {
    name: 'test',
    params: {
        p1: 5,
        p2: ['a', 'b', 'c'],
        p3: {
            k: 5 
        }
    }
}
  2. The pseudocode below is an example data loader with parameters. The key-value pairs defined in the params field in the config file will be passed to the loader's _load() function.
class TestLoader(DataLoader):
    def __init__(self):
        super(TestLoader, self).__init__('test')

    def _load(self, dir_path: str, p1, p2, p3) -> dict:
        # If the test loader is invoked using the configuration settings above,
        # the parameter p1 will be 5, the parameter p2 will be ['a', 'b', 'c'], and
        # the parameter p3 will be {'k': 5}.
        data_dict = {}  # placeholder: populate with the loaded data

        return data_dict

test_loader = TestLoader()
register_data_loader(test_loader)
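
For anyone who wants to see the mechanism end to end, the following is a self-contained sketch of how the params dictionary from the configuration above could reach _load() as keyword arguments. The DataLoader base class, the _registry dictionary, and the load() method here are simplified stand-ins written for this example; they are assumptions and may differ from the actual code in the commit.

```python
# Minimal, self-contained sketch. This DataLoader, `_registry`, and `load()`
# are stand-ins written for illustration; they are not DORA's actual code.

_registry = {}


class DataLoader:
    def __init__(self, name):
        self.name = name

    def load(self, dir_path, **params):
        # Forward the config params to the subclass as keyword arguments.
        return self._load(dir_path, **params)

    def _load(self, dir_path, **params):
        raise NotImplementedError


def register_data_loader(loader):
    # Index the loader by the name used in the config's `name` field.
    _registry[loader.name] = loader


class TestLoader(DataLoader):
    def __init__(self):
        super().__init__("test")

    def _load(self, dir_path, p1, p2, p3):
        # Echo the arguments so the hand-off from the config is visible.
        return {"dir": dir_path, "p1": p1, "p2": p2, "p3": p3}


register_data_loader(TestLoader())

# Parsed form of the data_loader config entry shown above.
data_loader_cfg = {
    "name": "test",
    "params": {"p1": 5, "p2": ["a", "b", "c"], "p3": {"k": 5}},
}

loader = _registry[data_loader_cfg["name"]]
data_dict = loader.load("/PATH/TO/DIR/", **data_loader_cfg["params"])
```

In this sketch, a key in params that does not match a parameter of _load(), or a required parameter missing from params, surfaces as a TypeError at call time, so the config keys and the _load() signature need to stay in sync.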

Please let me know if you have questions or encounter any problems.