Closed hannah-rae closed 3 years ago
@stevenlujpl I can't think of any scenario where we'd want to load data_to_fit and data_to_score differently, since they need to have the same data type/dimensionality for the algorithms. I agree that Approach 2 is the cleaner solution.
@hannah-rae, thanks for the response. I will proceed with Approach 2 then.
Sounds good to me!
The ability to pass parameters to data loaders has been implemented. If you are interested in the implementation details, please see this commit (https://github.com/nasaharvest/dora/commit/156a228f5ed8ff13ee5d3aaf128845d2b769ddcb). Please also check out the example config files in this directory to see how to configure the data loaders to pass parameters. Below is a summary of the things we need to be aware of:
The data_type field in the config file has been replaced with a dictionary-type data_loader field, which looks like the example below. The value of the name field should be the name under which the data loader is registered in the dora_data_loader.py script.

    data_loader: {
      name: 'test',
      params: {
        p1: 5,
        p2: ['a', 'b', 'c'],
        p3: {
          k: 5
        }
      }
    }
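To illustrate how the name field ties a config entry to a registered loader, here is a minimal sketch. The registry dictionary, the stripped-down DataLoader class, and the resolve_data_loader() helper are hypothetical stand-ins written for this example; the actual code in dora_data_loader.py may be organized differently.

```python
# Hypothetical stand-ins: the real registry in dora_data_loader.py may differ.
_registry = {}


class DataLoader:
    def __init__(self, name):
        self.name = name


def register_data_loader(loader):
    # Index each loader by the name it was constructed with.
    _registry[loader.name] = loader


def resolve_data_loader(data_loader_cfg):
    """Split a data_loader config entry into (loader, params)."""
    name = data_loader_cfg['name']
    # A missing or empty params field is treated as "no extra parameters".
    params = data_loader_cfg.get('params') or {}
    if name not in _registry:
        known = ', '.join(sorted(_registry)) or '(none)'
        raise KeyError(f'Unknown data loader {name!r}; registered: {known}')
    return _registry[name], params


register_data_loader(DataLoader('test'))

# The same settings as the config entry above, written as a Python dict.
config = {
    'data_loader': {
        'name': 'test',
        'params': {'p1': 5, 'p2': ['a', 'b', 'c'], 'p3': {'k': 5}},
    }
}
loader, params = resolve_data_loader(config['data_loader'])
```

If the name in the config does not match any registered loader, this sketch raises a KeyError that lists the registered names, which makes a typo in the config easier to spot.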
The contents of the params field in the config file will be passed to the loader's _load() function as arguments.

    class TestLoader(DataLoader):
        def __init__(self):
            super(TestLoader, self).__init__('test')

        def _load(self, dir_path: str, p1, p2, p3) -> dict:
            # If the test loader is invoked using the configuration settings above,
            # the parameter p1 will be 5, the parameter p2 will be ['a', 'b', 'c'], and
            # the parameter p3 will be {'k': 5}.
            data_dict = {}  # placeholder: fill with the loaded data
            return data_dict

    test_loader = TestLoader()
    register_data_loader(test_loader)
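One way the params dictionary could reach _load() is keyword unpacking. The sketch below is only an assumption about the mechanism, not the code from the commit: the load() wrapper, the signature check, and the EchoLoader class are all invented for illustration.

```python
import inspect


class DataLoader:
    # Hypothetical base class; DORA's actual DataLoader may differ.
    def __init__(self, name):
        self.name = name

    def load(self, dir_path, params):
        # Check the config keys against _load()'s signature before calling,
        # so a missing or misspelled key gives a readable error.
        sig = inspect.signature(self._load)
        try:
            sig.bind(dir_path, **params)
        except TypeError as err:
            raise ValueError(
                f"Bad params for loader '{self.name}': {err}"
            ) from err
        # Each key in params becomes a keyword argument of _load().
        return self._load(dir_path, **params)


class EchoLoader(DataLoader):
    def __init__(self):
        super().__init__('test')

    def _load(self, dir_path, p1, p2, p3):
        # Echo the arguments back so the mapping from config to call is visible.
        return {'dir_path': dir_path, 'p1': p1, 'p2': p2, 'p3': p3}


params = {'p1': 5, 'p2': ['a', 'b', 'c'], 'p3': {'k': 5}}
data_dict = EchoLoader().load('/path/to/data', params)
```

With this design, the keys under params must match the parameter names of _load() exactly; a key that is missing or not accepted by _load() is reported before any data is read.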
Please let me know if you have questions or encounter any problems.
Hi team (@hannah-rae, @bdubayah, @vinr515, @urebbapr, @wkiri, @emhuff),
I'd like to ask your opinions on the following two approaches for passing parameters to data loaders. I prefer Approach 1 because it gives us the flexibility to use different settings to load the data sets specified by data_to_fit and data_to_score. The downside is that we have to enter the same settings twice if we want to load data_to_fit and data_to_score the same way. Here is the question: can we foresee a use case in which we want to load data_to_fit and data_to_score differently? If so, I think we should implement Approach 1. Otherwise, Approach 2 might be a cleaner solution. Please let me know what you think. Thanks.

Approach 1: modify both data_to_score and data_to_fit.

The data_to_score and data_to_fit options in the config files currently look like:

To add the ability to pass parameters to data loaders, I changed them to dictionaries. Please see below:
Whatever parameters we put in the params: {} dictionaries will be passed to the _load() functions in data loaders, which is similar to how the Outlier Detection and Results Organization modules work.

Approach 2: only modify data_type.

Currently, we use data_type in the config file to specify the data loader we want to use. It currently accepts only a string. I can change the string parameter to a dictionary, and then we can use the dictionary to define parameters. Once this approach is implemented, the config option should look like below (please note that I changed data_type to data_loader and put the type key-value pair inside the dictionary). Whatever parameters we put in the params: {} dictionary will be accessible in the _load() functions in data loaders.
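To make the consequence of Approach 2 concrete, here is a rough sketch in which a single data_loader entry drives the loading of both data sets. Everything in it is an assumption made for illustration: the load_fit_and_score() helper, the fake_loader function, the loaders lookup table, and the idea that data_to_fit and data_to_score hold directory paths. It uses the name key from the implemented version described earlier in the thread.

```python
def load_fit_and_score(config, loaders):
    """Load both data sets with the one loader and params from data_loader."""
    entry = config['data_loader']
    load_fn = loaders[entry['name']]
    params = entry.get('params', {})
    # The same loader and the same params are used for both data sets,
    # so they are guaranteed to be read the same way.
    data_to_fit = load_fn(config['data_to_fit'], **params)
    data_to_score = load_fn(config['data_to_score'], **params)
    return data_to_fit, data_to_score


def fake_loader(dir_path, scale=1):
    # Hypothetical loader that only records how it was called.
    return {'source': dir_path, 'scale': scale}


config = {
    'data_loader': {'name': 'fake', 'params': {'scale': 2}},
    'data_to_fit': '/path/to/fit',
    'data_to_score': '/path/to/score',
}
fit, score = load_fit_and_score(config, {'fake': fake_loader})
```

Because the settings are written once, the two data sets cannot drift apart in how they are loaded, which is the trade-off against the per-data-set flexibility of Approach 1.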