I am ok with this
One disadvantage of this scheme, though, is that the user can't get a featurized dataset without the pruning already having been done on it. In other words, it might be nice to get a featurized dataframe back (containing all features) before tons of features and/or samples are removed.
For example:
User: Oh, here, I will just put my dataset into this matbench preprocessor. Let's see what I get back.
Preprocessor: removes 100 samples and eliminates a bunch of features the user might be expecting
User: Wait, where did my 100 samples go? And why are there only 10 features returned?
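Just as an illustration of the kind of split that would avoid this surprise (the class and method names here are hypothetical, not anything in the current code):

preprocessor = Preprocessing(df)
df_full = preprocessor.featurize()       # featurized dataframe, all features and samples intact
df_pruned = preprocessor.prune(df_full)  # pruning/removal only happens as an explicit second step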
Idk though, what do @albalu @utf @Qi-max think?
I am fine with the proposed pipeline (nice flowchart BTW). I think it's OK if we automatically do some preprocessing and log it; that could be the default behavior. If someone wants to change something in their pipeline, they can define a new one (as simple as copying and pasting 10 lines of code) and then change the args within each step, or even take out some steps (e.g. feature reduction). In other words, they make their custom pipeline while taking advantage of the convenience functions available in matbench. A similar philosophy is implemented in atomate, where the user doesn't have much control over every step of the calculations, only the main stuff, but if they want more control they can define the same workflow and just change a small portion (as I have).
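Roughly what I mean by a custom pipeline (purely illustrative; the step classes, methods, and args are placeholders for whatever matbench ends up exposing):

def my_custom_pipeline(df, target):
    # a copy of the default pipeline with the feature reduction step taken out
    df = Featurization().featurize(df)        # placeholder featurization step
    df = DataCleaner().clean(df)              # placeholder cleaning step
    # feature reduction step intentionally left out
    return ModelSelection().fit(df, target)   # placeholder model selection step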
I see your point; I think that's something that can be handled in the preprocessing module, though. It wouldn't just successively apply the steps, but would also have some control logic for which steps are applied and how to apply them. The __init__ could look something like:
def __init__(self, input_data, prune_features=True, remove_malformed_samples=True, ...):
    # each flag toggles one step of the preprocessing pipeline
    self.prune_features = prune_features
    self.remove_malformed_samples = remove_malformed_samples
    ...
Then the user can set these flags here, or in the top-level interface, to specify how they want the data processed, with the default being to apply all steps in the preprocessing pipeline.
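e.g. (hypothetical usage, assuming the flags above and a Preprocessing class holding them):

# keep every feature, but still drop malformed samples
preprocessor = Preprocessing(df, prune_features=False, remove_malformed_samples=True)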
Agree that the flow chart is nice. As long as the packages are modular, this shouldn't be an issue for users wanting to customize their pipelines.
I guess we then see the package-level classes Preprocessing, ModelSelection, and Analysis as really being convenience classes that bring together the rest of the functionality defined in each package, in the same way that the top-level class will be a convenience class for all the packages put together.
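In code that might amount to something like this (a rough sketch only; none of these internals exist yet and the names are placeholders):

class Preprocessing:
    # convenience class: just wires together the pieces defined in this package
    def __init__(self, df):
        df = DataCleaner().clean(df)             # placeholder cleaning step
        self.df = FeatureReduction().reduce(df)  # placeholder feature pruning step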
It would still be good to have solid sub-packages for feature selection, featurization, cleaning, etc. We should probably refactor the Preprocessing class --> DataCleaner or something.
@Doppe1g4nger I think we will hold off on moving Featurization, FeaturizerSelection, and Preprocessing into the same subpackage for now, although I will raise a separate issue/todo for splitting the [current] Preprocessing into two classes as you suggested, DataCleaner and FeatureReduction or something.
We can always reopen later if moving everything into one subpackage for preprocessing is advantageous
In addition to the issue in #87, I think it may also be worth looking at the way the training data generation and handling process is structured, e.g. what are currently the top-level modules metalearning, featurization, and preprocessing.
I think it was agreed that metalearning isn't a very accurate descriptor of what that submodule does. I think a better name would be something like "heuristic featurizer selection" or maybe just "featurizer selection".
A bigger sell I'd also like to put forward is that preprocessing is a bit of a misnomer for what that submodule currently does. Really all three of the above submodules are part of a preprocessing pipeline. What is currently preprocessing is more accurately a combined data cleaning and feature pruning step. It would make sense to bundle each of these under a comprehensive "preprocessing" or "data handling" module and split out the functionality of each submodule into its constituent steps. That way every top level module produces some complete product for the user and submodules then represent steps to completing that product. See the below diagram on what I think the better structure would look like:
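Roughly the kind of layout I have in mind (module and class names here are placeholders, not final):

# matbench/
#     preprocessing/                <- one comprehensive data handling package
#         featurizer_selection.py   <- what is currently "metalearning"
#         featurization.py          <- what is currently the top-level "featurization"
#         cleaning.py               <- the data cleaning half of current "preprocessing"
#         feature_reduction.py      <- the feature pruning half of current "preprocessing"
#         core.py                   <- convenience class tying the steps together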