sustainable-processes / summit

Optimising chemical reactions using machine learning
https://gosummit.readthedocs.io/en/latest/
MIT License
117 stars 24 forks

User defined standardization of inputs when transforming. #242

Open TSAndrews opened 1 year ago

TSAndrews commented 1 year ago

Description

In the current version, standardization options for transforms (min_max_scale_inputs/standardize_inputs) are hardcoded into each of the solver strategies. This makes it difficult for the user to customize them. This is especially important for the less maintained algorithms (such as Nelder Mead / SOBO) that have these settings off by default, as this can significantly reduce algorithm performance.

Recommendation

Move scale/standardization arguments into the initializer of the transform class. This will enable the user to change them relatively easily by just passing an initialized transform class to a strategy. It will remove the need to pass these arguments to each function that uses transform functions and so hardcoding will not be necessary. I believe it would also make a lot more intuitive sense for the user.
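A minimal, self-contained sketch of the proposed pattern (hypothetical classes, not the current Summit API): the transform owns its scaling flags, and a strategy simply defers to whatever transform it is given.

```python
class Transform:
    """Hypothetical transform that owns its standardization settings."""

    def __init__(self, min_max_scale_inputs=False, standardize_inputs=False):
        self.min_max_scale_inputs = min_max_scale_inputs
        self.standardize_inputs = standardize_inputs

    def transform_inputs(self, values):
        if self.min_max_scale_inputs:
            # Min-max scale to [0, 1]
            lo, hi = min(values), max(values)
            return [(v - lo) / (hi - lo) for v in values]
        if self.standardize_inputs:
            # Z-score standardization
            mean = sum(values) / len(values)
            std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
            return [(v - mean) / std for v in values]
        return list(values)


class Strategy:
    """Hypothetical strategy: no hardcoded scaling flags, it defers to the transform."""

    def __init__(self, transform=None):
        self.transform = transform or Transform()

    def suggest(self, values):
        return self.transform.transform_inputs(values)


# The user customizes scaling by passing an initialized transform:
scaled = Strategy(Transform(min_max_scale_inputs=True)).suggest([0.0, 5.0, 10.0])
# scaled == [0.0, 0.5, 1.0]
```

Nothing here needs to be threaded through each strategy method, which is the point of the refactor.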

marcosfelt commented 1 year ago

I agree that the transforms class needs a significant refactor. So if I understand correctly, you're suggesting changing the Transform API to something like this:

class Transform:
    """Pre/post-processing of data for strategies
    ....
    """

    def __init__(self, domain: Domain, min_max_scale_inputs: bool, standardize_inputs: bool, ...,**kwargs):
         ...

If so, I would definitely be open to this. Would you be interested in submitting a PR?

TSAndrews commented 1 year ago

Yeah that's what I was thinking. There are two viable ways I see of implementing this:

  1. min_max_scale_inputs and standardize_inputs can be saved as instance variables, which are used as the defaults when looking for values within kwargs during function calls. Pros: very little code required; backwards compatible with any custom-made or built-in strategy class. Cons: unexpected behaviour is possible if a strategy still hardcodes the options, and it is difficult to make input standardization on by default since min_max_scale_inputs and standardize_inputs are mutually exclusive (if one is on by default, the user would have to turn it off to use the other). Strategies could set it on by default when no transform is passed, but users would have to explicitly select it when using Chimera or another transform.

    class Transform:
        """Pre/post-processing of data for strategies
        ....
        """
        def __init__(self, domain: Domain, min_max_scale_inputs=False, standardize_inputs=False, ..., **kwargs):
            self.min_max_scale_inputs = min_max_scale_inputs
            self.standardize_inputs = standardize_inputs
            ...

        def transform(self, ..., **kwargs):
            min_max_scale_inputs = kwargs.get("min_max_scale_inputs", self.min_max_scale_inputs)
            standardize_inputs = kwargs.get("standardize_inputs", self.standardize_inputs)
            ...
  2. Similar to 1, but the meanings of min_max_scale_inputs and standardize_inputs are redefined slightly.

    • standardize_inputs = True means some input standardization is applied (either min_max or normalization), rather than specifically meaning that input normalization is applied. If False, neither min_max scaling nor normalization of inputs is applied.
    • min_max_scale_inputs = True selects min_max standardization, while min_max_scale_inputs = False selects normalization. min_max_scale_inputs is ignored if standardize_inputs = False.

    I believe this system makes a lot more sense and removes many of the cons of option 1: there is no need for mutual-exclusivity error checking (reducing the chance of users seeing errors), the effect of standardize_inputs is closer to the plain meaning of "standardization", and there are no issues with enabling standardization by default. The main disadvantage is that the change in definition may confuse some users and may change the behaviour of programs using the old semantics.
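    Option 2's flag semantics can be sketched as follows (a hypothetical standalone function, not the current Summit code): standardize_inputs toggles scaling on or off, and min_max_scale_inputs picks which kind is applied.

```python
def transform_inputs(values, standardize_inputs=True, min_max_scale_inputs=True):
    """Sketch of option 2: one on/off flag, one flag to pick the scaling kind."""
    if not standardize_inputs:
        # Scaling disabled entirely; min_max_scale_inputs is ignored.
        return list(values)
    if min_max_scale_inputs:
        # Min-max scaling to [0, 1]
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    # Otherwise z-score normalization
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]
```

Because the two flags answer different questions ("scale at all?" and "which scaling?"), no mutual-exclusivity check is needed and scaling can be on by default.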

I can submit a PR for either of these implementations. Both are relatively simple to implement so I can even submit a PR for both if you want to decide later?

marcosfelt commented 1 year ago

Thanks for thinking this through; it made me think a bit more. I like the direction that option 2 is starting to go in since it allows 1) composability of Transforms and 2) separating the fitting and transform steps.

To that end, I am thinking that it actually might be good to adopt an API similar to scikit-learn pipelines. We would have a Transform class with a fit method for finding means/stds (or anything for that matter) and a transform method for doing the actual transformation. The fit and transform method would get called inside a strategy. Additionally, we could have a Pipeline for composing these transforms together. Therefore, we would remove standardization and min_max scaling from the base transform class and put them in separate classes. The Transform and Pipeline API could either be our own implementation or built on top of scikit-learn.
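A rough sketch of what that scikit-learn-style API could look like (all class names here are hypothetical, not an existing Summit or scikit-learn implementation): each transform learns its parameters in fit and applies them in transform, and a Pipeline chains them.

```python
class Standardize:
    """fit learns mean/std; transform applies z-score standardization."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        self.std = (sum((v - self.mean) ** 2 for v in values) / len(values)) ** 0.5
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]


class MinMaxScale:
    """fit learns min/max; transform rescales to [0, 1]."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        return [(v - self.lo) / (self.hi - self.lo) for v in values]


class Pipeline:
    """Compose transforms; a strategy calls fit/transform in sequence."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return values


out = Pipeline([MinMaxScale(), Standardize()]).fit_transform([0.0, 5.0, 10.0])
```

Because fit is separated from transform, a fitted pipeline can later be applied to arbitrary new points, which is what predicting with a strategy-trained model would need.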

This approach would allow custom transforms by users (with smart defaults encoded into strategies), solving the original issue in this post. Also, separating the fit and transform methods would be a first step towards allowing users to make predictions on arbitrary points with models trained by a strategy; this has been a top request over the last few months (see, for example, #214).

What do you think? Again, thanks for raising this - I've been meaning to get around to it for a while.

TSAndrews commented 1 year ago

That would definitely be useful, I'll have a look into it. Doesn't seem like it would be too difficult. I'll also see what I can do about #214. I've already developed a predict function for SOBO for my own work, which looks like what they were looking for. I'll submit a PR for that and look into the others if I have time.

marcosfelt commented 1 year ago

Okay, sounds great! LMK if you need any help!
