hackingmaterials / automatminer

An automatic engine for predicting materials properties.

Remove target from autofeaturizer? #334

Closed mm04926412 closed 4 years ago

mm04926412 commented 4 years ago

I am working on an unsupervised problem and therefore have no target. I do not understand why I cannot call the fit and transform functions of AutoFeaturizer, DataCleaner and FeatureReducer without specifying a target.

Do I just need to make a new column of zeros called target to appease the class? I'm very confused and have probably done something wrong, but if it is true that unsupervised tasks cannot be natively autofeaturized, I would request that this be changed.

ardunn commented 4 years ago

Hey there! Please post questions like this on our discussion forum (https://matsci.org); the issue tracker is used for internal development only.

Current requirements for targets

FeatureReducer requires a target because the underlying algorithms require targets. For example, reduction via correlation coefficients requires a target. The ensemble-based reduction in FeatureReducer also requires targets.

DataCleaner requires a target because it automatically removes training samples that are missing parse-able target values during fitting. While the target is not used extensively there, it is required.

AutoFeaturizer can use a target when retrieving cached features to help validate that the retrieved data is correct. It is not needed most of the time, but it is enforced nonetheless.
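To make the first two points concrete, here is a minimal pandas-only sketch of why a cleaning step and a correlation-based reduction step both need a target column. It is not automatminer's implementation; the column names and toy values are made up for illustration.

```python
import pandas as pd

# Toy data; the column names and values are hypothetical.
df = pd.DataFrame(
    {
        "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "feat_b": [5.0, 3.0, 4.0, 1.0, 2.0],
        "my_target": ["2.1", "3.9", "not measured", "8.2", "9.8"],
    }
)
target = "my_target"

# Cleaning: coerce targets to numbers and drop samples whose target cannot
# be parsed. Without a target column there is nothing to check here.
df[target] = pd.to_numeric(df[target], errors="coerce")
cleaned = df.dropna(subset=[target])

# Reduction: score each feature by its absolute Pearson correlation with
# the target. This score is undefined when no target exists.
features = cleaned.drop(columns=[target])
scores = features.corrwith(cleaned[target]).abs().sort_values(ascending=False)
print(scores)
```

For a purely unsupervised workflow, neither step has a natural equivalent, which is why removing the target requirement would need a code change rather than a configuration flag.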

Supervised vs Unsupervised

This package is intended to work as a supervised end to end pipeline first and foremost. However, I don't see why it can't be used for feature generation or other things for unsupervised tasks. I'd be glad to accept a pull request doing this, so if you have code to contribute, please make a PR! The things we could do are:

mm04926412 commented 4 years ago

Thank you for the reply!

That makes sense, but why is the target required when calling transform? After reading your reply I understand why it's required for fit, but why is it required for transform?

In addition to the unsupervised problem, I also have a supervised problem, and it is unclear to me how I can featurize the unlabelled data I wish to make predictions on.

ardunn commented 4 years ago

For MatPipe

For transform (MatPipe.predict) you don't need it as an argument:

    @check_fitted
    def predict(self, df, ignore=None, output_col=None):
        """
        Predict a target property of a set of materials.
        The dataframe should have the same target property as the dataframe
        used for fitting. The dataframe should also have the same materials
        property types as the dataframe used for fitting (e.g., if you fit a
        matpipe to a df containing composition, your prediction df should have
        a column for composition). If you used custom features, make sure those
        are included in your prediction df as well.
        Args:
            df (pandas.DataFrame): Pipe will be fit to this dataframe.
            ignore ([str], None): Select which columns to ignore.
                These columns will not be used for learning/prediction, but will
                simply be appended back to the predicted df at the end of
                prediction REGARDLESS of the pipeline configuration.
                This will not stop samples from being dropped. If
                columns not present in the fitting are not ignored, they will
                be automatically dropped. Similarly, if the AutoFeaturizer
                is not configured to preserve inputs and they are not ignored,
                they will be automatically dropped. Ignoring columns supersedes
                all inner operations.
                Select columns using:
                - [str]: String names of columns to ignore.
                - None: input columns will be automatically dropped if they are
                    inputs. User defined features will be preserved if usable
                    as ML input.
        Returns:
            (pandas.DataFrame): The dataframe with target property predictions.
        """

For constituent DFTransformers

I haven't yet removed the target argument from transform in the underlying classes; the signature is just enforced to match the inherited method. In transform, it's only there to make sure you are fitting and transforming on the same target string and that the columns are ordered sensibly:

        # Ensure the order of columns is identical
        if target in df.columns:
            logger.info(self._log_prefix + "Reordering columns...")
            df = df[self.fitted_df.columns]
        else:
            logger.info(
                self._log_prefix + "Target not found in df columns. Ignoring..."
            )
            reordered_cols = self.fitted_df.drop(columns=[target]).columns
            df = df[reordered_cols]
        return df

You can just pass in a string matching the target of the fitted DFTransformer. So if you fit on "target 1", just use "target 1" as your transform target and it should work even if it is not in the dataframe. You should not need an empty column, just pass in the correct target string to transform.
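As a rough standalone illustration of that reordering logic, the sketch below reproduces the branch with plain pandas. The helper name `reorder_like_fit` and the toy columns are hypothetical and not part of automatminer; it only shows that passing the fitted target string works when the column is absent from the new dataframe.

```python
import pandas as pd


def reorder_like_fit(df, fitted_columns, target):
    """Hypothetical helper mirroring the column-reordering branch above."""
    if target in df.columns:
        # Target present: use the exact column order seen during fit.
        return df[list(fitted_columns)]
    # Target absent: use the fitted order with the target column left out.
    reordered_cols = [c for c in fitted_columns if c != target]
    return df[reordered_cols]


# Column order recorded at fit time (toy example).
fitted_columns = ["feat_a", "feat_b", "target 1"]

# Unlabelled data: no target column, and columns in a different order.
new_df = pd.DataFrame({"feat_b": [0.2, 0.4], "feat_a": [1.0, 2.0]})

out = reorder_like_fit(new_df, fitted_columns, target="target 1")
print(list(out.columns))  # ['feat_a', 'feat_b']
```

No placeholder column is created in either branch, so a target column showing up in the output would have to come from somewhere else in the calling code.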

ardunn commented 4 years ago

Short answer: Just give the same target string to transform that you used for fit, even if that target is not in your new df.

mm04926412 commented 4 years ago

I have tried this, but when transforming, a new column appears with my target as its name, containing values seemingly pulled from nowhere. I find this very confusing.

The output is a pandas dataframe with columns "PCA 0", "PCA 1", etc., followed by the target. As the target is not present in the df and I have not attempted to fit a pipeline, I have no clue where this target came from.

mm04926412 commented 4 years ago

My apologies, this was a bug on my end to do with the way my fit function was handling things in place! Thank you for the assistance; the clarification as to why y was required was still very helpful!