FeatureReducer requires a target because the underlying algorithms need one. For example, reduction via correlation coefficients requires a target, as does the ensemble-based reduction in FeatureReducer.
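For illustration only (this is not FeatureReducer's actual code), a correlation-based reducer has to compare every feature against the target column, so there is nothing for it to compute without one:

```python
import pandas as pd


def reduce_by_correlation(df: pd.DataFrame, target: str, n_keep: int = 10) -> list:
    """Toy sketch: keep the n_keep features most correlated with the target."""
    features = df.drop(columns=[target])
    # |Pearson correlation| of each feature with the target column
    corr_with_target = features.corrwith(df[target]).abs()
    return corr_with_target.nlargest(n_keep).index.tolist()
```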
DataCleaner requires a target because during fitting it automatically removes training samples with missing or unparseable target values. While the target is not used extensively, it is still required.
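As a rough sketch of that cleaning step (again, not DataCleaner's actual code), dropping rows with bad targets needs to know which column the target is:

```python
import pandas as pd


def drop_bad_targets(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Toy sketch: remove rows whose target is missing or unparseable."""
    cleaned = df.copy()
    # Coerce unparseable target values (e.g. "n/a") to NaN...
    cleaned[target] = pd.to_numeric(cleaned[target], errors="coerce")
    # ...then drop the rows that have no usable target value.
    return cleaned.dropna(subset=[target])
```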
AutoFeaturizer can use a target when retrieving cached features to validate that the retrieved data is correct. It is not needed most of the time, but the argument is enforced nonetheless.
This package is intended to work as a supervised end-to-end pipeline first and foremost. However, I don't see why it can't be used for feature generation or other things for unsupervised tasks. I'd be glad to accept a pull request doing this, so if you have code to contribute, please make a PR! The things we could do are:
Thank you for the reply!
That makes sense, but why is the target required when calling transform? After reading your reply I understand why it's required for fit, but why is it required for transform?
As well as the unsupervised problem, I have a supervised problem, but it is unclear to me how I can featurize the unlabelled data I wish to perform a prediction on.
For transform you don't need it as an argument:
```python
@check_fitted
def predict(self, df, ignore=None, output_col=None):
    """
    Predict a target property of a set of materials.

    The dataframe should have the same target property as the dataframe
    used for fitting. The dataframe should also have the same materials
    property types as the dataframe used for fitting (e.g., if you fit a
    matpipe to a df containing composition, your prediction df should have
    a column for composition). If you used custom features, make sure those
    are included in your prediction df as well.

    Args:
        df (pandas.DataFrame): The dataframe to make predictions on.
        ignore ([str], None): Select which columns to ignore.
            These columns will not be used for learning/prediction, but
            will simply be appended back to the predicted df at the end
            of prediction REGARDLESS of the pipeline configuration.

            This will not stop samples from being dropped. If columns
            not present in the fitting are not ignored, they will be
            automatically dropped. Similarly, if the AutoFeaturizer is
            not configured to preserve inputs and they are not ignored,
            they will be automatically dropped. Ignoring columns
            supersedes all inner operations.

            Select columns using:
                - [str]: String names of columns to ignore.
                - None: input columns will be automatically dropped if
                  they are inputs. User defined features will be
                  preserved if usable as ML input.

    Returns:
        (pandas.DataFrame): The dataframe with target property predictions.
    """
```
I haven't removed the need for the target arg from transform for the underlying classes yet; they are just enforced to be the same as the inherited method. In transform, it's only used to make sure you are fitting and transforming on the same target string and that the columns are ordered sensibly:
```python
# Ensure the order of columns is identical
if target in df.columns:
    logger.info(self._log_prefix + "Reordering columns...")
    df = df[self.fitted_df.columns]
else:
    logger.info(
        self._log_prefix + "Target not found in df columns. Ignoring..."
    )
    reordered_cols = self.fitted_df.drop(columns=[target]).columns
    df = df[reordered_cols]
return df
```
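The reordering itself is just plain pandas column selection; a standalone illustration with made-up column names:

```python
import pandas as pd

fitted_cols = ["feat_a", "feat_b", "my_target"]  # column order seen during fit
new_df = pd.DataFrame({"feat_b": [1.0], "feat_a": [2.0]})  # target absent

# Same idea as the else branch above: drop the target from the fitted column
# order, then select the remaining columns in that order.
reordered = new_df[[c for c in fitted_cols if c != "my_target"]]
print(list(reordered.columns))  # ['feat_a', 'feat_b']
```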
You can just pass in a string matching the target of the fitted DFTransformer. So if you fit on "target 1", just use "target 1" as your transform target and it should work even if it is not in the dataframe. You should not need an empty column; just pass in the correct target string to transform.
Short answer: Just give the same target string to transform that you used for fit, even if that target is not in your new df.
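For example, with one of the underlying transformers (a minimal sketch: the column names and values are made up, and I'm assuming the top-level DataCleaner import):

```python
import pandas as pd
from automatminer import DataCleaner

train_df = pd.DataFrame({
    "feat_a": [1.0, 2.0, 3.0],
    "feat_b": [0.1, 0.2, 0.3],
    "target 1": [10.0, 20.0, 30.0],
})
new_df = pd.DataFrame({"feat_a": [4.0], "feat_b": [0.4]})  # no "target 1" column

cleaner = DataCleaner()
cleaner.fit(train_df, "target 1")

# Pass the same target string used for fit, even though new_df lacks the column.
cleaned_new = cleaner.transform(new_df, "target 1")
```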
I have tried this, but when transforming, a new data column appears with my target as its name, with values seemingly pulled from nowhere. I find this very confusing.
The output is a pandas dataframe with columns of the form "PCA 0", "PCA 1", etc., and then the target. As the target is not present in the df and I have not attempted to fit a pipeline, I have no clue where this target came from.
My apologies, this was a bug on my end to do with the way my fit function was handling things in place! Thank you for the assistance; the clarification as to why y was required was still very helpful!
I am attempting to perform an unsupervised problem and therefore lack a target. I do not understand why I cannot call the fit and transform functions of AutoFeaturizer, DataCleaner, and FeatureReducer without specifying a target.
Do I just need to make a new column of zeros called target to appease the class? I'm very confused and have probably done something wrong, but if it is true that unsupervised tasks cannot be natively autofeaturized, I would request that this be changed.