We want to add a new feature to our DFP pipeline: for each row, the most recent log events leading up to the current event. I'm using sklearn's MultiLabelBinarizer to encode it, because the raw column data explodes computationally. What I can't figure out is how to incorporate this into Morpheus. Currently I run the binarizer after Morpheus loads the data and right before it fits the AE model. Training starts, but it eventually errors out.
The new feature is a list of event type strings, for example: [UpdateInstanceAssociationStatus, UpdateInstance, DescribeInstance, DescribeBucket, DescribeInstance]. The MultiLabelBinarizer generates one new binary column per possible event in the original column, set to 1 if that event is present in the row's list of events and 0 otherwise. That should make it equivalent to the other columns that are being one-hot encoded, so I think the autoencoder can deal with it. There are ~800 possible values but over 1M distinct list combinations in the data, which is why training was blowing up before I tried the binarizer.
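To make the encoding described above concrete, here is a minimal sketch on toy data. The column name `last5_events` comes from the reproducer below; the three rows are made up for illustration. Note that a repeated event within one row collapses to a single 1, since the binarizer records presence rather than counts.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows (invented for illustration); each cell is a comma-separated event list.
df = pd.DataFrame({
    "last5_events": [
        "UpdateInstance, DescribeInstance, DescribeInstance",
        "DescribeBucket",
        "DescribeInstance, DescribeBucket",
    ]
})

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df["last5_events"].str.split(", ")),
    index=df.index,
    columns=mlb.classes_,  # one column per distinct event, sorted by name
)
print(encoded)
```

With ~800 distinct events this yields ~800 binary columns, regardless of how many distinct list combinations appear in the data.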
Minimum reproducible example
Change to on_data in dfp_training.py:
def on_data(self, message: MultiDFPMessage):
    if (message is None or message.mess_count == 0):
        return None

    user_id = message.user_id

    model = AutoEncoder(**self._model_kwargs)

    meta_df = message.get_meta_dataframe()

    # Only train on the feature columns
    train_df = meta_df[meta_df.columns.intersection(self._config.ae.feature_columns)]

    mlb = MultiLabelBinarizer(sparse_output=True)
    meta_df = meta_df.join(
        pd.DataFrame.sparse.from_spmatrix(
            mlb.fit_transform(meta_df['last5_events'].str.split(', ')),
            index=meta_df.index,
            columns=mlb.classes_).sparse.to_dense())

    validation_df = None
    run_validation = False
    train_df = meta_df

    # Split into training and validation sets
    if self._validation_size > 0.0:
        train_df, validation_df = train_test_split(train_df, test_size=self._validation_size, shuffle=False)
        run_validation = True

    logger.debug("Training AE model for user: '%s'...", user_id)
    model.fit(train_df, epochs=self._epochs, val=validation_df, run_validation=run_validation)
    logger.debug("Training AE model for user: '%s'... Complete.", user_id)

    output_message = MultiAEMessage(meta=DFPMessageMeta(df=meta_df, user_id=user_id),
                                    mess_offset=message.mess_offset,
                                    mess_count=message.mess_count,
                                    model=model)

    return output_message
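For experimenting with the preprocessing step outside Morpheus, the following is a rough sketch that uses only pandas and sklearn. The helper name `add_event_columns`, its parameters, and the toy columns are hypothetical and not part of Morpheus. It differs from the snippet above in that it selects the feature columns after the join and drops the raw string column, so only the configured features plus the binary columns reach the model. Whether that has any bearing on the error reported here is not established.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def add_event_columns(df, list_column, feature_columns, sep=", "):
    """Hypothetical helper: binarize `list_column` and return only model inputs."""
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(
        mlb.fit_transform(df[list_column].str.split(sep)),
        index=df.index,
        # Prefix the new columns so they cannot collide with existing column names.
        columns=[f"{list_column}_{name}" for name in mlb.classes_],
    )
    # Keep configured feature columns that exist, minus the raw string column.
    keep = [c for c in feature_columns if c in df.columns and c != list_column]
    return df[keep].join(encoded)


# Toy frame standing in for the per-user dataframe (values invented).
meta_df = pd.DataFrame({
    "username": ["a", "a"],
    "eventName": ["DescribeInstance", "UpdateInstance"],
    "last5_events": ["UpdateInstance, DescribeInstance", "DescribeBucket"],
})
train_df = add_event_columns(meta_df, "last5_events", ["eventName", "last5_events"])
print(train_df.columns.tolist())
```

Because the binarizer is fit per call, the set of generated columns depends on the events seen in that dataframe. Keeping the columns consistent between training and inference would need a fixed class list, which this sketch does not cover.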
Version
23.01
Which installation method(s) does this occur on?
Source
Relevant log output
Full env printout
No response
Other/Misc.
No response