nv-morpheus / Morpheus

Morpheus SDK
Apache License 2.0

[BUG]: DFP Rolling Window error after adding new string list feature #809

Closed efajardo-nv closed 1 year ago

efajardo-nv commented 1 year ago

Version

23.01

Which installation method(s) does this occur on?

Source

Describe the bug.

We want to add a new feature to our DFP pipeline that contains the most recent log events leading up to the current event in a row. I'm using an sklearn MultiLabelBinarizer to get this to work because the raw column data explodes computationally. The problem I'm having is figuring out how to incorporate this into Morpheus. Currently I execute that code after Morpheus loads the data but right before it fits the AE model on it. Training starts, but it eventually errors out.

The new feature is a list of event type strings, something like this: [UpdateInstanceAssociationStatus, UpdateInstance, DescribeInstance, DescribeBucket, DescribeInstance]. The multilabel binarizer generates N new binary columns, one for each possible event in the original column, and gives a 1 or 0 depending on whether that event is present in the row's list of events. So it should be equivalent to the other columns that are being one-hot encoded, and I think the autoencoder can deal with it. There are ~800 possible values but over 1M list combinations in the data, so training was blowing up before I tried the binarizer.
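To make the encoding concrete, here is a minimal dependency-free sketch of what the binarizer step computes. It is not sklearn's implementation: the helper name `binarize_event_lists` is hypothetical, and the sorted column order is an assumption chosen to mirror how `MultiLabelBinarizer` is understood to order `classes_`.

```python
from typing import Dict, List, Tuple


def binarize_event_lists(rows: List[str], sep: str = ", ") -> Tuple[List[str], List[List[int]]]:
    """Turn separator-joined event strings into 0/1 indicator rows.

    Hypothetical helper for illustration only; it mimics the output shape
    of a multilabel binarizer without depending on sklearn.
    """
    # Split each raw string into its list of event names.
    split_rows = [row.split(sep) if row else [] for row in rows]

    # One column per distinct event name, in sorted order (assumed ordering).
    classes = sorted({event for events in split_rows for event in events})
    column_of: Dict[str, int] = {name: i for i, name in enumerate(classes)}

    matrix: List[List[int]] = []
    for events in split_rows:
        indicator = [0] * len(classes)
        for event in events:
            # Repeated events in one row collapse to a single 1.
            indicator[column_of[event]] = 1
        matrix.append(indicator)
    return classes, matrix


rows = [
    "UpdateInstanceAssociationStatus, UpdateInstance, DescribeInstance, DescribeBucket, DescribeInstance",
    "DescribeBucket, DescribeBucket",
]
classes, matrix = binarize_event_lists(rows)
```

With ~800 distinct events this yields ~800 binary columns regardless of how many distinct list combinations appear, which is why it is cheaper than treating each whole list as one category.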

Minimum reproducible example

Change to on_data in dfp_training.py:

def on_data(self, message: MultiDFPMessage):
        if (message is None or message.mess_count == 0):
            return None

        user_id = message.user_id

        model = AutoEncoder(**self._model_kwargs)

        meta_df = message.get_meta_dataframe()

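        # Only train on the feature columns
        # (note: this selection is overwritten by `train_df = meta_df` further down)
        train_df = meta_df[meta_df.columns.intersection(self._config.ae.feature_columns)]
        # New: binarize the comma-separated 'last5_events' strings into one 0/1 column per event type
        mlb = MultiLabelBinarizer(sparse_output=True)
        meta_df = meta_df.join(
                        pd.DataFrame.sparse.from_spmatrix(
                            mlb.fit_transform(meta_df['last5_events'].str.split(', ')),
                            index=meta_df.index,
                            columns=mlb.classes_).sparse.to_dense())
        validation_df = None
        run_validation = False

        # Train on the full frame, including the new binary columns
        train_df = meta_df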
        # Split into training and validation sets
        if self._validation_size > 0.0:
            train_df, validation_df = train_test_split(train_df, test_size=self._validation_size, shuffle=False)
            run_validation = True

        logger.debug("Training AE model for user: '%s'...", user_id)
        model.fit(train_df, epochs=self._epochs, val=validation_df, run_validation=run_validation)
        logger.debug("Training AE model for user: '%s'... Complete.", user_id)

        output_message = MultiAEMessage(meta=DFPMessageMeta(df=meta_df, user_id=user_id),
                                        mess_offset=message.mess_offset,
                                        mess_count=message.mess_count,
                                        model=model)

        return output_message

Relevant log output

[E20230320 13:25:15.442346 2547285 context.cpp:124] linear_segment_0/dfp-rolling-window-5; rank: 0; size: 1; tid: 140020782438144: set_exception issued; issuing kill to current runnable. Exception msg: TypeError: unhashable type: 'numpy.ndarray'
 73%|███████████████████████████████████████████████████████████████████████████████████████                                 | 589/812 [11:34<00:37,  5.89it/s]
At:                                                                                                                                                            
  pandas/_libs/hashtable_class_helper.pxi(5310): pandas._libs.hashtable.PyObjectHashTable._unique                                                              
  /home/xyz/python/morpheus_env/lib/python3.8/site-packages/pandas/core/algorithms.py(563): factorize_array                               
  /home/xyz/python/morpheus_env/lib/python3.8/site-packages/pandas/core/algorithms.py(761): factorize                                     
  /home/xyz/python/morpheus_env/lib/python3.8/site-packages/pandas/core/util/hashing.py(333): _hash_ndarray                               
  /home/xyz/python/morpheus_env/lib/python3.8/site-packages/pandas/core/util/hashing.py(295): hash_array                                  
  /home/xyz/python/morpheus_env/lib/python3.8/site-packages/pandas/core/util/hashing.py(143): <genexpr>

Full env printout

No response

Other/Misc.

No response


efajardo-nv commented 1 year ago

Resolved by clearing the file cache (i.e., deleting the .cache directory).
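For anyone who wants to script that cleanup, a minimal sketch follows. The helper name `clear_file_cache` is hypothetical, and the comment above does not say where the .cache directory lives, so the path is left as a parameter that you need to point at your own setup.

```python
import shutil
from pathlib import Path


def clear_file_cache(cache_dir: Path) -> bool:
    """Delete cache_dir and everything under it.

    Returns True if a directory was removed, False if there was nothing to
    remove. Hypothetical helper; the cache location is an assumption.
    """
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True
```

Calling `clear_file_cache(Path(".cache"))` from the directory where the pipeline was run would be one way to use it, assuming that is where the cache was written. Double-check the path first, since the removal is recursive and cannot be undone.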