Closed BartleyR closed 1 month ago
The following is a quick breakdown of the process and the necessary steps to update DFEncoder to utilize NVTabular.
We should update the `morpheus.utils.column_info` library to use NVTabular, since the functionality is nearly identical but NVT will provide more features that are regularly tested:

- Add `nvtabular` to the conda environment
- Replace the functionality in `morpheus.utils.column_info` with equivalents from NVTabular
  - Map the `ColumnInfo` classes to NVTabular operation equivalents
  - Replace the `DataFrameInputSchema` class with `nvt.Schema`/`nvt.Workflow`
  - Potentially replace `process_dataframe` with `nvt.Workflow.fit_transform()`
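As a rough illustration of the contract `process_dataframe` would map onto, the sketch below emulates the fit/transform split that `nvt.Workflow.fit_transform()` provides. The class and function names here are illustrative stand-ins, not the real NVTabular API:

```python
# Toy stand-in for the fit/transform contract that nvt.Workflow provides.
# Names and structure are illustrative only -- not the real NVTabular API --
# but they show the stateful shape process_dataframe would map onto.
class ToyWorkflow:
    def __init__(self, ops):
        self.ops = ops      # list of (fit_fn, transform_fn) pairs
        self.state = []

    def fit(self, rows):
        # Each op computes its statistics from the data, the way Categorify
        # builds its category mapping during fit().
        self.state = [fit_fn(rows) for fit_fn, _ in self.ops]
        return self

    def transform(self, rows):
        for (_, transform_fn), state in zip(self.ops, self.state):
            rows = [transform_fn(row, state) for row in rows]
        return rows

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


# A Categorify-like op: map each string value to a stable integer id.
def categorify_fit(rows):
    return {v: i for i, v in enumerate(sorted({r["user"] for r in rows}))}

def categorify_transform(row, mapping):
    return {**row, "user": mapping[row["user"]]}


wf = ToyWorkflow([(categorify_fit, categorify_transform)])
result = wf.fit_transform([{"user": "bob"}, {"user": "alice"}])
# result == [{"user": 1}, {"user": 0}]
```

The key point is that the workflow owns the fitted state (category maps, normalization statistics), which is exactly the part of `process_dataframe` that would be delegated to NVTabular.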
Update `dfencoder` to build on the new `morpheus.utils.column_info` functionality. Given an input dataframe, `dfencoder` currently does the following two things: 1) set up the dataframe schema based on the column types and values, and 2) build the structure of the autoencoder model from that schema. The schema handling should be replaced with NVTabular (most likely just the updates to `morpheus.utils.column_info`), and the model should then be built from the NVT schema.
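A minimal sketch of those two steps is below. The type-classification rules and the layer-sizing arithmetic are illustrative assumptions, not dfencoder's actual logic:

```python
def infer_schema(columns):
    """Step 1: classify each column from its sample values (illustrative rules)."""
    schema = {}
    for name, values in columns.items():
        # Check bool before numeric: isinstance(True, int) is True in Python.
        if all(isinstance(v, bool) for v in values):
            schema[name] = "binary"
        elif all(isinstance(v, (int, float)) for v in values):
            schema[name] = "numeric"
        else:
            schema[name] = "categorical"
    return schema


def build_model_spec(schema, embedding_dim=8):
    """Step 2: derive the autoencoder input layout from the schema."""
    spec = {"numeric": [], "binary": [], "categorical": []}
    for name, kind in schema.items():
        spec[kind].append(name)
    # One input unit per numeric/binary column, one embedding per categorical
    # column (embedding_dim is an assumed placeholder value).
    spec["input_width"] = (len(spec["numeric"]) + len(spec["binary"])
                           + embedding_dim * len(spec["categorical"]))
    return spec


schema = infer_schema({"bytes": [10, 20], "user": ["a", "b"], "ok": [True, False]})
# schema == {"bytes": "numeric", "user": "categorical", "ok": "binary"}
spec = build_model_spec(schema)
# spec["input_width"] == 1 + 1 + 8 == 10
```

Under the proposal, step 1 is what gets replaced by an NVT schema, while step 2 (building the model from that schema) remains in dfencoder.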
Move `dfencoder` from its own repository into Morpheus:

- Copy the `dfencoder` source files into a new submodule: `morpheus.models.dfencoder` (keeping the existing implementation available as `morpheus.models.dfencoder_old`)
- `dfencoder.AutoEncoder.init_features()`: `init_features` takes a sample dataframe and determines the schema from the DataFrame's column types
- Update `dfencoder.AutoEncoder` to allow manual overriding of the DataFrame schema (i.e., supply specific operations or an entire schema, which would bypass `init_features` altogether)
- `dfencoder.AutoEncoder.build_model()`: `build_model` takes the determined schema from a sample DataFrame and builds the autoencoder model from this schema
- `dfencoder.AutoEncoder.prepare_df()`: `prepare_df` takes an input DataFrame and runs it through the preprocessor to get the final input before passing it to PyTorch. Replace it with an `nvt.Workflow` that takes the schema determined from the `init_features` function
- `dfencoder.AutoEncoder.fit()`: `fit` currently calls `build_model` on first use, then `prepare_df`, before finally running a simple PyTorch training loop over batches. This has a few downsides: 1) while the training loop is batched, the ETL and validation steps are not, and 2) the loop is poorly implemented, which makes extending/improving it difficult.

Once Morpheus and DFEncoder have been updated to use NVTabular, there will need to be updates to the DFP pipelines to match the changes and take advantage of the latest features.
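To illustrate downside 1), the pure-Python sketch below runs the ETL step per batch rather than preprocessing the whole frame up front the way the current `fit()`/`prepare_df()` split does. The function names and loop structure are illustrative, not dfencoder's actual API:

```python
def iter_batches(rows, batch_size):
    """Yield successive fixed-size slices of the input rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def fit(rows, preprocess, train_step, batch_size=2):
    """Run ETL and training per batch, so preprocessing and validation can
    be streamed instead of applied to the entire DataFrame ahead of time."""
    losses = []
    for batch in iter_batches(rows, batch_size):
        prepared = [preprocess(row) for row in batch]   # batched ETL
        losses.append(train_step(prepared))             # batched training
    return losses


rows = [1, 2, 3, 4, 5]
losses = fit(rows, preprocess=lambda r: r * 2, train_step=lambda b: sum(b))
# batches [1,2], [3,4], [5] -> prepared [2,4], [6,8], [10] -> losses [6, 14, 10]
```

With the ETL expressed as an `nvt.Workflow`, the `preprocess` stage above would be the per-batch workflow transform rather than a Python lambda.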
Since the DFP pipelines already use `morpheus.utils.column_info`, the number of changes should be small. For example, options such as `userid_column_name` or `datetime_column_name` should just pick this info up from the schema metadata.

Once a baseline version of the DFEncoder model has been created and validated, we could make a new class, `MultiGpuAutoEncoder`, which derives from `AutoEncoder` but runs the training loop assuming multiple GPUs are available:

- Create `MultiGpuAutoEncoder`, deriving from `AutoEncoder`
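A structural sketch of that subclass relationship is below. The class bodies and the round-robin device assignment are illustrative assumptions, not the planned implementation:

```python
class AutoEncoder:
    """Baseline single-device trainer (sketch)."""

    def __init__(self):
        self.devices = ["gpu:0"]

    def assign_device(self, batch_index):
        # Single-GPU baseline: every batch runs on the one device.
        return self.devices[0]


class MultiGpuAutoEncoder(AutoEncoder):
    """Derives from AutoEncoder but spreads batches across available GPUs."""

    def __init__(self, num_gpus):
        super().__init__()
        self.devices = [f"gpu:{i}" for i in range(num_gpus)]

    def assign_device(self, batch_index):
        # Round-robin batches over the available devices; only the training
        # loop changes, the model-building logic is inherited unchanged.
        return self.devices[batch_index % len(self.devices)]


model = MultiGpuAutoEncoder(2)
placements = [model.assign_device(i) for i in range(4)]
# placements == ["gpu:0", "gpu:1", "gpu:0", "gpu:1"]
```

Keeping the multi-GPU behavior in a subclass means the baseline `AutoEncoder` stays simple while the derived class only overrides the parts of the loop that are device-aware.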
WAG estimate of the level of effort (LOE): 2-3 weeks.
Moving tracking issue to next release
Deprioritizing
Is this a new feature, an improvement, or a change to existing functionality?
Improvement
How would you describe the priority of this feature request?
High
Please provide a clear description of the problem this feature solves
Morpheus currently relies on a fork of dfencoder, especially for Digital Fingerprinting. There are a number of issues with this, including:
This should be replaced with something that is more performant and that is GPU-aware.
Describe your ideal solution
The Merlin team created NVTabular, and it appears that this is a suitable replacement for dfencoder in our pipelines. The steps to get there include:
Describe any alternatives you have considered
No response
Additional context
No response
Code of Conduct