nv-morpheus / Morpheus


[EPIC]: Replace dfencoder with NVTabular #517

Closed BartleyR closed 1 month ago

BartleyR commented 1 year ago

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

High

Please provide a clear description of problem this feature solves

Morpheus currently relies on a fork of dfencoder, especially for Digital Fingerprinting, which raises a number of maintenance and performance issues. It should be replaced with something that is more performant and GPU-aware.

Describe your ideal solution

The Merlin team created NVTabular, and it appears that this is a suitable replacement for dfencoder in our pipelines. The steps to get there include:

### Tasks
- [x] Evaluate NVTabular and establish what the parity is between it and dfencoder
- [x] Document differences and requirements for NVTabular and communicate with NVTabular team
- [ ] https://github.com/nv-morpheus/Morpheus/issues/753
- [ ] https://github.com/nv-morpheus/Morpheus/issues/862
- [ ] https://github.com/nv-morpheus/Morpheus/issues/865
- [ ] https://github.com/nv-morpheus/Morpheus/issues/870
- [ ] https://github.com/nv-morpheus/Morpheus/issues/871
- [x] Create a proof-of-concept of the Digital Fingerprinting workflow using NVTabular
- [ ] Benchmark the proof-of-concept to compare
- [ ] Transition existing code using dfencoder to NVTabular

Describe any alternatives you have considered

No response

Additional context

No response


mdemoret-nv commented 1 year ago

The following is a quick breakdown of the process and the steps necessary to update DFEncoder to utilize NVTabular.

Updates to Morpheus core

We should update the morpheus.utils.column_info library to use NVTabular, since the functionality is nearly identical and NVT provides more features that are regularly tested.

  1. Update dependencies to include nvtabular in the conda environment
  2. Replace the classes and functions in morpheus.utils.column_info with equivalents from NVTabular
    1. Map all implementations of ColumnInfo to NVTabular operation equivalents
      1. If any cannot be mapped, create custom operations
    2. Replace all uses of the DataFrameInputSchema class with nvt.Schema
      1. This could potentially use nvt.Workflow as well
    3. Replace all uses of process_dataframe with nvt.Workflow.fit_transform()
    4. Add metadata tags for use in specific pipeline implementations (i.e. UserID column tags, date tags, etc.)
    5. Ideally, we would keep this as backward compatible as possible, keeping the existing class names and public API while replacing the implementation with NVT
  3. Update documentation around the new changes to morpheus.utils.column_info
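To illustrate the intent of step 2, here is a minimal sketch of a schema-driven transform pipeline in the spirit of nvt.Workflow.fit_transform(), written in plain Python/pandas. NVTabular itself is not imported; the op helpers and apply_workflow are hypothetical stand-ins for NVT operations and nvt.Workflow, not the real API:

```python
import pandas as pd

# Hypothetical sketch: each "op" is a function df -> df; the workflow applies
# them in order, mirroring what process_dataframe does with a
# DataFrameInputSchema today and what an nvt.Workflow of ops would do instead.
def rename_op(mapping):
    return lambda df: df.rename(columns=mapping)

def datetime_op(column):
    return lambda df: df.assign(**{column: pd.to_datetime(df[column])})

def apply_workflow(df, ops):
    for op in ops:
        df = op(df)
    return df

df = pd.DataFrame({"user": ["a"], "ts": ["2023-01-01"]})
out = apply_workflow(df, [rename_op({"user": "username"}), datetime_op("ts")])
```

Keeping each op as an isolated transform is what makes the ColumnInfo-to-NVT-op mapping in step 2.1 tractable: each ColumnInfo subclass becomes one op.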

Updates to dfencoder

Given an input dataframe, dfencoder currently does two things: 1) sets up the dataframe schema based on the column types and values, and 2) builds the structure of the auto encoder model from that schema. The schema step should be replaced with NVTabular (most likely just the updates to morpheus.utils.column_info), and the model should then be built from the NVT schema.

  1. Move dfencoder from its own repository into Morpheus
    1. Copy the dfencoder source files into a new submodule: morpheus.models.dfencoder
      1. Make a copy of these files, as-is, called morpheus.models.dfencoder_old
        1. This is to allow side-by-side testing. They will be removed at the end.
    2. Deprecate the existing repository
  2. Update DFEncoder data ingestion and ETL to use NVTabular.
    1. Update dfencoder.AutoEncoder.init_features()
      1. init_features takes a sample dataframe and determines the schema from the DataFrame's column types
      2. Columns are put into 1 of 3 buckets: categorical, numerical and binary
      3. This should be replaced to use NVTabular's schema and improve on the number of column types available
      4. Add the ability to dfencoder.AutoEncoder to manually override the DataFrame schema (i.e. supply specific operations or an entire schema, bypassing init_features altogether)
    2. Update dfencoder.AutoEncoder.build_model()
      1. build_model takes the determined schema from a sample DataFrame and builds the auto encoder model from this schema.
      2. This requires mapping the schema to specific PyTorch functions before concatenating everything together for the core auto encoder layers.
      3. We would need to update the current code to map from NVTabular schema instead of the current system
      4. We should look at other models in Merlin for examples on how to do this
    3. Update dfencoder.AutoEncoder.prepare_df()
      1. prepare_df takes an input DataFrame and runs it through the preprocessor to get the final input before passing it to PyTorch
      2. This will need to be replaced with a nvt.Workflow that takes the schema determined from the init_features function
      3. We should look at speeding this up with parallelization or sharding across GPUs if needed.
  3. Update DFEncoder training loop to use NVTabular
    1. Update dfencoder.AutoEncoder.fit()
      1. fit currently calls build_model on first use, then prepare_df, before finally running a simple PyTorch training loop over batches. This has a few downsides: 1) while the training loop is batched, the ETL and validation steps are not, and 2) the loop is poorly implemented, which makes extending or improving it difficult.
      2. The loop needs to use data loaders to allow for batched processing of the ETL, training and validation parts to reduce memory consumption.
      3. Reference the NVTabular documentation for running a training loop with a data loader. This should be the foundation of the dfencoder training loop
  4. Add tests to DFEncoder
    1. Start with sample tests representative of the DFP type workloads (i.e. use DFP sample data) that run using the existing DFEncoder in the separate repo
    2. Run all those same tests using the new implementation, validating the output against the old code
    3. Add new tests for any additional features that were added above and beyond due to NVT
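The bucketing that init_features performs today (step 2.1) can be sketched in a few lines of pandas. The function name and the exact cardinality rule here are illustrative assumptions, not the dfencoder implementation; the point is the three-way split that an NVT schema would replace with richer column types:

```python
import pandas as pd

# Hypothetical sketch of init_features-style bucketing: columns are assigned
# to one of three groups based on dtype and cardinality.
def infer_feature_types(df: pd.DataFrame) -> dict:
    buckets = {"binary": [], "numeric": [], "categorical": []}
    for col in df.columns:
        series = df[col]
        if series.nunique(dropna=True) <= 2:          # assumed rule: <=2 values => binary
            buckets["binary"].append(col)
        elif pd.api.types.is_numeric_dtype(series):   # numeric dtype => numeric
            buckets["numeric"].append(col)
        else:                                         # everything else => categorical
            buckets["categorical"].append(col)
    return buckets

df = pd.DataFrame({
    "bytes_sent": [10.5, 200.0, 3.2],
    "app_name": ["ssh", "http", "dns"],
    "success": [True, False, True],
})
buckets = infer_feature_types(df)
```

An NVT-backed schema would make these buckets explicit and extensible rather than inferred from a sample DataFrame, which is what enables the manual-override path in step 2.1.4.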
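The batched loop described in step 3 can be sketched as a skeleton in pure Python. iter_batches and the train_step callback are illustrative stand-ins for an NVTabular/PyTorch data loader and the real optimizer step; the shape to note is that ETL, training, and validation all operate per batch instead of on the full DataFrame:

```python
# Hypothetical skeleton of the data-loader-driven fit() loop.
def iter_batches(rows, batch_size):
    # stand-in for an NVTabular/PyTorch data loader
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def fit(rows, batch_size, train_step):
    losses = []
    for batch in iter_batches(rows, batch_size):
        prepared = [r * 2 for r in batch]    # stand-in for per-batch ETL (prepare_df)
        losses.append(train_step(prepared))  # stand-in for forward/backward pass
    return losses

losses = fit(list(range(10)), batch_size=4, train_step=lambda b: sum(b))
```

Because nothing outside the current batch is materialized, peak memory is bounded by the batch size, which is the memory-consumption win called out above.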

Updates to DFP

Once Morpheus and DFEncoder have been updated to use NVTabular, there will need to be updates to the DFP pipelines to match the changes and take advantage of the latest features.

  1. Update the Azure, Duo and other DFP workflows with the proper schema using NVT
    1. Assuming much of the API is backward compatible from the changes to morpheus.utils.column_info, the number of changes should be small
    2. Will likely need to add tags and other metadata that we couldn't do before
  2. Update pipeline to use tagged metadata for determining specific columns.
    1. Currently we have a lot of properties called userid_column_name or datetime_column_name; this info should instead be picked up from the schema metadata
  3. Switch the DFP training loop to use the new DFEncoder classes inside of Morpheus.
    1. Again, assuming the class is largely backward compatible, this should mostly be a change to the import.
  4. Validate the DFP pipelines using the new classes
  5. Update DFP documentation with changes from the current implementation
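Step 2 above can be sketched with a toy schema. The tag names and the find_column_by_tag helper are hypothetical stand-ins for NVT column tags and whatever lookup utility Morpheus ends up exposing; the point is that stages query the schema by tag instead of carrying userid_column_name-style properties:

```python
# Hypothetical schema: column name -> metadata, including a set of tags.
schema = {
    "username": {"tags": {"user_id"}},
    "timestamp": {"tags": {"datetime"}},
    "bytes_sent": {"tags": {"continuous"}},
}

def find_column_by_tag(schema: dict, tag: str) -> str:
    # Resolve a column by its tag; fail loudly on ambiguity or absence.
    matches = [name for name, meta in schema.items() if tag in meta["tags"]]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one column tagged {tag!r}, got {matches}")
    return matches[0]

userid_col = find_column_by_tag(schema, "user_id")
```

With this pattern, renaming a source column only requires updating the schema, not every pipeline stage that previously hard-coded the property.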

Additional Features

Multi-GPU

Once a baseline version of the DFEncoder model has been created and validated, we could make a new class MultiGpuAutoEncoder, which derives from AutoEncoder but runs the training loop assuming multiple GPUs are available.

  1. Create MultiGpuAutoEncoder deriving from AutoEncoder
  2. Override the base functions as necessary to perform the training across multiple GPUs.
    1. This may involve overriding one or more functions depending on how multi-GPU training works in PyTorch
    2. Additional functionality may need to be pulled out of the base training loop so it can be used by the derived class as well
  3. Add multi GPU training tests
    1. TBD on how this would be run in CI
  4. Update documentation with examples on how to train using multi-GPU
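The class layout proposed above can be sketched without any GPU code. All class and method names are illustrative, not the dfencoder API; place_batch is a hypothetical hook showing the kind of functionality step 2.2 would pull out of the base training loop so the derived class can override it:

```python
# Hypothetical sketch: the derived class changes batch placement, not the loop.
class AutoEncoder:
    def place_batch(self, batch, batch_idx):
        return ("gpu:0", batch)  # single-GPU baseline

    def fit(self, batches):
        # shared training loop, reused as-is by derived classes
        return [self.place_batch(b, i) for i, b in enumerate(batches)]

class MultiGpuAutoEncoder(AutoEncoder):
    def __init__(self, num_gpus):
        self.num_gpus = num_gpus

    def place_batch(self, batch, batch_idx):
        # round-robin batches across devices, data-parallel style
        return (f"gpu:{batch_idx % self.num_gpus}", batch)

placed = MultiGpuAutoEncoder(num_gpus=2).fit([[1], [2], [3]])
```

In practice the override would wrap the model and loader with PyTorch's multi-GPU machinery rather than tagging batches, but the inheritance shape is the same.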

WAG estimate on LOE: 2-3 weeks.

mdemoret-nv commented 1 year ago

Moving tracking issue to next release

mdemoret-nv commented 9 months ago

Deprioritizing