rysavy-ondrej / ethanol

An experimental environment for context-based flow artifact analysis.

Designing a Mechanism for Continuous Learning of Malware models using LightGBM #31

Open rysavy-ondrej opened 9 months ago

rysavy-ondrej commented 9 months ago

The objective is to devise a system for continuous learning that focuses on updating and refining a model with new data. This system will be applied to selected contexts in the following manner:

Use the normal context data and label it as "benign". Also prepare "Infected Context" as follows:

These newly created contexts will serve as inputs for continuous learning.

LightGBM Model Update Methods

LightGBM offers multiple options to update the model with new data, each with its own characteristics and implications:

1. Booster.refit()
   Function: Re-estimates the leaf values of the existing trees on the new data, so the model is refined without increasing the number of trees or the size of the model definition.
   Considerations: May cause significant changes in the model's predictions, particularly if the new data batch is much smaller than the original training data or if the target distribution varies greatly.

2. Booster.update()
   Function: Provides a straightforward interface for model updating; each call runs one additional boosting iteration.
   Considerations: A single iteration might not fully integrate the new data into the model, especially with shallow trees (e.g., num_leaves=7) and a small learning rate. Newly arrived data, even if very different from the original training data, might have a limited impact on the model's predictions.

3. train(init_model=previous_model)
   Function: Continues training from an existing model and offers the most flexibility and power in terms of model updating.
   Parameters to consider:
   - num_iterations: Number of iterations for training with new data.
   - learning_rate: Determines how quickly the model learns.
   Balancing impact: Lower values for num_iterations and learning_rate reduce the impact of new data on the trained model. Higher values allow for more significant changes to the model.

Summary

Each method of updating the LightGBM model with new data has unique advantages and limitations. The choice of method should be aligned with the specific requirements of your continuous learning framework and the nature of the data being integrated.