Open vanderryan opened 2 years ago
Description of the plan of your feature engineering process: DNS queries are an important part of network flows, but the DNS variables in the dataset are ambiguous. I want to keep the DNS information while getting rid of the ambiguity. The Protocol variable is also unclear, but it almost certainly encodes UDP and TCP.
Rationale of your choice of feature: Mostly stated above: I want to keep the DNS information while reducing ambiguity, and do the same for the Protocol variable.
Plan for unit testing after introducing this feature: Unit tests should ensure that only the correct protocols are kept and that the DNS variables are reduced to a single variable. Tests should also verify that every feature intended to be dropped was actually dropped. Finally, tests should verify that the pipeline is implemented correctly, i.e. categorical and numeric types are handled correctly and null values are imputed.
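A minimal sketch of what these checks could look like, written against plain Python dicts rather than the real pipeline. The helper `check_engineered_rows` and the dropped column names `DNS_QUERY_ID` and `DNS_QUERY_TYPE` are hypothetical placeholders, and the allowed protocol values assume the PROTOCOL column holds IANA protocol numbers (6 for TCP, 17 for UDP).

```python
# Assumption: PROTOCOL holds IANA protocol numbers (6 = TCP, 17 = UDP).
ALLOWED_PROTOCOLS = {6, 17}

# Hypothetical names for the raw DNS columns that should be dropped.
DROPPED_COLUMNS = {"DNS_QUERY_ID", "DNS_QUERY_TYPE"}


def check_engineered_rows(rows):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for i, row in enumerate(rows):
        # Only TCP and UDP flows should survive the filtering step.
        if row.get("PROTOCOL") not in ALLOWED_PROTOCOLS:
            problems.append(f"row {i}: unexpected protocol {row.get('PROTOCOL')!r}")
        # The condensed DNS variable must be exactly "Y" or "N".
        if row.get("DNS") not in ("Y", "N"):
            problems.append(f"row {i}: DNS must be 'Y' or 'N'")
        # Columns that were meant to be dropped must not reappear.
        leftover = sorted(DROPPED_COLUMNS.intersection(row))
        if leftover:
            problems.append(f"row {i}: columns not dropped: {leftover}")
        # After imputation no value should still be null.
        if any(value is None for value in row.values()):
            problems.append(f"row {i}: null value remains after imputation")
    return problems
```

In a real test suite each check would probably be its own test case; returning a list of problems just keeps the sketch short and makes failures easy to read.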
Description of the variable you plan to perform feature engineering on: The PROTOCOL variable was found to hold protocol numbers, where 6 = TCP and 17 = UDP (the standard IANA assignments); I am keeping just these two. The DNS variables are ambiguous, so I am condensing them down to a single variable, DNS = Y or N.
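The transformation described above could be sketched as follows, using plain dicts in place of the real dataset. The function names and the raw DNS column names (`DNS_QUERY_ID`, `DNS_QUERY_TYPE`) are hypothetical, and the rule used to decide DNS = Y (any raw DNS column is non-null and non-zero) is an assumption, not something the dataset documentation confirms.

```python
# IANA protocol numbers: 6 = TCP, 17 = UDP. Rows with other values are dropped.
KEPT_PROTOCOLS = {6, 17}

# Hypothetical names for the ambiguous raw DNS columns; replace with the real ones.
DNS_COLUMNS = ["DNS_QUERY_ID", "DNS_QUERY_TYPE"]


def engineer_row(row):
    """Return the engineered row, or None if the flow is neither TCP nor UDP."""
    if row.get("PROTOCOL") not in KEPT_PROTOCOLS:
        return None
    # Assumption: a flow counts as DNS when any raw DNS column is non-null and non-zero.
    has_dns = any(row.get(col) not in (None, 0) for col in DNS_COLUMNS)
    # Drop the raw DNS columns and add the single condensed flag.
    engineered = {key: value for key, value in row.items() if key not in DNS_COLUMNS}
    engineered["DNS"] = "Y" if has_dns else "N"
    return engineered


def engineer_rows(rows):
    """Apply engineer_row to every row and discard the filtered-out flows."""
    engineered = (engineer_row(row) for row in rows)
    return [row for row in engineered if row is not None]
```

For example, a TCP row with a non-zero `DNS_QUERY_ID` would come out with DNS = "Y" and without the raw DNS columns, while an ICMP row (protocol 1) would be removed entirely.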