This is a follow-up to PR #30. It further modifies the CICIDS interface to make it easier to use in our setup.
A few things were implemented, fixed, or updated:
The ExperimentRunner check for dataset.path worked for CICIDS, but it was a bad idea because it made SynthethicStream raise errors: that class has nothing like a path to a dataset file. The check could have been skipped or handled separately for the SynthStream class using isinstance, but that seems like a "dirty" solution, so the check was removed instead. The check is now made in __init__ of the CICIDS class.
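The idea of moving the path check out of the runner and into the dataset class can be sketched roughly as follows. The class and function names here (FileBackedDataset, InMemoryStream, run_experiment) are hypothetical stand-ins, not the actual CICIDS, SynthStream, or ExperimentRunner code:

```python
from pathlib import Path


class FileBackedDataset:
    """Hypothetical stand-in for a file-backed dataset such as CICIDS."""

    def __init__(self, path):
        self.path = Path(path)
        # The dataset validates its own file, so the runner
        # never needs to know that a path exists at all.
        if not self.path.is_file():
            raise FileNotFoundError(f"Dataset file not found: {self.path}")


class InMemoryStream:
    """Hypothetical stand-in for a synthetic stream: it has no path."""

    def __init__(self, n_samples):
        self.n_samples = n_samples


def run_experiment(dataset):
    # No `dataset.path` access and no isinstance branching here.
    return type(dataset).__name__
```

With this split, a stream without a file passes through the runner untouched, while a file-backed dataset fails early at construction time if its file is missing.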
Using a separate preprocessing script for CICIDS adds another level of ambiguity to experiment results. Selecting and obtaining a dataset should be 'easy' and autonomous; a convention is one thing, but following it in practice is another.
All three methods in the CICIDS preprocessing pipeline (merge, convert, subset) were made private and combined into generate_cicids_file, a single method that produces the dataset file with the specified parameters.
utils.py was added and now provides the get_project_root: Path method. It allows relative navigation inside the project and is used inside CICIDS to provide the default path for the dataset.
All of this logic was moved to a separate component, cicids/preprocessing.py, so it is now hidden inside the source files.
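A minimal sketch of the two ideas above: a project-root helper and a single public entry point wrapping private steps. Everything here is an assumption made for illustration. The marker file (pyproject.toml), the function names find_project_root and generate_dataset_file, and the toy row format are not taken from the actual utils.py or cicids/preprocessing.py:

```python
import csv
from pathlib import Path


def find_project_root(start=None, marker="pyproject.toml"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    current = (Path(start) if start is not None else Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found at or above {current}")


def _merge(parts):
    # Concatenate the per-file row lists into one list.
    return [row for part in parts for row in part]


def _convert(rows):
    # Normalise labels (here: just strip surrounding whitespace).
    return [(feature, label.strip()) for feature, label in rows]


def _subset(rows, n_samples):
    # Keep the first n_samples rows.
    return rows[:n_samples]


def generate_dataset_file(parts, out_path, n_samples):
    """Single public entry point: merge, convert, subset, then write a CSV."""
    rows = _subset(_convert(_merge(parts)), n_samples)
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["feature", "label"])
        writer.writerows(rows)
    return out_path
```

A caller could then build a default location such as `find_project_root() / "data" / "subset.csv"` without hard-coding absolute paths.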
With CICIDS we're using one class with 2 * 10^6 samples defined, but we're actually using different subsets and a special version with collapsed classes (Attempted -> BENIGN). This is a small update, but it might be useful in the future:
dataset.n_samples now shows the real number of samples that were passed to the interface (e.g. 400_000 if we're using a subset).
convert_attempted: bool = False allows selecting the 27-class version with the extended class set. This might come in handy, as we were discussing a possible analysis of whether attempted samples make a difference to model behaviour.
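The effect of the two options can be illustrated with a small sketch. The LabelView class and the "- Attempted" label suffix used below are assumptions for illustration only; the real CICIDS class and its label strings may differ:

```python
class LabelView:
    """Hypothetical container showing n_samples and class collapsing."""

    def __init__(self, labels, convert_attempted=False):
        if convert_attempted:
            # Collapse every "Attempted" variant into BENIGN.
            labels = ["BENIGN" if "Attempted" in label else label for label in labels]
        self.labels = list(labels)
        # Report the number of samples actually loaded, not a fixed constant.
        self.n_samples = len(self.labels)
        self.classes = sorted(set(self.labels))


labels = ["BENIGN", "DoS Hulk", "DoS Hulk - Attempted", "BENIGN"]
extended = LabelView(labels)                           # keeps the extended class set
collapsed = LabelView(labels, convert_attempted=True)  # Attempted -> BENIGN
```

In this toy input, the extended view keeps three distinct classes while the collapsed view keeps two; n_samples is 4 in both cases because collapsing relabels rows and does not drop them.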
Tests were added to make sure there are no mistakes in dataset parsing and that everything fits the defined convention. pytest's tmp_path fixture was used to isolate test results from the original dataset paths.
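As a rough illustration of the isolation pattern, a test using pytest's built-in tmp_path fixture might look like the sketch below. The write_subset helper and the file layout are hypothetical and are not the tests added in this PR:

```python
import csv
from pathlib import Path


def write_subset(rows, out_path):
    """Hypothetical helper: write (feature, label) rows to a CSV file."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return out_path


def test_subset_is_written_inside_tmp_path(tmp_path):
    # tmp_path is a fresh per-test directory provided by pytest,
    # so nothing is written next to the real dataset files.
    out = write_subset([(1, "BENIGN"), (2, "DoS")], tmp_path / "cicids" / "subset.csv")

    assert out.is_file()
    assert tmp_path in out.parents
    with out.open(newline="") as handle:
        assert list(csv.reader(handle)) == [["1", "BENIGN"], ["2", "DoS"]]
```

Because each test receives its own temporary directory, generated files cannot overwrite or be confused with the datasets used for real experiments.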
cicids2017_experiments.py was added to help us collaborate on running CICIDS tests. It checks for the dataset (no preprocessing needed): if the dataset is already present, it skips the creation step and starts the analysis / experiments.
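The "create only if missing" behaviour can be sketched as below. The ensure_dataset function and its create callback are hypothetical; the actual script may structure this differently:

```python
from pathlib import Path


def ensure_dataset(path, create):
    """Return (path, created). Call `create(path)` only if the file is missing."""
    path = Path(path)
    if path.is_file():
        # Dataset already present: skip the creation step.
        return path, False
    path.parent.mkdir(parents=True, exist_ok=True)
    create(path)
    return path, True
```

An experiment script could call this once at start-up and then proceed directly to the analysis, regardless of whether the file had to be generated.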
Examples of current usage can be found in the tests and under /experiments for CICIDS.
The following is the W&B metadata for the attempted dataset. As we can see, it currently shows the extended class set, the number of subset samples, and the subset path matching the convention.