The dataset (data-large.csv) was updated by:
Just want to point out that CatBoost will (almost) always outperform the OCSVM. The core idea behind using the OCSVM is that we initially framed this task as an anomaly detection problem.
In this kind of problem, data belonging to the original class (untampered videos) is easily available, but the other class is effectively infinite and largely unknown; we can only simulate some of the possible attacks.
CatBoost is not only a more powerful algorithm than the OCSVM; the key difference is that it has information from the other class, but only for the kinds of attacks it has been trained on. This might bias the model towards these simulated attacks, and there is no guarantee it will generalize to attacks it has not seen during training.
On the other hand, the OCSVM makes no assumptions about the tampered class, as it only learns how untampered videos behave. Of course, this comes with a trade-off in accuracy.
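To make the distinction concrete, here is a minimal sketch of how the two models consume the training data differently (the feature matrices and hyperparameters are hypothetical placeholders, not the values used in the actual training code):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from catboost import CatBoostClassifier

# Hypothetical feature matrices: rows are renditions, columns are the
# extracted features.
X_untampered = np.random.rand(1000, 10)   # honest renditions
X_tampered = np.random.rand(200, 10)      # simulated attacks

# UL model: the OCSVM is fit on the untampered class only. It learns a
# boundary around "normal" behavior and flags anything outside it,
# regardless of which attack produced it.
ocsvm = OneClassSVM(kernel="rbf", gamma="auto", nu=0.01)
ocsvm.fit(X_untampered)

# SL model: CatBoost is a binary classifier, so it sees both classes --
# but the tampered class only covers the attacks we simulated.
X = np.vstack([X_untampered, X_tampered])
y = np.concatenate([np.zeros(len(X_untampered)), np.ones(len(X_tampered))])
sl_model = CatBoostClassifier(iterations=200, verbose=False)
sl_model.fit(X, y)
```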
Updated the models on GCP and put back the previous meta model logic.
At the moment, we have 3 trained models:
The UL and SL models are combined into a tamper detection meta model using an AND operator.
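A minimal sketch of that combination, assuming the OCSVM follows scikit-learn's convention of predicting -1 for outliers and the CatBoost classifier predicts 1 for tampered (the actual meta model logic lives in the training code linked below):

```python
def meta_predict(ocsvm, catboost_model, features):
    """Flag a rendition as tampered only if BOTH models agree."""
    ul_tampered = ocsvm.predict([features])[0] == -1           # OCSVM outlier
    sl_tampered = catboost_model.predict([features])[0] == 1   # positive class
    return ul_tampered and sl_tampered
```

The AND operator makes the combined model conservative: a rendition is only flagged as tampered when both the anomaly detector and the classifier agree, which should reduce false positives at the cost of possibly missing attacks that only one model catches.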
The trained models are accessible here.
The training code is on the qoe_model_integration branch.
These models are all trained using the same features extracted from the same dataset based on YT videos. The dataset contains renditions transcoded using the ffmpeg CLI, which may not accurately reflect how renditions would be transcoded by LP orchestrators that use LPMS (which is built on top of the ffmpeg libraries). Furthermore, LP orchestrators could transcode renditions using either CPUs or Nvidia GPUs (a rough sketch of the two paths follows the list below).
Other differences with LP orchestrator transcoding include:
There may be other differences that are not mentioned above.
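For illustration, here is a rough sketch of how the CPU and Nvidia GPU encoding paths differ when driving the ffmpeg CLI from Python (the flags are standard ffmpeg options, but LPMS calls the ffmpeg libraries directly, so its defaults and filter chains may differ):

```python
import subprocess

def transcode(src, dst, height, use_gpu=False):
    # h264_nvenc uses Nvidia's hardware encoder; libx264 is the
    # software encoder typically used on the CPU path.
    encoder = "h264_nvenc" if use_gpu else "libx264"
    cmd = [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",  # scale to the target rendition height
        "-c:v", encoder,
        dst,
    ]
    subprocess.run(cmd, check=True)
```

Even with identical flags, the hardware and software encoders produce different bitstreams, which is exactly why the extracted features (and therefore the trained models) could shift.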
Given the possible differences between renditions transcoded using the ffmpeg CLI (as in the dataset the models are currently trained on) and renditions that would be transcoded by LP orchestrators, it could be beneficial to re-train the models using an updated dataset that contains renditions transcoded by LP orchestrators with a variety of options that would be used in production.
We can generate a new feature dataset CSV file by:
We should make sure to re-train all 3 models using the same dataset.
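Purely to illustrate the output format, a sketch of assembling such a CSV might look like the following. `extract_features` is a hypothetical placeholder for the verifier's real feature extraction; the actual generation steps are those referenced above:

```python
import csv

def write_feature_dataset(pairs, out_path):
    """Write one CSV row per (source, rendition, label) triple.

    `extract_features` is hypothetical; `tampered` is 1 for simulated
    attacks and 0 for honest renditions.
    """
    writer = None
    with open(out_path, "w", newline="") as f:
        for source, rendition, tampered in pairs:
            row = dict(extract_features(source, rendition), tampered=tampered)
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=list(row))
                writer.writeheader()
            writer.writerow(row)
```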
One approach we could take is:
Before we do this, we should do #98 to get a sense of how the current trained models perform when verifying renditions transcoded by GPU-enabled LP orchestrators.