livepeer / verification-classifier

Metrics-based Verification Classifier

Re-train models #107

Closed · yondonfu closed this 4 years ago

yondonfu commented 4 years ago

At the moment, we have 3 trained models:

- an unsupervised (UL) tamper detection model (OCSVM)
- a supervised (SL) tamper detection model
- a QoE model

The UL and SL models are combined into a tamper detection meta model using an AND operator.
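As a minimal sketch of that combination (assuming both models emit a boolean tamper flag per rendition; the function and variable names below are illustrative, not the repo's actual API):

```python
import numpy as np

def meta_tamper_prediction(ul_flags: np.ndarray, sl_flags: np.ndarray) -> np.ndarray:
    """Combine the unsupervised (UL) and supervised (SL) tamper flags.

    A rendition is labeled tampered only when BOTH models flag it,
    i.e. a logical AND of the two boolean arrays.
    """
    return np.logical_and(ul_flags, sl_flags)

# Example: only the last rendition is flagged by both models.
ul = np.array([True, False, False, True])
sl = np.array([False, False, True, True])
print(meta_tamper_prediction(ul, sl))  # [False False False  True]
```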

The trained models are accessible here.

The training code is on the qoe_model_integration branch.

These models are all trained using the same features extracted from the same dataset based on YT videos. The dataset contained renditions transcoded using the ffmpeg CLI, which may not accurately reflect how renditions would be transcoded by LP orchestrators that use LPMS (which is built on top of the ffmpeg libraries). Furthermore, LP orchestrators could transcode renditions using either CPUs or Nvidia GPUs.
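For illustration only (the actual dataset generation commands are not shown here, and these encoder flags are just one plausible configuration), the same rendition can be produced with a software encoder on CPU or with NVENC on an Nvidia GPU, and the two outputs are generally not bit-identical:

```python
import subprocess

SRC = "source_1080p.mp4"  # hypothetical input file

# CPU path: software H.264 encoder (libx264) via the plain ffmpeg CLI.
subprocess.run([
    "ffmpeg", "-y", "-i", SRC,
    "-vf", "scale=-2:720", "-c:v", "libx264", "-b:v", "2000k",
    "rendition_720p_cpu.mp4",
], check=True)

# GPU path: Nvidia NVENC encoder; requires an ffmpeg build with NVENC support.
subprocess.run([
    "ffmpeg", "-y", "-i", SRC,
    "-vf", "scale=-2:720", "-c:v", "h264_nvenc", "-b:v", "2000k",
    "rendition_720p_gpu.mp4",
], check=True)
```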

Other differences with LP orchestrator transcoding include:

There may be other differences that are not mentioned above.

Given the possible differences between the ffmpeg-CLI renditions the models are currently trained on and the renditions that LP orchestrators would transcode, it could be beneficial to re-train the models on an updated dataset containing renditions transcoded by LP orchestrators with the variety of options that would be used in production.

We can generate a new feature dataset CSV file by:

We should make sure to re-train all 3 models using the same dataset.
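The exact generation steps are not listed above; as a rough, hypothetical sketch of the overall shape (the `extract_features` helper, file paths, and column names are all placeholders, not the repo's real code):

```python
import csv

def extract_features(source_path: str, rendition_path: str) -> dict:
    """Hypothetical stand-in for the repo's metric extraction step;
    real rows would hold the verifier's video metrics."""
    return {"source": source_path, "rendition": rendition_path,
            "feature_1": 0.0, "feature_2": 0.0}

# Hypothetical (source, rendition) pairs produced by LP orchestrator transcodes.
pairs = [("source_1.mp4", "rendition_1_720p.mp4")]

# Append one feature row per pair to the dataset CSV.
with open("data-large.csv", "a", newline="") as f:
    writer = None
    for source, rendition in pairs:
        row = extract_features(source, rendition)
        if writer is None:
            writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        writer.writerow(row)
```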

One approach we could take is:

  1. Create a CSV file with a few hundred rows
  2. Train the models until the OCSVM achieves a 95% TPR in training (a sketch of this check follows the list)
  3. Add more rows to the CSV file
  4. Train the models with the updated CSV file
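A minimal sketch of the stopping check in step 2, using sklearn's OneClassSVM as a stand-in for the UL model (the hyper-parameter grid and all names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def train_until_tpr(X_untampered: np.ndarray, target_tpr: float = 0.95) -> OneClassSVM:
    """Search nu/gamma until the OCSVM keeps >= target_tpr of the
    untampered training rows inside the boundary (training TPR)."""
    X = StandardScaler().fit_transform(X_untampered)
    for nu in (0.01, 0.05, 0.1):
        for gamma in ("scale", 0.1, 0.01):
            model = OneClassSVM(nu=nu, gamma=gamma).fit(X)
            tpr = np.mean(model.predict(X) == 1)  # +1 == inlier (untampered)
            if tpr >= target_tpr:
                return model
    raise RuntimeError("no hyper-parameters reached the target training TPR")
```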

Before we do this we should do #98 to get a sense of how the current trained models perform when verifying renditions transcoded by GPU enabled LP orchestrators.

cyberj0g commented 4 years ago

The dataset (data-large.csv) was updated by:

  1. Removing bitrate-adjusted renditions
  2. Adding 5397 framerate-adjusted renditions with 24, 30, and 60 target FPS (excluding the rendition's source FPS), generated from randomly sampled renditions stratified by rendition FPS

All models were trained and evaluated on a randomly sampled 15% test split, keeping renditions of the same source video in separate folds. Of all the models, the CatBoost classifier performed best in terms of recall and precision for the Tamper class. The higher precision and recall for the Tamper class, which was the goal of using the OCSVM model in combination with the CatBoost classifier, are achieved with a single CatBoost model trained on all features by setting a threshold of 0.9. The code is on this branch. See the notebook for training and benchmark code. Appreciate any feedback.
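A hedged sketch of that evaluation setup, using sklearn's GroupShuffleSplit so all renditions of a source video land in the same split (the column names `tamper` and `source_id` and the rest of the scaffolding are assumptions; only the 15% test size and the 0.9 threshold come from the comment above):

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("data-large.csv")            # feature dataset described above
X = df.drop(columns=["tamper", "source_id"])  # hypothetical column names
y = df["tamper"]                              # 1 == Tamper class (assumed)
groups = df["source_id"]                      # one id per source video (assumed)

# 15% test split that never puts a source video's renditions in both splits.
train_idx, test_idx = next(
    GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
    .split(X, y, groups)
)

model = CatBoostClassifier(verbose=False)
model.fit(X.iloc[train_idx], y.iloc[train_idx])

# Predict Tamper only when its probability exceeds the 0.9 threshold,
# trading some recall for the higher Tamper precision described above.
proba = model.predict_proba(X.iloc[test_idx])[:, 1]
pred = (proba > 0.9).astype(int)
```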
Sorkanius commented 4 years ago

Just want to point out that CatBoost will (almost) always outperform the OCSVM. The core idea of using the OCSVM is that we initially approached this task as an anomaly detection problem.

In this kind of problem, data belonging to the original class (untampered videos) is easily available, but the other class is effectively infinite and almost unknown: we can only simulate some of the possible attacks.

Not only is CatBoost a more powerful algorithm than the OCSVM, but the main difference is that it has information from the other class, though only for the kinds of attacks it has been trained with. This might bias the model towards these simulated attacks, and there is no guarantee it generalizes to all the possible attacks it has not been trained with.

On the other hand, the OCSVM makes no assumptions about the tampered class, as it only learns how untampered videos behave. Of course, this comes with a trade-off in accuracy.
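To make that distinction concrete, here is a toy sketch with sklearn's OneClassSVM trained only on (dummy) untampered features:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_untampered = rng.normal(0.0, 1.0, size=(500, 8))  # dummy "normal" features

# The OCSVM sees ONLY untampered renditions at training time ...
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(X_untampered)

# ... yet can still flag an unseen attack type as an outlier (-1),
# because anything far from the learned "normal" region is rejected.
X_new_attack = rng.normal(5.0, 1.0, size=(3, 8))    # dummy novel attack
print(ocsvm.predict(X_new_attack))                  # -> [-1 -1 -1]
```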

cyberj0g commented 4 years ago

Updated the models on GCP and put back the previous meta model logic.