livepeer / verification-classifier

Metrics-based Verification Classifier

Re-train models #107

Closed · yondonfu closed this 4 years ago

yondonfu commented 4 years ago

At the moment, we have 3 trained models:

- an unsupervised (UL) tamper detection model (OCSVM)
- a supervised (SL) tamper detection model
- a QoE model

The UL and SL models are combined into a tamper detection meta model using an AND operator.
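As a minimal sketch of that combination (assuming both models emit a boolean tamper flag per rendition; the function and variable names below are illustrative, not the repo's actual API):

```python
import numpy as np

def meta_tamper_prediction(ul_flags: np.ndarray, sl_flags: np.ndarray) -> np.ndarray:
    """Combine the unsupervised (UL) and supervised (SL) tamper flags.

    A rendition is labeled tampered only when BOTH models flag it,
    i.e. a logical AND of the two boolean arrays.
    """
    return np.logical_and(ul_flags, sl_flags)

# Example: only the last rendition is flagged by both models.
ul = np.array([True, False, False, True])
sl = np.array([False, False, True, True])
print(meta_tamper_prediction(ul, sl))  # [False False False  True]
```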

The trained models are accessible here.

The training code is on the qoe_model_integration branch.

These models are all trained using the same features extracted from the same dataset based on YT videos. The dataset contained renditions transcoded using the ffmpeg CLI, which may not accurately reflect how renditions would be transcoded by LP orchestrators that use LPMS (which is built on top of the ffmpeg libraries). Furthermore, LP orchestrators could transcode renditions using either CPUs or Nvidia GPUs.
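For illustration only (the actual dataset generation commands are not shown here, and these encoder flags are just one plausible configuration), the same rendition can be produced with a software encoder on CPU or with NVENC on an Nvidia GPU, and the two outputs are generally not bit-identical:

```python
import subprocess

SRC = "source_1080p.mp4"  # hypothetical input file

# CPU path: software H.264 encoder (libx264) via the plain ffmpeg CLI.
subprocess.run([
    "ffmpeg", "-y", "-i", SRC,
    "-vf", "scale=-2:720", "-c:v", "libx264", "-b:v", "2000k",
    "rendition_720p_cpu.mp4",
], check=True)

# GPU path: Nvidia NVENC encoder; requires an ffmpeg build with NVENC support.
subprocess.run([
    "ffmpeg", "-y", "-i", SRC,
    "-vf", "scale=-2:720", "-c:v", "h264_nvenc", "-b:v", "2000k",
    "rendition_720p_gpu.mp4",
], check=True)
```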

Other differences with LP orchestrator transcoding include:

There may be other differences that are not mentioned above.

Given the possible differences between the ffmpeg-CLI renditions the models are currently trained on and the renditions that LP orchestrators would transcode, it could be beneficial to re-train the models on an updated dataset containing renditions transcoded by LP orchestrators with the variety of options that would be used in production.

We can generate a new feature dataset CSV file by:

We should make sure to re-train all 3 models using the same dataset.
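The exact generation steps are not listed above; as a rough, hypothetical sketch of the overall shape (the `extract_features` helper, file paths, and column names are all placeholders, not the repo's real code):

```python
import csv

def extract_features(source_path: str, rendition_path: str) -> dict:
    """Hypothetical stand-in for the repo's metric extraction step;
    real rows would hold the verifier's video metrics."""
    return {"source": source_path, "rendition": rendition_path,
            "feature_1": 0.0, "feature_2": 0.0}

# Hypothetical (source, rendition) pairs produced by LP orchestrator transcodes.
pairs = [("source_1.mp4", "rendition_1_720p.mp4")]

# Append one feature row per pair to the dataset CSV.
with open("data-large.csv", "a", newline="") as f:
    writer = None
    for source, rendition in pairs:
        row = extract_features(source, rendition)
        if writer is None:
            writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        writer.writerow(row)
```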

One approach we could take is:

  1. Create a CSV file with a few hundred rows
  2. Train the models until the OCSVM achieves a 95% TPR in training (a sketch of this check follows the list)
  3. Add more rows to the CSV file
  4. Train the models with the updated CSV file
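A minimal sketch of the stopping check in step 2, using sklearn's OneClassSVM as a stand-in for the UL model (the hyper-parameter grid and all names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def train_until_tpr(X_untampered: np.ndarray, target_tpr: float = 0.95) -> OneClassSVM:
    """Search nu/gamma until the OCSVM keeps >= target_tpr of the
    untampered training rows inside the boundary (training TPR)."""
    X = StandardScaler().fit_transform(X_untampered)
    for nu in (0.01, 0.05, 0.1):
        for gamma in ("scale", 0.1, 0.01):
            model = OneClassSVM(nu=nu, gamma=gamma).fit(X)
            tpr = np.mean(model.predict(X) == 1)  # +1 == inlier (untampered)
            if tpr >= target_tpr:
                return model
    raise RuntimeError("no hyper-parameters reached the target training TPR")
```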

Before we do this we should do #98 to get a sense of how the current trained models perform when verifying renditions transcoded by GPU enabled LP orchestrators.

cyberj0g commented 4 years ago

The dataset (data-large.csv) was updated by:

  1. Removing bitrate-adjusted renditions
  2. Adding 5397 framerate-adjusted renditions with 24, 30, and 60 target FPS (excluding the rendition's source FPS), generated from randomly sampled renditions stratified by rendition FPS

All models were trained and evaluated on a randomly sampled 15% test split, keeping renditions of the same source video in separate folds. Of all the models, the CatBoost classifier performed best in terms of recall and precision for the Tamper class. The higher precision and recall for the Tamper class, which was the goal of using the OCSVM model in combination with the CatBoost classifier, are achieved with a single CatBoost model trained on all features by setting a threshold of 0.9. The code is on this branch. See the notebook for training and benchmark code. Appreciate any feedback.
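A hedged sketch of that evaluation setup, using sklearn's GroupShuffleSplit so all renditions of a source video land in the same split (the column names `tamper` and `source_id` and the rest of the scaffolding are assumptions; only the 15% test size and the 0.9 threshold come from the comment above):

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("data-large.csv")            # feature dataset described above
X = df.drop(columns=["tamper", "source_id"])  # hypothetical column names
y = df["tamper"]                              # 1 == Tamper class (assumed)
groups = df["source_id"]                      # one id per source video (assumed)

# 15% test split that never puts a source video's renditions in both splits.
train_idx, test_idx = next(
    GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
    .split(X, y, groups)
)

model = CatBoostClassifier(verbose=False)
model.fit(X.iloc[train_idx], y.iloc[train_idx])

# Predict Tamper only when its probability exceeds the 0.9 threshold,
# trading some recall for the higher Tamper precision described above.
proba = model.predict_proba(X.iloc[test_idx])[:, 1]
pred = (proba > 0.9).astype(int)
```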
Sorkanius commented 4 years ago

Just want to point out that CatBoost will (almost) always outperform the OCSVM. The core idea of using the OCSVM is that we initially approached this task as an anomaly detection problem.

In this kind of problem, data belonging to the original class (untampered videos) is easily available, but the other class is effectively infinite and almost unknown: we can only simulate some of the possible attacks.

Not only is CatBoost a more powerful algorithm than the OCSVM, but the main difference is that it has information from the other class, though only for the kinds of attacks it has been trained with. This might bias the model towards these simulated attacks, and there is no guarantee it generalizes to all the possible attacks it has not been trained with.

On the other hand, the OCSVM makes no assumptions about the tampered class, as it only learns how untampered videos behave. Of course, this comes with a trade-off in accuracy.
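To make that distinction concrete, here is a toy sketch with sklearn's OneClassSVM trained only on (dummy) untampered features:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_untampered = rng.normal(0.0, 1.0, size=(500, 8))  # dummy "normal" features

# The OCSVM sees ONLY untampered renditions at training time ...
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(X_untampered)

# ... yet can still flag an unseen attack type as an outlier (-1),
# because anything far from the learned "normal" region is rejected.
X_new_attack = rng.normal(5.0, 1.0, size=(3, 8))    # dummy novel attack
print(ocsvm.predict(X_new_attack))                  # -> [-1 -1 -1]
```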

cyberj0g commented 4 years ago

Updated the models on GCP and put back the previous meta model logic.