livepeer / verification-classifier

Metrics-based Verification Classifier
MIT License

Low Success Rate #84

Closed j0sh closed 4 years ago

j0sh commented 4 years ago

While testing the go-livepeer integration with the classifier, I'm consistently seeing success rates (tamper > 0) of less than 30% when trying to verify a normally transcoded video.

This is some data from a single run of big buck bunny, with ~2s segments and transcoding a 720p source to 240p and 360p. There are 288 segments, so 576 tamper results for 2 renditions.
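
The success rate quoted here is just the fraction of results with tamper > 0. A minimal sketch of that computation (the function name is mine, not from the repo):

```python
def success_rate(tamper_scores):
    """Fraction of results the verifier accepts, i.e. tamper > 0."""
    passed = sum(1 for t in tamper_scores if t > 0)
    return passed / len(tamper_scores)

# e.g. 160 passing results out of 576 would be a ~28% success rate
print(success_rate([0.9] * 160 + [-3.6] * 416))
```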

If needed, I can make a script that readies some data for further testing.

Failures seem to be especially clustered around the smallest value, -3.598424. Here are the 10 lowest tamper values:

    262 -3.598424
      5 -3.598423
      2 -3.598422
      1 -3.598419
      1 -3.598415
      1 -3.598408
      1 -3.598404
      1 -3.598377
      1 -3.598356
      1 -3.598331
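
Tallies like the ones above (count, then value) can be reproduced from a raw list of scores with a frequency count; a quick sketch:

```python
from collections import Counter

def lowest_tallies(scores, n=10):
    """Return (count, value) pairs for the n smallest distinct scores."""
    counts = Counter(round(s, 6) for s in scores)
    return [(c, v) for v, c in sorted(counts.items())[:n]]

print(lowest_tallies([-3.598424, -3.598424, -3.598423, 0.871325]))
```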

The high (non-tampered) scores show more spread; in fact, every tamper > 0 value is unique. Here are the highest 10 tamper values:

      1 0.871325
      1 0.894668
      1 0.945634
      1 0.948086
      1 0.950186
      1 0.959428
      1 0.961007
      1 0.991895
      1 1.007261
      1 1.0253

And here is the distribution of results for a single run of BBB with two renditions (720p -> 240p/360p).

[image: histogram of tamper score distribution for the run]

ndujar commented 4 years ago

Can you repeat the results with a source at 1080p instead of 720p? I believe the low success rate may originate from the fact that the current model is trained on comparisons between a 1080p source and its renditions. It is likely misbehaving when the source resolution is lower, since the comparison metrics then show more distortion. If this is the case, we will probably need to train different models for different source resolutions.

The lower bound on the negative side of the verification scores is expected, and can be explained as a side effect of the StandardScaler applied to the input features. The image below shows the same kind of behavior for a subsample of our data set: [image: score distribution for a data-set subsample]
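
One way such a hard floor can arise (a sketch under an assumed pipeline, not the repo's actual code): if features are clipped to the training range before standardization, every out-of-range segment collapses onto the same scaled vector, and therefore onto the same score:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
scaler = StandardScaler().fit(X_train)

# hypothetical clipping step: bound features to the training range
lo, hi = X_train.min(axis=0), X_train.max(axis=0)

def score(x, w=np.array([1.0, 1.0, 1.0])):
    z = scaler.transform([np.clip(x, lo, hi)])[0]
    return float(z @ w)  # stand-in for the model's decision function

# two very different out-of-range inputs clip to the same boundary
# vector, so they receive the exact same floor score
s1 = score(np.array([-50.0, -50.0, -50.0]))
s2 = score(np.array([-99.0, -60.0, -75.0]))
print(s1 == s2)
```

This would be consistent with 262 segments landing on exactly -3.598424 while the passing scores are all distinct.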

j0sh commented 4 years ago

Can you repeat the results with a source at 1080p, instead of 720p?

That helped, the success rate with a 1080p source is 97% now. Thanks for the explanation of the lower bound!

train different models for different source resolutions

Would it be totally crazy to internally rescale the source to match the model? I'm not sure what the effect of rescaling with a good scaler would be, especially when upsizing.

ndujar commented 4 years ago

Would it be totally crazy to internally rescale the source to match the model? Not sure what the effect of rescaling with a good scaler, especially upsizing.

I'm not sure; it might work. But from a practical standpoint, I'm not sure what the consequences are for the broadcaster / verifier, as it means some computational overhead due to upscaling. I am currently researching the possibility of rescaling the metrics according to a factor dependent on the input's resolution. In the image below, different (well-encoded) renditions seem to display a pattern of proportionality: [image: metric proportionality across renditions]. This could be exploited to rescale the features depending on the input video size.
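
If that proportionality holds, the rescaling ndujar describes could look roughly like this sketch (the constants, names, and linear form are hypothetical, not from the repo):

```python
REFERENCE_HEIGHT = 1080  # height the current model was trained against

def rescale_metric(value, source_height, slope=1.0):
    """Map a metric computed against a lower-resolution source toward
    the value expected at the reference resolution, assuming the
    metric scales roughly linearly with the resolution ratio."""
    return value * slope * (REFERENCE_HEIGHT / source_height)

print(rescale_metric(2.0, 720))  # 1080/720 = 1.5x, so 3.0
```

In practice the slope would have to be fit per metric from data like the plot ndujar references.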

yondonfu commented 4 years ago

Summarizing the current state of this issue for the record.

The first and fastest solution is probably to train a separate model for each of a set of common resolutions (e.g. 720p, 480p, 360p). The downsides of this solution are:

While we definitely want to address these downsides, this solution can be used in the interim while we look into additional areas of investigation.

Additional areas of investigation include:

  1. Look into whether it is possible to rescale metrics according to a factor dependent on the input's resolution. This may be the ideal solution if it works, since we wouldn't have to maintain multiple trained models for multiple input resolutions.
  2. Look into upscaling the input video to 1080p (if the resolution is not already 1080p). The main things to be wary of here are the impact on accuracy due to artifacts introduced by upscaling, and the performance impact of the verifier doing the upscaling.
  3. Look into downscaling the input video to a resolution that we have a trained model for. For example, suppose we have a trained model for 480p but not for 576p; given a 576p input video, we would downscale it to 480p. The concerns are the same as with upscaling: accuracy impact from scaling artifacts and the performance cost of the verifier doing the downscaling. Downscaling may introduce fewer artifacts than upscaling, but it can still add noise.
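
The model-selection step in option 3 could be sketched as follows (resolution list and function name are hypothetical, chosen to match the 576p -> 480p example above):

```python
# heights we assume trained models exist for
TRAINED_RESOLUTIONS = [1080, 720, 480, 360]

def target_resolution(source_height):
    """Return the largest trained resolution not above the source, so
    the source is only ever downscaled; fall back to the smallest
    trained resolution if the source is below all of them."""
    candidates = [h for h in TRAINED_RESOLUTIONS if h <= source_height]
    return max(candidates) if candidates else min(TRAINED_RESOLUTIONS)

print(target_resolution(576))  # 480, per the example in the thread
```
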

yondonfu commented 4 years ago

Up until now, we have mainly been measuring the TPR/TNR/F20 scores for the verifier by submitting requests directly to the verifier. In some initial tests using the verifier integrated with the broadcaster, we have observed worse than expected results. The results of these tests are strange for a few reasons:

  1. The 720p tamper model exhibits a much lower TPR than the 1080p tamper model (our results when directly submitting requests to the verifier seem to imply that the 720p and 1080p tamper model should produce similar results)

  2. The 1080p tamper model exhibits a lower than expected TPR compared to our results when directly submitting requests to the verifier

  3. The 1080p tamper model exhibits a lower TPR when verifying 240p, 360p, and 720p renditions than when verifying only 240p and 360p renditions
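
For reference, the TPR/TNR figures discussed above reduce to simple counts over labeled results; a hedged sketch (the repo's actual evaluation code may differ), reading a score <= 0 as a "tampered" prediction to match the tamper > 0 pass convention used earlier in the thread:

```python
def tpr_tnr(results):
    """results: iterable of (tamper_score, truly_tampered) pairs.
    A score <= 0 counts as a 'tampered' prediction."""
    tp = sum(1 for s, y in results if s <= 0 and y)
    fn = sum(1 for s, y in results if s > 0 and y)
    tn = sum(1 for s, y in results if s > 0 and not y)
    fp = sum(1 for s, y in results if s <= 0 and not y)
    return tp / (tp + fn), tn / (tn + fp)

# 3 of 4 tampered segments caught, 1 of 2 untampered segments passed
print(tpr_tnr([(-3.6, True), (-3.6, True), (-3.6, True), (0.9, True),
               (0.9, False), (-3.6, False)]))  # (0.75, 0.5)
```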

We can start to investigate the cause of these strange results by acquiring sample data when running the verifier integrated with the broadcaster. Some requirements for doing this are:

Some candidates to investigate as the cause of these strange results:

ndujar commented 4 years ago

We could close this issue, as it is followed up in https://github.com/livepeer/verification-classifier/issues/88 and https://github.com/livepeer/verification-classifier/issues/93

yondonfu commented 4 years ago

Tracking the framerate issue in #93 and tracking the test with GPU transcoding in #98 so closing this.