Closed: j0sh closed this issue 4 years ago
Can you repeat the results with a source at 1080p instead of 720p? I believe the low success rate might originate from the fact that the current model is trained using comparisons between 1080p sources and their renditions. It is likely misbehaving when the source resolution is lower, hence the comparison metrics show more distortion. If this is the case, we will probably need to train different models for different source resolutions.
The lower boundary on the negative side of the verification is expected, and can be explained as a side effect of the Standard Scaler applied to the input features. The image below shows the same kind of behavior for a subsample of our data set:
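For reference, a minimal sketch of the standardization step being referred to; the feature names and values are illustrative only, not taken from our data set:

```python
# Input features are standardized (zero mean, unit variance) before reaching
# the model. Segments whose raw metrics saturate at the same extreme end up
# with identical scaled vectors, which is one way a hard floor on the output
# score can appear. Values below are made up for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler

raw_features = np.array([
    [0.91, 35.2, 0.004],   # e.g. similarity, PSNR-like and temporal metrics
    [0.88, 33.1, 0.006],
    [0.95, 38.4, 0.002],
    [0.10, 12.0, 0.090],   # heavily distorted segments pile up at the extreme...
    [0.10, 12.0, 0.090],   # ...and become indistinguishable after scaling
])

scaler = StandardScaler().fit(raw_features)
print(scaler.transform(raw_features))  # the last two rows map to the same scaled vector
```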
Can you repeat the results with a source at 1080p, instead of 720p?
That helped; the success rate with a 1080p source is now 97%. Thanks for the explanation of the lower bound!
train different models for different source resolutions
Would it be totally crazy to internally rescale the source to match the model? I'm not sure what the effect of rescaling with a good scaler would be, especially upsizing.
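A rough sketch of what that could look like, assuming the source segment is upsampled to 1080p with ffmpeg before feature extraction; the invocation and the integration point are guesses, not how the verifier currently works:

```python
# Hypothetical pre-step: upscale the source to the resolution the model was
# trained on, so that source/rendition comparisons happen at 1080p.
import subprocess

def upscale_to_1080p(src_path: str, dst_path: str) -> None:
    subprocess.run([
        "ffmpeg", "-y", "-i", src_path,
        "-vf", "scale=1920:1080:flags=lanczos",  # lanczos as the "good scaler"
        "-c:a", "copy",
        dst_path,
    ], check=True)

upscale_to_1080p("source_720p.mp4", "source_1080p.mp4")  # placeholder file names
```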
Would it be totally crazy to internally rescale the source to match the model? I'm not sure what the effect of rescaling with a good scaler would be, especially upsizing.
I'm not sure; it might work. But from a practical standpoint I am not sure what the consequences are for the broadcaster / verifier, as it means some computational overhead due to upscaling. I am currently researching the possibility of actually rescaling the metrics according to a factor dependent on the input's resolution. In the image below, different renditions (well encoded) seem to display some pattern of proportionality. This could be exploited to rescale the features depending on the input video size.
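A rough sketch of that idea, assuming a simple linear correction by pixel-count ratio; the actual factor would have to be fitted from plots like the one above, and the metric names here are only illustrative:

```python
# Hypothetical correction: scale each metric by the ratio between the actual
# source resolution and the 1080p resolution the model was trained on.
def rescale_features(features: dict, src_width: int, src_height: int,
                     ref_width: int = 1920, ref_height: int = 1080) -> dict:
    factor = (src_width * src_height) / (ref_width * ref_height)
    return {name: value * factor for name, value in features.items()}

# Features extracted from a 720p source, corrected towards the 1080p scale.
corrected = rescale_features({"temporal_dct": 0.12, "temporal_gaussian": 0.08},
                             src_width=1280, src_height=720)
print(corrected)
```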
Summarizing the current state of this issue for the record.
The first and fastest solution is probably to train a separate model for each of a set of common resolutions (e.g. 720p, 480p, 360p). The downsides of this solution are:
While we definitely want to address these downsides, this solution can be used in the interim while we look into additional areas of investigation.
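For reference, a minimal sketch of how that interim per-resolution model selection could look; the model paths and the pickle loader are assumptions about packaging, not the repo's actual layout:

```python
# Keep one trained tamper model per common source resolution and pick the
# closest one per verification request.
import pickle

MODELS_BY_RESOLUTION = {
    1080: "models/tamper_1080p.pkl",
    720: "models/tamper_720p.pkl",
    480: "models/tamper_480p.pkl",
    360: "models/tamper_360p.pkl",
}

def load_model_for_source(src_height: int):
    # Fall back to the closest trained resolution when there is no exact match.
    closest = min(MODELS_BY_RESOLUTION, key=lambda h: abs(h - src_height))
    with open(MODELS_BY_RESOLUTION[closest], "rb") as f:
        return pickle.load(f)
```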
Additional areas of investigation include:
Up until now, we have mainly been measuring the verifier's TPR/TNR/F20 scores by submitting requests to it directly. In some initial tests using the verifier integrated with the broadcaster, we have observed worse than expected results. These results are strange for a few reasons:
The 720p tamper model exhibits a much lower TPR than the 1080p tamper model (our results when directly submitting requests to the verifier seem to imply that the 720p and 1080p tamper models should produce similar results)
The 1080p tamper model exhibits a lower than expected TPR compared to our results when directly submitting requests to the verifier
The 1080p tamper model exhibits a lower TPR when verifying 240p, 360p and 720p renditions than when verifying only 240p and 360p renditions
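For clarity, this is how the TPR/TNR/F20 numbers mentioned above can be computed from labelled results; treating "positive" as "tampered" is an assumption about the convention used:

```python
# TPR = tampered renditions correctly flagged, TNR = honest renditions
# correctly accepted, F20 = F-beta score with beta=20 (recall-heavy).
from sklearn.metrics import confusion_matrix, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # 1 = tampered rendition, 0 = honest rendition
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # model's tamper decisions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
f20 = fbeta_score(y_true, y_pred, beta=20)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} F20={f20:.2f}")
```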
We can start to investigate the cause of these strange results by acquiring sample data when running the verifier integrated with the broadcaster. Some requirements for doing this are:
Some candidates to investigate as the cause of these strange results:
We could close this issue, as it follows along with https://github.com/livepeer/verification-classifier/issues/88 and https://github.com/livepeer/verification-classifier/issues/93.
Tracking the framerate issue in #93 and the test with GPU transcoding in #98, so closing this.
While testing the go-livepeer integration with the classifier, I'm consistently seeing success rates (tamper > 0) of less than 30% when trying to verify a normally transcoded video. This is some data from a single run of big buck bunny, with ~2s segments and transcoding a 720p source to 240p and 360p. There are 288 segments, so 576 tamper results for 2 renditions.
If needed, I can make a script that readies some data for further testing.
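A hypothetical version of such a script, assuming the verifier output is dumped to a CSV with a tamper column; the file path and column name are placeholders:

```python
# Read the per-segment verifier output and report the success rate,
# i.e. the fraction of results with tamper > 0.
import pandas as pd

results = pd.read_csv("bbb_720p_verification_results.csv")
success_rate = (results["tamper"] > 0).mean()
print(f"{len(results)} tamper results, success rate {success_rate:.1%}")
```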
Failures seem to be especially clustered around the smallest value of -3.598424. Here are the 10 lowest tamper values:

The high (non-tampered) scores have more of a spread in them; in fact, all instances of tamper > 0 are unique. Here are the highest 10 tamper values:

And here is the distribution of results for a single run of BBB with two renditions (720p -> 240p/360p).
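For reproducibility, a sketch of how the lowest/highest values and the distribution could be regenerated from the same dump; path and column name are again placeholders:

```python
# Summarize the tamper scores: 10 lowest, 10 highest, and a histogram.
import pandas as pd
import matplotlib.pyplot as plt

results = pd.read_csv("bbb_720p_verification_results.csv")
print(results["tamper"].nsmallest(10))   # lowest tamper values (failures)
print(results["tamper"].nlargest(10))    # highest tamper values

results["tamper"].hist(bins=50)
plt.xlabel("tamper score")
plt.ylabel("segment count")
plt.savefig("tamper_distribution.png")
```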