livepeer / research

Organization of the current open research tracks within the Livepeer Project

Using macroblocks from three separate transcode rendition jobs to determine content authenticity #36

Open yondonfu opened 5 years ago

yondonfu commented 5 years ago

Using macroblocks from three separate transcode rendition jobs to determine content authenticity:

What is a Macroblock? https://en.m.wikipedia.org/wiki/Macroblock

A macroblock is a processing unit in image and video compression formats based on linear block transforms, such as the discrete cosine transform (DCT). A macroblock typically consists of 16×16 samples, and is further subdivided into transform blocks, and may be further subdivided into prediction blocks.

We don't even need to get that technical here. For the scope of this effort, we only need to think of macroblocks as the blocks visible in a low-quality, zoomed-out, fuzzy low-resolution image generated from our transcoded renditions.

For example, take a 720p video segment. We can extract a still image from the I-Frame at the start of the segment. We can scale this 1280x720 image to a much smaller resolution, say, 128x72. Now we have much less data to deal with for comparisons, yet enough data to indicate the reality of the content.
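As a minimal sketch of that step, assuming OpenCV (`cv2`) is available and simply grabbing the first decoded frame rather than locating the I-frame in the bitstream:

```python
import cv2

def thumbnail_from_segment(path, size=(128, 72)):
    """Grab the first decoded frame of a video segment and shrink it."""
    cap = cv2.VideoCapture(path)
    ok, frame = cap.read()  # first decoded frame of the segment
    cap.release()
    if not ok:
        raise ValueError("could not decode a frame from " + path)
    # INTER_AREA averages each source region into a single output pixel,
    # giving exactly the fuzzy, zoomed-out view described above.
    return cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
```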

My idea for content authenticity verification is to leverage the data gathered from analyzing these macroblocks to determine if we're looking at the right content.

The idea is to take segments from three completed transcode jobs and use them to determine the valid stream. How exactly do we analyze the data? I'm sure we will all have different ideas here, but the basic approach is to use scaled-down still images to confirm that the renditions match.

One possible example is to compare every pixel and make sure that the color values fall within a given percent error range. We look to make sure that the data from 2 or more images match, or fuzzy match each other, to determine that the group is valid. There may be outliers, and those are less likely to be the authentic content.
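A rough sketch of that comparison (the function names and the 2% tolerance below are illustrative placeholders, not a proposed spec):

```python
import numpy as np

def fuzzy_match(a, b, tolerance=0.02):
    """True if every pixel of the two equally-sized stills agrees
    within `tolerance` (as a fraction of the full 0-255 range)."""
    diff = np.abs(a.astype(np.int16) - b.astype(np.int16))
    return bool((diff <= tolerance * 255).all())

def likely_authentic(stills, tolerance=0.02):
    """Indices of renditions whose stills fuzzy-match at least one
    other rendition; anything left out is an outlier and suspect."""
    matching = set()
    for i in range(len(stills)):
        for j in range(i + 1, len(stills)):
            if fuzzy_match(stills[i], stills[j], tolerance):
                matching.update((i, j))
    return sorted(matching)
```

Any rendition whose still matches no other rendition would be flagged as a likely outlier.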

yondonfu commented 5 years ago

For example, take a 720p video segment. We can extract a still image from the I-Frame at the start of the segment.

Randomly sampling the I-frames might be more robust from a security standpoint.

enough data to indicate the reality of the content.

Interestingly, the latest verification classifier performance results for various numbers of sampled frames include a section with frames scaled down to 128x72. I believe running the verification classifier with these downsized frames does not negatively impact the TPR (though bigger frames apparently improve the TNR). If 128x72 frames can be used for the verification classifier, then using them for this redundancy-based content authenticity check seems promising, although the feasibility ultimately depends on the actual algorithm used to compare the frames extracted from each of the renditions.

We look to make sure that the data from 2 or more images match, or fuzzy match each other, to determine that the group is valid. There may be outliers, and those are less likely to be the authentic content.

Seems like there are two components to this redundancy-based content authenticity check:

  1. Redundantly transcoding a single source segment
  2. Applying a comparison algorithm on the redundant renditions for the source segment

1 might be useful in its own right - you can tweak your redundancy factor to increase or decrease your confidence in a segment. It is worth noting that this scheme would require a broadcaster to wait for 3 renditions before being able to insert a rendition into its playlist. One way to help with the latency concerns here could be an M-of-N approach, where the broadcaster sends the source to N orchestrators and takes the first M returned for the content authenticity check.
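To illustrate the M-of-N idea (a sketch only; `orchestrator.transcode` is a hypothetical async call, not an existing API):

```python
import asyncio

async def first_m_of_n(source, orchestrators, m):
    """Send `source` to all N orchestrators, keep the first M
    renditions that come back, and cancel the stragglers."""
    tasks = [asyncio.ensure_future(o.transcode(source)) for o in orchestrators]
    renditions, pending = [], set(tasks)
    while pending and len(renditions) < m:
        done, pending = await asyncio.wait(pending,
                                           return_when=asyncio.FIRST_COMPLETED)
        renditions.extend(t.result() for t in done)
    for t in pending:
        t.cancel()  # ignore renditions that arrive too late
    return renditions[:m]
```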

2 is basically a variation of the problem that the verification classifier project seeks to solve. The verification classifier project tries to directly determine whether video is transcoded correctly by comparing the source with the rendition. This redundancy approach tries to indirectly determine whether video is transcoded correctly by comparing a set of redundant renditions, under the honesty assumption that the orchestrators returning the renditions do not collude (although the dependency on this assumption lessens as the redundancy factor increases). I think the choice of comparison algorithm is the important piece that needs clarity here.

mkrufky commented 5 years ago

For example, take a 720p video segment. We can extract a still image from the I-Frame at the start of the segment.

Randomly sampling the I-frames might be more robust from a security standpoint.

For sure. We should sample at random. I was only citing an example here so that everybody can understand it.

If 128x72 frames can be used for the verification classifier, then using them for this redundancy-based content authenticity check seems promising

Indeed. But please keep in mind that I chose this small resolution arbitrarily; it was simply to illustrate how my method can work. That said, if such assets are already being created, they can certainly be used for this purpose.

j0sh commented 5 years ago

For example, take a 720p video segment. We can extract a still image from the I-Frame at the start of the segment.

Randomly sampling the I-frames might be more robust from a security standpoint.

Any verification should be done on all display frame types; otherwise, tampering with non-reference frames becomes possible. For now, we should avoid making the verification algorithm specific to the bitstream or to the features of a particular codec (e.g., I-frames). Operating on fully decoded frames allows us to remain codec agnostic.
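In that spirit, sampling can operate purely on decoded frames with no reference to frame type, e.g. (an OpenCV sketch):

```python
import random
import cv2

def sample_decoded_frames(path, k=5):
    """Randomly pick k frame indices from a segment and return the
    fully decoded frames, regardless of I/P/B type."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in sorted(random.sample(range(total), min(k, total))):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # decoder seeks via nearest keyframe
        ok, frame = cap.read()
        if ok:
            frames.append((idx, frame))
    cap.release()
    return frames
```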

At some point we may need to look at specific bitstream features to ensure encoder conformance with the requested renditions (e.g., profiles and levels), but that would be an additional layer of checking alongside an image / video quality algorithm.

mkrufky commented 5 years ago

I don't necessarily agree with j0sh's points here. The point of this is a cursory comparison. It does not need to be run on every single frame - it only needs to be used on I-frames. It does not have to be codec agnostic, as this is intended to be used only for comparison of the resulting transcoded output streams.

mkrufky commented 5 years ago

POC: https://github.com/mkrufky/coersion

j0sh commented 5 years ago

It does not need to be run on every single frame - it only needs to be used on I-frames

I meant it should be run on all frame types [1], rather than just I-frames - within a segment, random sampling can still be done on any frame. Apologies if that was unclear.

There would still be a cost to decode those frames, but that may be relatively minor, since the other processing required to verify the frames in a segment may have its own overhead.

It does not have to be codec agnostic, as this is intended to be used only for comparison of the resulting transcoded output streams.

Which codec-specific bitstream features were you thinking of using, other than the notion of I-frames? The more codec-specific features we rely on, the less versatile the verifier becomes.

[1] Updated the original comment to reflect this

ndujar commented 5 years ago

POC: https://github.com/mkrufky/coersion

We have carefully explored the POC and examined the proposed fuzzy match feature. I have taken the liberty of including this metric as image_match_instant within the verification-classifier framework for testing purposes (see here for the branch and here for the implementation). Please feel free to review it and give us your feedback; we would love to hear your thoughts about it.

Once aggregated, the mean value of this metric over 60 frames has been used to train the current model (a One-Class Support Vector Machine). Re-scaling has been set to 480x270px. The video dataset comprises 4504 well-transcoded renditions and 178593 "synthetic" attacks (more on how they were made here). This means we have tested the feature over some 180k videos.
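For reference, that training step might look roughly like the following with scikit-learn; `per_frame_metrics` and `good_renditions` are hypothetical placeholders for their pipeline:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# One row per well-transcoded rendition; each column is the mean of a
# metric (e.g. the new fuzzy match metric) over the 60 sampled frames.
X = np.vstack([per_frame_metrics(v).mean(axis=0) for v in good_renditions])

# Trained on good renditions only; attacks should land outside the boundary.
model = OneClassSVM(kernel="rbf", gamma="auto", nu=0.01)
model.fit(X)

# model.predict(features) returns +1 for inliers (plausible renditions)
# and -1 for outliers (likely attacks).
```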

Note that in our current implementation the frames are pre-processed for performance reasons (taking only the V channel of the HSV conversion).
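That pre-processing step is essentially (an OpenCV sketch):

```python
import cv2

def value_channel(frame_bgr):
    """Keep only the V (brightness) plane of the HSV conversion,
    reducing each frame to a single channel before computing metrics."""
    return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)[:, :, 2]
```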

To date, we have relied on a selection of features, using only their mean values in time (out of 60 frames per video).

The good news is that, combined with the mean of this new metric, we reach slightly better performance.

However, it is unclear to me where the macroblocks discussion enters into either the POC or our implementation. Would it be possible to elaborate a bit further?

mkrufky commented 5 years ago

The idea of the coercion POC is to decrease the resolution of the input images to a user-defined, much smaller size before doing the pixel RGBA comparisons.

The image size reduction allows us to look at a larger area and take a fuzzier RGBA value for the pixels in that area by representing it with a smaller area.
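Concretely, that reduction amounts to block averaging, where each coarse pixel stands in for the mean colour of a whole block of source pixels (a numpy sketch, assuming the block size divides the image evenly):

```python
import numpy as np

def block_average(img, factor=10):
    """Collapse each factor x factor block of pixels into its mean,
    so one coarse pixel represents a whole area of the original."""
    h, w, c = img.shape
    blocks = img.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3)).astype(img.dtype)
```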

It's not clear to me if this is happening in the Python code.

ndujar commented 5 years ago

The idea of the coercion POC is to decrease the resolution of the input images to a user-defined, much smaller size before doing the pixel RGBA comparisons.

The image size reduction allows us to look at a larger area and take a fuzzier RGBA value for the pixels in that area by representing it with a smaller area.

It's not clear to me if this is happening in the Python code.

Indeed, resizing is done prior to any computation. The repository is under development, but the core of what the verification tool does lives in the asset_processor folder.

The video_metrics.py module is designed as a factory that can later be called selectively to compute the required metrics on a per-frame basis.

Assets are gathered and captured by the video_asset_processor.py module, so that several renditions can be compared and verified against the same source video. This is where the transcoding, random sampling and resizing happen, just before the reduced streams are converted into more efficient numpy array structures and held in memory. Then, for each reduced stream of each rendition, the selected metrics are computed.
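Schematically, the per-frame metric step then reduces to something like this (names are stand-ins, not the actual video_metrics.py API):

```python
def compute_metrics(reference_frames, rendition_frames, metrics):
    """Apply each metric to corresponding (source, rendition) frame
    pairs; both inputs are sequences of sampled, resized frames and
    each metric is a callable f(ref_frame, rend_frame) -> float."""
    return {metric.__name__: [metric(ref, ren)
                              for ref, ren in zip(reference_frames,
                                                  rendition_frames)]
            for metric in metrics}
```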

Regarding the optimal resizing dimensions, we have conducted several experiments on the matter (as mentioned by @yondonfu above). Our conclusion was that the information lost to interpolation during resizing mostly affects the model's rate of false negatives (i.e. badly transcoded renditions that pass as valid).

I hope this clarifies things a bit. Please don't hesitate to ask if you have any further questions.