gkordo / s2vs

Authors' official PyTorch implementation of "Self-Supervised Video Similarity Learning" [CVPRW 2023]
MIT License

Toy example with two videos #3

Closed benemana closed 11 months ago

benemana commented 11 months ago

Hi, I tested your pretrained model using the two videos inside data/examples.

Starting from the suggestions you provided, I wrote the following code

import torch
from utils import load_video
from model.similarity_network import ViSiL
import evaluation as eval # This is your evaluation.py module

feat_extractor = torch.hub.load('gkordo/s2vs:main', 'resnet50_LiMAC')
s2vs_dns = torch.hub.load('gkordo/s2vs:main', 's2vs_dns')
s2vs_vcdb = torch.hub.load('gkordo/s2vs:main', 's2vs_vcdb')

# Load the two videos from the video files
query_video = torch.from_numpy(load_video('./data/examples/video1/'))
target_video = torch.from_numpy(load_video('./data/examples/video2/'))

# Initialize pretrained ViSiL model
#model = ViSiL(pretrained='s2vs_dns').to('cuda')
model = SimilarityNetwork['ViSiL'].get_model(pretrained='s2vs_dns').to('cuda')
model.eval()

# Extract features of the two videos
query_features = eval.extract_features(feat_extractor.to('cuda'), query_video.to('cuda'))
target_features = eval.extract_features(feat_extractor.to('cuda'), target_video.to('cuda'))

# Calculate similarity between the two videos
similarity = model.calculate_video_similarity(query_features, target_features)
print(similarity)

The similarity score I got was about 0.79.

Since video1 and video2 are completely different, I would have expected a lower value for the similarity score. I'm mainly interested in the copy detection task, and I wonder whether 0.79 can actually be considered a "low" value, such that I can argue that the two videos are not potential copies.

Maybe I'm missing something or my code is wrong.

Any help would be really appreciated.

Thank you again for this work

gkordo commented 11 months ago

Hi @benemana. You first need to pass the extracted features through the similarity model's index_video to estimate the similarity between the two videos correctly.

Modifying your code, this would be as follows:

import torch
from utils import load_video
import evaluation as eval # This is your evaluation.py module

# Load the two videos from the video files
query_video = torch.from_numpy(load_video('./data/examples/video1/'))
target_video = torch.from_numpy(load_video('./data/examples/video2/'))

# Initialize pretrained ViSiL model
feat_extractor = torch.hub.load('gkordo/s2vs:main', 'resnet50_LiMAC').to('cuda')
model = torch.hub.load('gkordo/s2vs:main', 's2vs_dns').to('cuda')
model.eval()

# Extract features of the two videos
query_features = eval.extract_features(feat_extractor, query_video.to('cuda'))
target_features = eval.extract_features(feat_extractor, target_video.to('cuda'))

# Index the two videos with model
query_indexed_features = model.index_video(query_features)
target_indexed_features = model.index_video(target_features)

# Calculate similarity between the two videos
similarity = model.calculate_video_similarity(query_indexed_features, target_indexed_features)
print(similarity)

Please let me know if that works.

benemana commented 11 months ago

EDIT: After further reading, I updated the code as follows

[...]
# Extract features of the two videos
query_features = eval.extract_features(feat_extractor.to('cuda'), query_video.to('cuda'))
target_features = eval.extract_features(feat_extractor.to('cuda'), target_video.to('cuda'))

query_index = model.index_video(query_features)
target_index = model.index_video(target_features)

# Calculate similarity between the two videos
similarity = model.calculate_video_similarity(query_index, target_index)

Now the results appear to be much more accurate, and I noticed that completely different videos get a negative similarity score.

benemana commented 11 months ago

Thank you so much! This is pretty similar to the new version of the code I implemented yesterday evening, as reported in the EDIT message above.

Only one question: in your experience, which similarity threshold would be reasonable for the task of video copy detection?

I ran some experiments with a target video A and three query videos X, Y, and Z:

  1. X was an edited version of A, stacked with a large white banner containing subtitles; the similarity score given by the network was about 0.95
  2. Y was an edited version of X, cut down to the first 30 seconds (out of a total of 2 minutes and 30 seconds); the similarity score was about -0.1
  3. Z was a completely different video from A; the similarity score was about -0.9

So, from what I'm seeing here, even slightly negative scores could still indicate a potential video copy.
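
For reference, here is a minimal sketch of how such pairwise comparisons can be scripted, reusing the calls from the snippets above. The file paths and the 0.0 decision threshold are placeholders rather than values from the repo, and converting the result with float() assumes calculate_video_similarity returns a single-element tensor.

import torch
from utils import load_video
import evaluation as eval  # the repo's evaluation.py module

# Placeholder paths and decision threshold; replace with your own videos
# and a threshold calibrated for your use case.
target_path = './data/examples/video1/'
query_paths = ['./data/examples/video2/']
threshold = 0.0

feat_extractor = torch.hub.load('gkordo/s2vs:main', 'resnet50_LiMAC').to('cuda')
model = torch.hub.load('gkordo/s2vs:main', 's2vs_dns').to('cuda')
model.eval()

# Extract and index the target video once
target_video = torch.from_numpy(load_video(target_path))
target_index = model.index_video(eval.extract_features(feat_extractor, target_video.to('cuda')))

# Compare every query video against the indexed target
for path in query_paths:
    query_video = torch.from_numpy(load_video(path))
    query_index = model.index_video(eval.extract_features(feat_extractor, query_video.to('cuda')))
    similarity = model.calculate_video_similarity(query_index, target_index)
    verdict = 'potential copy' if float(similarity) >= threshold else 'not a copy'
    print(f'{path}: similarity={float(similarity):.2f} -> {verdict}')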

Thank you again for your support

gkordo commented 11 months ago

Unfortunately, this is not an easy question to answer and needs more digging. The usual factors to take into account for this decision are the queries you anticipate, the database videos you have, the underlying application, and the precision level at which you want the system to operate. In my experience, a value around 0 is a rather safe threshold, but the above factors can shift this value significantly.

The safest practice is to calibrate this threshold on a representative annotated dataset, selecting the value that yields the precision at which you want your system to operate.
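
As an illustration, here is a minimal calibration sketch using scikit-learn's precision_recall_curve. The labels and scores arrays are hypothetical placeholders standing in for an annotated dataset of query/target pairs scored with calculate_video_similarity.

import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical calibration data: one similarity score per annotated pair,
# with label 1 for true copies and 0 for unrelated videos.
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.95, 0.40, 0.20, -0.10, -0.50, -0.90])

precision, recall, thresholds = precision_recall_curve(labels, scores)

target_precision = 0.90  # precision level the system should operate at
# precision[i] is the precision of predictions with score >= thresholds[i]
valid = np.where(precision[:-1] >= target_precision)[0]
threshold = thresholds[valid[0]] if len(valid) else thresholds[-1]
print(f'similarity threshold for precision >= {target_precision}: {threshold:.2f}')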

gkordo commented 11 months ago

I am closing this issue as it has been resolved. Feel free to reopen it if anything remains unaddressed or if you want to ask anything related to it.