simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0

How do you look at retrieved text data? #44

Open mtsandmeyer opened 2 years ago

mtsandmeyer commented 2 years ago

Hello, I have run the training and embedding extraction and I'm wondering how I can see any examples of text that the model retrieved.

The embeddings and h5 files seem to be mostly numeric. How do I see the string data?

simon-ging commented 2 years ago

Hi, there's no code available for this, but I'll try to give a short explanation of what to do, starting from the extracted embeddings in h5 format.

In the extracted embeddings the field "key" tells you which video is at which index. Select one key (the video you want to retrieve from) and remember its position in the list. Then take the corresponding row of the matrix "vid_emb"; this gives you the embedding vector for that video.

Then do the retrieval: compare this vector to the entire "par_emb" matrix with cosine similarity (check coot/loss_fn.py), which gives you the similarities. Take the indices of e.g. the 5 vectors in "par_emb" with the highest similarity to your input video vector. Go back to "key" and look up the keys at those indices; now you have the top-5 model predictions for the video you started with.
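In code, the high-level (video-paragraph) retrieval could look roughly like this. It's an untested sketch: the h5 path is a placeholder, I'm assuming the keys are stored as byte strings, and plain cosine similarity is used here as a stand-in for the exact similarity in coot/loss_fn.py:

```python
import h5py
import numpy as np

emb_file = "path/to/extracted_embeddings.h5"  # placeholder path

with h5py.File(emb_file, "r") as f:
    # "key" tells you which video sits at which row of the matrices
    keys = [k.decode() if isinstance(k, bytes) else k for k in f["key"][:]]
    vid_emb = f["vid_emb"][:]   # (num_videos, dim) video embeddings
    par_emb = f["par_emb"][:]   # (num_videos, dim) paragraph embeddings

query_key = keys[0]                    # the video you want to retrieve from
query_idx = keys.index(query_key)
query_vec = vid_emb[query_idx]

# cosine similarity = dot product of L2-normalized vectors
query_vec = query_vec / np.linalg.norm(query_vec)
par_norm = par_emb / np.linalg.norm(par_emb, axis=1, keepdims=True)
sims = par_norm @ query_vec            # (num_videos,) similarities

top5_idx = np.argsort(-sims)[:5]       # indices of the 5 most similar paragraphs
top5_keys = [keys[i] for i in top5_idx]
print("query video:", query_key)
print("top-5 retrieved paragraph keys:", top5_keys)
```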

Finally check data/$dataset/meta_all.json to get the corresponding text for a given key.
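For that last step, something like this should do it (again a sketch; replace $dataset with your dataset name, and inspect meta_all.json once yourself, since I'm only assuming here that it's a dict keyed by video id):

```python
import json

meta_file = "data/$dataset/meta_all.json"  # replace $dataset with your dataset name

with open(meta_file, encoding="utf8") as fh:
    meta = json.load(fh)                   # assumed: dict keyed by video id

for key in top5_keys:                      # keys from the retrieval sketch above
    print(key)
    print(meta[key])                       # metadata including the text for this key
```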

For low-level retrieval (clip-sentence) it's a little more complicated: you also have to take the field "clip_num" into account, which tells you the number of clips per video. You need it to read the flat "clip_emb" and "sent_emb" fields back into a hierarchical structure, i.e. per-video groups with a variable number of clips and sentences each.
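A sketch for reading the flat fields back into per-video groups, assuming the flat arrays are ordered video by video in the same order as "key" and that sentences align one-to-one with clips (which is why only "clip_num" is needed):

```python
import h5py
import numpy as np

with h5py.File(emb_file, "r") as f:    # same file as in the sketch above
    keys = [k.decode() if isinstance(k, bytes) else k for k in f["key"][:]]
    clip_num = f["clip_num"][:]        # number of clips per video, (num_videos,)
    clip_emb = f["clip_emb"][:]        # flat clip embeddings, (total_clips, dim)
    sent_emb = f["sent_emb"][:]        # flat sentence embeddings, (total_clips, dim)

# cumulative clip counts give the split points into the flat arrays
offsets = np.concatenate(([0], np.cumsum(clip_num)))
clips_per_video = {k: clip_emb[offsets[i]:offsets[i + 1]] for i, k in enumerate(keys)}
sents_per_video = {k: sent_emb[offsets[i]:offsets[i + 1]] for i, k in enumerate(keys)}

# low-level retrieval then compares a clip embedding of one video against
# all sentence embeddings, the same way as in the video-paragraph case above
```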