Hi @jjihwann,
Thanks for your interest! I have updated the requirements as well as the related demo code. Can you re-try it and let me know? Thanks!
Actually, I thought your requirements could be reduced, so I used

```bash
conda create -n univtg python==3.8.1
pip install torch==1.12.1 gradio numpy==1.24.2 ffmpeg-python==0.2.0 \
    torchvision==0.13.1 ftfy==6.1.1 regex==2022.10.31 tabulate==0.9.0 \
    scipy==1.10.0
```

and it seems to be working well now.
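For reference, a quick sanity check that the pins resolved (torch may print a local suffix such as `+cu113`):

```python
# Verify the pinned versions from the pip command above
import torch, torchvision, numpy, scipy
print(torch.__version__)        # expected: 1.12.1 (possibly with a +cuXXX suffix)
print(torchvision.__version__)  # expected: 0.13.1
print(numpy.__version__)        # expected: 1.24.2
print(scipy.__version__)        # expected: 1.10.0
```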
Also, I don't know why, but --resume ./results/omni/model_best.ckpt does not work, so I modified the default value in config.py (line 96).
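The change looks roughly like this (a sketch only; the actual argument name and style in config.py may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical sketch of the workaround: hard-code the checkpoint path
# as the default instead of relying on --resume at the command line.
parser.add_argument("--resume", type=str,
                    default="./results/omni/model_best.ckpt",
                    help="checkpoint path to load")
args = parser.parse_args([])  # the default now points at the checkpoint
```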
Moreover, the code throws an error if
Also, it seems that the txt2clip() function in video_extractor.py should contain the following line before the return:

```python
np.savez(os.path.join("./tmp", 'txt.npz'), features=text_feature)
```
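In context, the fix would sit roughly like this (a sketch only; the real function's signature and the way text_feature is computed may differ):

```python
import os
import numpy as np

def txt2clip(model, text, output_dir="./tmp"):
    # Encode the query text into a feature vector (details elided; the
    # actual encoder call in video_extractor.py may differ)
    text_feature = model.encode_text(text)
    # The missing line: persist the feature so the demo can load it later
    os.makedirs(output_dir, exist_ok=True)
    np.savez(os.path.join(output_dir, 'txt.npz'), features=text_feature)
    return text_feature
```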
Sorry for the poor grammar 😂
Awesome! Thanks for your detailed suggestions. I have updated the requirements and added a video to the current repo, as well as the tmp directory.
As for the code, sorry for the bugs; I actually updated it yesterday, so it should now run smoothly. You can compare the versions, and I suggest replacing your current code with my updated one.
Please let me know if you have any issues, and whether you can run it successfully.
Excellent! I think it works well now!
But I still have a small question.
What is the main difference between "foreground" and "saliency"?
From your paper, I understood that "foreground" is a discrete value and "saliency" is continuous, but the actual values in the code are both continuous.
They seem to play similar roles. Could you explain a bit more?
Hi @jjihwann, you raise a good question. Yes, the foreground and saliency heads both predict continuous scores at inference time, but during training they receive different types of supervision, i.e., binary classification and contrastive learning, respectively. We introduce both so that we can integrate the two kinds of supervision to improve the model. This is very similar to the image-text matching and image-text contrastive learning objectives in vision-language pretraining, e.g., https://arxiv.org/abs/2201.12086 and https://arxiv.org/pdf/2107.07651.pdf.
In practice, though, you can use the foreground head or the saliency head flexibly, or ensemble their scores to get a more stable prediction. One difference is that the saliency score is cheaper to compute: you can calculate it between any video and any text with a simple dot product, whereas the foreground prediction requires feeding a video-text pair through the model to get the corresponding score.
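For example (a rough sketch with made-up shapes, not the repo's actual API):

```python
import torch

# Made-up shapes: 100 clip embeddings and one text embedding, both d=512
video_feats = torch.randn(100, 512)
text_feat = torch.randn(512)

# Saliency: a plain dot product scores ANY (video, text) pair, even when
# the two were embedded independently of each other.
saliency_scores = video_feats @ text_feat  # shape (100,)

# Foreground: the score comes from a head run on the fused video-text
# pair, so each pairing needs its own forward pass, e.g.:
#   foreground_scores = foreground_head(fuse(video_feats, text_feat))
```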
Hope my response answers your question!
Thanks for your kind explanation; I understand now.
Thank you!
Closing since the issues are resolved.
Hi! Thanks for your nice research and code.
I'm trying to set up the environment to run your code, but it doesn't work because many of the requirements end with @ file:///~~
My terminal said:

```
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/croot/aiohttp_1670009560265/work'
```

How can I fix it?
+++ In addition, similar to the cause of the first issue, I changed the import section of video_extractor.py to