Questions on UniVTG - Githubissues

nqx12348 commented 1 year ago

Hi, congratulations on your great success! I have two questions about UniVTG:

1. ActivityNet-Captions is one of the most commonly used datasets in video moment retrieval, but I can't find results on this dataset in the paper. Have you tested UniVTG on it?
2. I tried your online demo and found that the model gives completely different predictions for two identical text inputs. Why is this happening?

Thanks!

QinghongLin commented 1 year ago

@nqx12348, thanks for your interest and for asking! Both are valuable questions.

For ActivityNet, one issue is that most baselines use existing pre-extracted video features (e.g., C3D), while our unified co-training requires all benchmarks to use the same features (e.g., SlowFast + CLIP), so we need to extract the ActivityNet features ourselves. While downloading ActivityNet, we found that most of the RGB video links are invalid and cannot be accessed. As a result, we are unable to match the previous benchmark setting, i.e., the number of training and testing samples. Similar issues occur with the DiDeMo and MAD benchmarks (the videos cannot be accessed). We therefore selected Charades / NLQ / TACoS, since we can fully access all of their videos.
Regarding the second question, thank you for the reminder! I have just discovered this problem and am trying to find the cause; I will update later.

jjihwann commented 1 year ago

@QinghongLin Regarding the second problem:

It seems that the forward() function in main_gradio.py should call

model.eval() just before with torch.no_grad(): (probably around line 82 of main_gradio.py)
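To illustrate the suggestion, here is a minimal sketch on a toy network, not the actual UniVTG model or the real code in main_gradio.py. The names `toy_model` and `predict` are hypothetical. The point it shows is that torch.no_grad() only disables gradient tracking; it does not switch layers such as dropout out of training mode, so model.eval() has to be called separately to get repeatable predictions. Whether dropout is the specific source of randomness in the demo is an assumption here.

```python
import torch
import torch.nn as nn

# Toy stand-in for the real model (hypothetical; not the UniVTG architecture).
# The Dropout layer is what makes training-mode outputs stochastic.
toy_model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(16, 2),
)


def predict(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Run inference with stochastic layers disabled and no gradient tracking."""
    model.eval()           # put dropout / batch norm into inference behavior
    with torch.no_grad():  # skip autograd bookkeeping; does NOT change layer modes
        return model(x)


x = torch.randn(1, 8)
first = predict(toy_model, x)
second = predict(toy_model, x)

# With eval() in place, the same input yields the same output on both calls.
print(torch.equal(first, second))
```

Without the model.eval() call, the dropout mask would be resampled on every forward pass, so two calls on the same input would generally disagree.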

QinghongLin commented 1 year ago

Hi @jjihwann, sorry for this stupid mistake. I have updated the corresponding code in the repo. Thank you again!

QinghongLin commented 1 year ago

Following @jjihwann's suggestion, the problem of different predictions for identical inputs has now been resolved. Thanks.

QinghongLin commented 1 year ago

Closing since the problem is solved; please reopen if you have any new issues.

showlab / UniVTG