Open xavigiro opened 6 years ago
This project is similar to the Smart text project from DLSL Winter School 2017.
Check the lectures on Language Modelling and Word Embeddings on our DLSL Winter School 2017.
If generative model does not work, you may also try to solve this task from a retrieval perspective. This would mean splitting the speech into windows and retrieving the most similar video clip. In order to solve that, you would need to train a joint embedding space (as Amaia Salvador's Pic2Recipe CVPR 2017 or current research by Dídac Surís).
You may have a problem of mode collapse.
You may also consider solving the problem with a supervised approach (as in You said that, BMVC 2017).
Extracting the position over the map with a CNN could be a nice deep learning problem.
Explore the Action space, which is relevant to reinforcement learning, but maybe not so much for deep learning.
Reply to this issue with a link to your slides to present in class. Presentation should be 5 minutes or less. Make sure your slides are public.