Thanks for sharing this excellent research!
I'm confused by some lines in 'model.py':
input_data : ndarray
Must be three dimensional, where first dimension is the number of
input video stream(s), the second is the number of time steps, and
the third is the size of the visual encoder output for each time
step. Shape of tensor = (n_vids, L, input_size).
--- (n_vids, L=video_length, D=500) = (1, video_length, 500) ---
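To make sure I'm reading the docstring right, here is a minimal sketch of how I build that tensor for a single video (the random features are just a stand-in for the real 500-dim visual encoder output; only the final shape matters here):

```python
import numpy as np

video_length = 1024   # number of time steps L for one video
input_size = 500      # visual encoder output size per time step

# stand-in for the per-time-step features of one video
features = np.random.rand(video_length, input_size).astype(np.float32)

# one video stream -> add a leading axis so the tensor is
# (n_vids, L, input_size) = (1, video_length, 500)
input_data = features[np.newaxis, ...]
assert input_data.shape == (1, video_length, input_size)
```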
I ran your code and the pre-trained model on THUMOS14, then updated the results in recall_eval.ipynb and re-ran plot_results.ipynb.
The 'sst_demo' curve (predicted with sst_demo_th14_k32.hkl) is much lower than the DAPs and SST curves in the average_recall figure: the highest average recall I get is 0.588, versus 0.637 in your figure.
I am wondering whether, when predicting, the input_data for each video should have shape (1, video_length, 500) according to your paper.
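For concreteness, this is the per-video prediction loop I have in mind (the `model` object and its `predict` method are placeholders for whatever entry point model.py actually exposes, not the real API):

```python
import numpy as np

def predict_per_video(model, feature_list):
    """feature_list: list of (video_length_i, 500) arrays, one per video."""
    proposals = []
    for features in feature_list:
        # feed each video separately as (1, video_length_i, 500)
        input_data = features[np.newaxis, ...]
        proposals.append(model.predict(input_data))
    return proposals
```

Is this the intended usage, or should multiple videos be batched together along the first axis?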