Sorry for the late reply. Hope you still need the answers.
The maximum length of a single video clip is not limited to 16 frames. These days, many researchers also use 32 or 64 frames. How far you can go depends on your GPU memory. Normally, the longer the clip, the better the performance.
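To make the clip length concrete, here is a minimal sketch of how frame indices for one clip could be picked from a video. The function name `sample_clip_indices` and its parameters `clip_len` and `stride` are my own illustration, not part of any specific codebase, and wrapping around for short videos is just one possible choice.

```python
import random


def sample_clip_indices(num_frames, clip_len=16, stride=1, rng=None):
    """Pick `clip_len` frame indices from a video with `num_frames` frames.

    Hypothetical helper for illustration. A random start is chosen so the
    clip fits inside the video when possible; if the video is shorter than
    the clip, indices wrap around to the beginning (an assumption, padding
    or repeating the last frame would work too).
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    rng = rng or random
    # Number of source frames covered by one clip at the given stride.
    span = (clip_len - 1) * stride + 1
    max_start = max(num_frames - span, 0)
    start = rng.randint(0, max_start)
    return [(start + i * stride) % num_frames for i in range(clip_len)]


# Example: a 32-frame clip from a 300-frame video.
indices = sample_clip_indices(300, clip_len=32)
```

Changing `clip_len` to 32 or 64 only changes how many indices come back; the memory cost grows roughly with that number, which is why GPU memory is the practical limit.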
It is fine to first extract the videos into frames (this is what I normally do). There are also libraries that can load video data directly.
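For the frame extraction step, a rough sketch could look like the following. It assumes the `ffmpeg` command-line tool is installed and on your PATH; the helper names `build_ffmpeg_cmd` and `extract_frames`, the JPEG output, and the `%05d.jpg` naming pattern are my own choices for illustration, not necessarily what any particular repository expects.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(video_path, out_dir, fps=None):
    """Build an ffmpeg command that dumps a video into numbered JPEG frames.

    Hypothetical helper. If `fps` is given, frames are resampled to that
    rate; otherwise the video's native frame rate is kept.
    """
    cmd = ["ffmpeg", "-i", str(video_path)]
    if fps is not None:
        cmd += ["-vf", f"fps={fps}"]
    # -q:v 2 asks for high-quality JPEGs; frames are named 00001.jpg, 00002.jpg, ...
    cmd += ["-q:v", "2", str(Path(out_dir) / "%05d.jpg")]
    return cmd


def extract_frames(video_path, out_dir, fps=None):
    """Run ffmpeg and return the sorted list of extracted frame paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(video_path, out_dir, fps), check=True)
    return sorted(Path(out_dir).glob("*.jpg"))
```

For a custom dataset you would call `extract_frames` once per video, for example with one output folder per video, and then read the frames from those folders during training.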
I want to try this with a custom video dataset. How can I do it? I have the following questions.