Hello,
My question may seem naive, but I'm confused by the arguments passed to the S4ND model (dim, d_model, d_state, channels, out_channels, d_output, ...) and by the input shape it expects.
Assume I have a video of shape (batch_size, nb_frames, nb_channels, height, width) = (1, 30, 3, 128, 128). How should I reshape the video before passing it to the S4ND model, and which model arguments should be adjusted accordingly?
Also, if the desired output is a sequence of labels of length nb_frames (so that each frame of the video gets its own label), which argument should be adjusted: d_output, out_channels, or something else?
I would highly appreciate your prompt response.