Open lucidrains opened 2 years ago
Hi, regarding the GRUCell in the Student network: how can you pass as input a batch of sequence features of shape (batch_size, time, feature_size)? For example, a tensor of size (16, 60, 2048). According to the PyTorch documentation, the GRUCell() class only accepts inputs of shape `(N, H_in)` or `(H_in)`, i.e. a tensor containing input features where `H_in = input_size`. Why not use the GRU() class? I need to keep the temporal dimension in my input and not flatten it into the feature dimension.
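For reference, here is a minimal sketch of the two options the question contrasts. It is not taken from this repository: `hidden_size = 256` and all variable names are assumptions made for illustration. It steps an `nn.GRUCell` over the time axis of a (16, 60, 2048) tensor one step at a time, then runs a single-layer `nn.GRU` with the same weights over the whole sequence and checks that the outputs agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Shapes from the question; hidden_size is an assumed value.
batch_size, time_steps, feature_size, hidden_size = 16, 60, 2048, 256
x = torch.randn(batch_size, time_steps, feature_size)

# Option 1: nn.GRUCell handles one time step per call, so loop over time.
cell = nn.GRUCell(input_size=feature_size, hidden_size=hidden_size)
h = torch.zeros(batch_size, hidden_size)
cell_steps = []
for t in range(time_steps):
    h = cell(x[:, t], h)  # x[:, t] has shape (batch_size, feature_size)
    cell_steps.append(h)
cell_outputs = torch.stack(cell_steps, dim=1)  # (batch, time, hidden)

# Option 2: nn.GRU consumes the whole (batch, time, features) tensor at once.
gru = nn.GRU(input_size=feature_size, hidden_size=hidden_size, batch_first=True)

# Copy the cell's weights into the GRU so both compute the same function.
with torch.no_grad():
    gru.weight_ih_l0.copy_(cell.weight_ih)
    gru.weight_hh_l0.copy_(cell.weight_hh)
    gru.bias_ih_l0.copy_(cell.bias_ih)
    gru.bias_hh_l0.copy_(cell.bias_hh)

gru_outputs, h_n = gru(x)  # gru_outputs: (batch, time, hidden)

# True when the two outputs match up to floating-point tolerance.
outputs_match = torch.allclose(cell_outputs, gru_outputs, atol=1e-4)
```

In both cases the time dimension is preserved. The per-step loop is mainly useful when something has to happen between steps, such as acting in an environment during a rollout, whereas `nn.GRU` is the more convenient choice when the full sequence is available up front.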
i'm fairly sure i got the student network correct, as well as the teacher -> student distillation code
but not confident about how the rollouts are done (and the subsequent learning and truncated BPTT)
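For context on the truncated BPTT mentioned above, here is a generic sketch of the technique with a `GRUCell`. It is not this repository's rollout or learning code: the sizes, the linear `head`, the random targets `y`, and the MSE loss are all placeholders assumed for illustration. The key step is detaching the hidden state at each chunk boundary, so gradients flow back at most `chunk` steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# All sizes, the linear head, and the MSE loss are illustrative assumptions.
batch_size, time_steps, feature_size, hidden_size, chunk = 4, 12, 8, 16, 4

cell = nn.GRUCell(feature_size, hidden_size)
head = nn.Linear(hidden_size, 1)
params = list(cell.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

x = torch.randn(batch_size, time_steps, feature_size)  # placeholder inputs
y = torch.randn(batch_size, time_steps, 1)             # placeholder targets

h = torch.zeros(batch_size, hidden_size)
num_updates = 0
for start in range(0, time_steps, chunk):
    # Keep the hidden state's value but cut its graph, so backward()
    # stops at the chunk boundary instead of reaching earlier chunks.
    h = h.detach()
    loss = torch.zeros(())
    for t in range(start, min(start + chunk, time_steps)):
        h = cell(x[:, t], h)
        loss = loss + nn.functional.mse_loss(head(h), y[:, t])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    num_updates += 1
```

With 12 time steps and a chunk length of 4, this performs three parameter updates, each backpropagating through four steps.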