Video models require 5D inputs (batch, channels, frames, height, width). But most of the parameterization, transforms and rendering functions in Lucent assume 4D inputs.
A simple workaround is to initialize a batch of images with batch_size = batch * frames. Then, inside the render_vis function, just before passing the input to the model, we transpose and unsqueeze it into a 5D shape.
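A minimal sketch of that reshape step (the helper name `to_video_input` is hypothetical, not part of Lucent), assuming the optimized tensor was initialized with batch_size = batch * frames:

```python
import torch

# Hypothetical helper (not part of Lucent): convert the flattened 4D batch
# back into the 5D (batch, channels, frames, height, width) layout that a
# video model expects.
def to_video_input(x, batch, frames):
    n, c, h, w = x.shape
    assert n == batch * frames, "flattened batch must equal batch * frames"
    # (batch * frames, C, H, W) -> (batch, frames, C, H, W)
    x = x.view(batch, frames, c, h, w)
    # (batch, frames, C, H, W) -> (batch, C, frames, H, W)
    return x.permute(0, 2, 1, 3, 4)

# e.g. a flattened batch of 2 videos x 16 frames of 3x64x64 images
x = torch.randn(2 * 16, 3, 64, 64)
video = to_video_input(x, batch=2, frames=16)  # shape (2, 3, 16, 64, 64)
```

Note that `permute` returns a non-contiguous view; if the model requires contiguous memory, add a `.contiguous()` call before the forward pass.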
Specifically, this means replacing https://github.com/greentfrapp/lucent/blob/044317a7b395220e6a27fd890c35abc081c5d1c8/lucent/optvis/render.py#L73 in render.py so that the model receives the 5D input instead. But I'm wondering if there is a better solution.

Also, ideally we want the frames to be temporally coherent, which suggests an objective that maximizes alignment between frames. Maybe objectives.alignment("input") will be sufficient for this.
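If objectives.alignment("input") turns out not to fit the 5D case, a plain PyTorch fallback could be a temporal-coherence penalty on consecutive frames. This is a sketch, not Lucent API; `frame_alignment_loss` is a hypothetical name:

```python
import torch

# Hypothetical temporal-coherence penalty: penalize squared differences
# between consecutive frames along the time axis of a 5D input of shape
# (batch, channels, frames, height, width). Minimizing this term (or
# maximizing its negative as an objective) encourages smooth frames.
def frame_alignment_loss(video, weight=1.0):
    # differences between frame t+1 and frame t, for all t
    diffs = video[:, :, 1:] - video[:, :, :-1]
    return weight * diffs.pow(2).mean()
```

A loss of exactly zero means all frames are identical, so in practice this would be balanced against the main feature-visualization objective to avoid collapsing the video into a static image.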