lucidrains / voicebox-pytorch

Implementation of Voicebox, the new SOTA text-to-speech model from Meta AI, in PyTorch
MIT License

Fix conditioning to allow speech in-filling #39

Closed lucasnewman closed 9 months ago

lucasnewman commented 9 months ago

There are a few tweaks to conditioning needed to allow speech in-filling as described in the paper to work correctly. This is a small code change but a relatively large functional change, so I'm open to discussion!

1) During training, use the un-noised target as the conditioning (aka X_ctx in the paper) instead of the X_t value for the previous timestep in the flow. Without this, the conditioning is noisy and the network can't correctly use the full context to predict the flow for the in-filling portion.

2) Remove the MLM-style training objective and only use fractional masking for training efficiency, since the target is the final step of the flow based on the X_t intermediate step and not an unmasked version of the input.

3) Don't mask the input X_t of the flow, since that's basically throwing away information sampled at previous timesteps of the ODE. The network will learn to sample X_ctx and generate in-filled speech where appropriate based on the conditioning mask.
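The three tweaks above can be sketched as a single conditional flow-matching training step. This is a minimal illustration, not the repo's actual code: the model's `(x_t, x_ctx, t)` signature, the tensor shapes, and the optimal-transport path with `sigma_min` are all assumptions.

```python
import torch

def training_step(model, x1, cond_mask, sigma_min = 1e-5):
    """One conditional flow-matching training step with the three tweaks
    above. x1: un-noised target features (batch, seq, dim); cond_mask:
    bool (batch, seq), True where speech should be in-filled."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                      # noise endpoint of the flow
    t = torch.rand(b, 1, 1, device = x1.device)    # random flow time per example

    # point on the probability path, and the velocity the network must predict
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1 - sigma_min) * x0

    # (1) condition on the UN-NOISED target, zeroed only in the in-fill region
    x_ctx = x1.masked_fill(cond_mask.unsqueeze(-1), 0.)

    # (3) x_t itself is left unmasked, so information sampled at earlier
    # ODE timesteps is preserved at inference time
    pred = model(x_t, x_ctx, t)                    # hypothetical signature

    # (2) regress the flow only over the in-filled region; no MLM objective
    loss = ((pred - u_t) ** 2)[cond_mask].mean()
    return loss
```

At inference, the same `x_ctx` (un-noised context, zeros in the region to generate) is held fixed while the ODE integrates `x_t` from noise.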

With these changes, I was able to train a model that supports speech in-filling with style transfer at inference time! 🚀

lucidrains commented 9 months ago

@lucasnewman man... slow 👏

lucidrains commented 9 months ago

@lucasnewman really appreciate your bringing this repository over the finish line. this will be helpful for countless papers down the line

atmbb commented 6 months ago

@lucasnewman Thanks for the conversation. I'm sorry to bother you, but could you explain code line 993: `cond = default(cond, target)`?

I think `target` is u_t (= x1 - x0), where x1 is the ground-truth mel or acoustic token and x0 is random noise. But x_ctx in the paper should be the masked ground-truth mel. (So `cond` is not None in your training code?)
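For readers following along, the distinction being asked about can be made concrete. `default(a, b)` is the usual fallback helper (return `a` unless it is `None`); the shapes and the zero-masking of x_ctx below are assumptions for illustration, not a claim about what line 993 in the repo actually receives:

```python
import torch

def default(val, fallback):
    # standard fallback helper: use val unless it is None
    return val if val is not None else fallback

x1 = torch.randn(2, 100, 80)            # ground-truth mel (assumed shape)
x0 = torch.randn_like(x1)               # random noise sample

# x_ctx per the paper: the un-noised x1 with the in-fill region zeroed out
cond_mask = torch.zeros(2, 100, dtype = torch.bool)
cond_mask[:, 40:] = True                # region to be in-filled
x_ctx = x1.masked_fill(cond_mask.unsqueeze(-1), 0.)

# the flow target u_t is a different quantity from x_ctx
u_t = x1 - x0

# `default(cond, target)` only substitutes `target` when no explicit
# conditioning was passed in; passing x_ctx keeps them distinct
cond = default(x_ctx, u_t)
```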