lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

generation form of the inference #201

Closed Hit1ron closed 1 year ago

Hit1ron commented 1 year ago

During generation with the coarse transformer and the fine transformer, the acoustic tokens of prime_wav are not used, even though the paper's continuation mode does use them.
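A minimal sketch of what continuation mode implies: the coarse acoustic tokens encoded from prime_wav should seed the generation loop as a prefix, instead of generation starting from an empty acoustic sequence. The names here (`seed_with_prime`, `prime_coarse`) are illustrative, not the library's API, and the sampling step is a random placeholder.

```python
import torch

def seed_with_prime(prime_coarse: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # prime_coarse: (batch, prime_len) coarse token ids encoded from prime_wav
    generated = prime_coarse.clone()
    for _ in range(max_new_tokens):
        # placeholder for transformer sampling conditioned on `generated`
        next_token = torch.randint(0, 1024, (generated.shape[0], 1))
        generated = torch.cat((generated, next_token), dim=-1)
    return generated

prime = torch.randint(0, 1024, (1, 8))
out = seed_with_prime(prime, 4)
# the prime tokens survive as the prefix of the continuation
assert out.shape == (1, 12)
assert torch.equal(out[:, :8], prime)
```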

lucidrains commented 1 year ago

yea, you are right! this one is a bit tricky if the sampled audio is of different lengths

let me think about it before executing; should be able to knock out this issue by week's end

lucidrains commented 1 year ago

@Hit1ron do you want to see if https://github.com/lucidrains/audiolm-pytorch/commit/896b240757a68b107964e93a6c8b7943ec819ad3 fixes the issue? i'll address variable-length prompts at a future date

Hit1ron commented 1 year ago

@lucidrains yes, the issue is fixed. One suggestion: the init_coarse_time_step and init_fine_time_step parameters should be applied before the rearrange of coarse_token_ids and fine_token_ids. That way, setting max_time_steps is easier, since there is no need to account for the number of coarse and fine quantizers.
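A toy illustration of why the ordering matters (shapes and names are assumptions for illustration, not the repo's exact code): coarse tokens have shape (batch, time, num_coarse_quantizers) and are flattened for decoding. If the initial step is counted after flattening, the caller has to scale it by the quantizer count; counted before, it is in plain time steps.

```python
import torch

b, t, q = 1, 10, 3  # batch, time steps, coarse quantizers (illustrative)
coarse = torch.randint(0, 1024, (b, t, q))

init_time_step = 4                    # counted before the rearrange: real time steps
flat = coarse.reshape(b, t * q)       # flattened (batch, time * quantizers) layout
init_token_step = init_time_step * q  # equivalent index after the rearrange

# both index the same remaining tokens, but only one requires knowing q
assert torch.equal(
    flat[:, init_token_step:],
    coarse[:, init_time_step:].reshape(b, -1),
)
```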

lucidrains commented 1 year ago

@Hit1ron oh yea, that is problematic

i've corrected the initial fine acoustic token timestep, and just opted to set the initial coarse acoustic token timestep to 0 and let the network decide when to eos
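A minimal sketch of "start the coarse timestep at 0 and let the network decide when to eos": generation runs up to a maximum length but halts early when an end-of-sequence token is sampled. The sampler and eos_id here are hypothetical stand-ins.

```python
def generate_until_eos(sample_fn, eos_id: int, max_steps: int) -> list:
    # sample one token at a time; stop early if the model emits eos_id
    tokens = []
    for _ in range(max_steps):
        tok = sample_fn(tokens)
        if tok == eos_id:
            break
        tokens.append(tok)
    return tokens

# toy sampler: emits 1, 2, 3 and then eos (id 0), so generation stops early
seq = iter([1, 2, 3, 0, 5])
out = generate_until_eos(lambda _: next(seq), eos_id=0, max_steps=10)
assert out == [1, 2, 3]
```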