lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in PyTorch

HubertWithKmeans should have an option to specify the feature extraction layer #136

Closed: smdrnks closed this issue 1 year ago

smdrnks commented 1 year ago

Hi! Thank you for the great work on this repo. I have a question regarding the HuBERT feature extraction. In your HubertWithKmeans class you extract features from the HuBERT transformer layers using:

embed = self.model(wav_input, features_only = True)

If I understand correctly, this extracts features from the last transformer layer (the 12th). However, the kmeans quantizer from fairseq that you use in your examples ("hubert_base_ls960_L9_km500.bin") appears to have been trained on features from the 9th layer. Digging into the fairseq code, I found that the output layer for feature extraction can be specified with:

embed = self.model(wav_input, features_only = True, output_layer = 9)

If I compute the code units with the output layer specified, I get different codes than with your current code. I think there should be an option to specify the output layer. Or am I missing something? Thank you for the clarification.
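
For context, a minimal sketch of how the layer-9 code units can be computed directly with fairseq (the HuBERT checkpoint path and the dummy waveform are placeholders; the kmeans file is the joblib-pickled sklearn model named above):

```python
import joblib
import torch
from fairseq import checkpoint_utils

# load the pretrained HuBERT model (checkpoint path is a placeholder)
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(['hubert_base_ls960.pt'])
hubert = models[0].eval()

# kmeans model trained on layer-9 features, as referenced above
kmeans = joblib.load('hubert_base_ls960_L9_km500.bin')

wav_input = torch.randn(1, 16000)  # placeholder: 1 second of 16 kHz audio

with torch.no_grad():
    # mask = False disables the masking used during pretraining;
    # output_layer = 9 matches the layer the kmeans model was trained on
    embed = hubert(wav_input, features_only = True, mask = False, output_layer = 9)['x']

# assign each frame to its nearest centroid to get the code units
codes = kmeans.predict(embed.squeeze(0).numpy())
```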

lucidrains commented 1 year ago

@smdrnks hey, no problem

yea, that sounds like a good idea, and i've made the output layer configurable at init
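
For anyone finding this later, the gist of the fix might look something like the sketch below (class and argument names are illustrative, not the exact code in the repo): the extraction layer is taken as a constructor argument and passed through to the fairseq model on every forward pass.

```python
import torch
from torch import nn

class HubertWithKmeansSketch(nn.Module):
    # illustrative wrapper: the feature extraction layer is fixed at init
    def __init__(self, model, kmeans, output_layer = 9):
        super().__init__()
        self.model = model
        self.kmeans = kmeans
        self.output_layer = output_layer

    @torch.no_grad()
    def forward(self, wav_input):
        # pull features from the configured transformer layer
        embed = self.model(
            wav_input,
            features_only = True,
            mask = False,
            output_layer = self.output_layer
        )['x']
        # quantize each frame to its nearest kmeans centroid
        codes = self.kmeans.predict(embed.squeeze(0).cpu().numpy())
        return torch.from_numpy(codes).long()
```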

smdrnks commented 1 year ago

Awesome, thank you!