lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in PyTorch

HubertWithKmeans should have an option to specify the feature extraction layer #136

Closed: smdrnks closed this issue 1 year ago

smdrnks commented 1 year ago

Hi! Thank you for the great work on this repo. I have a question regarding the HuBERT feature extraction. In your HubertWithKmeans class you extract features from the HuBERT transformer layers using:

embed = self.model(wav_input, features_only = True)

If I understand correctly, this extracts features from the last transformer layer (the 12th). However, the kmeans quantizer from fairseq that you use in your examples ("hubert_base_ls960_L9_km500.bin") appears to have been trained on features from the 9th layer. Digging into the fairseq code, I found that the output layer for feature extraction can be specified with:

embed = self.model(wav_input, features_only = True, output_layer = 9)

If I compute the code units with the output layer specified, I get different codes than with your current code. I think there should be an option to specify the output layer. Or am I missing something? Thank you for the clarification.
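
For context, a minimal sketch of how the layer-9 code units can be computed directly with fairseq (the HuBERT checkpoint path and the dummy waveform are placeholders; the kmeans file is the joblib-pickled sklearn model named above):

```python
import joblib
import torch
from fairseq import checkpoint_utils

# load the pretrained HuBERT model (checkpoint path is a placeholder)
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(['hubert_base_ls960.pt'])
hubert = models[0].eval()

# kmeans model trained on layer-9 features, as referenced above
kmeans = joblib.load('hubert_base_ls960_L9_km500.bin')

wav_input = torch.randn(1, 16000)  # placeholder: 1 second of 16 kHz audio

with torch.no_grad():
    # mask = False disables the masking used during pretraining;
    # output_layer = 9 matches the layer the kmeans model was trained on
    embed = hubert(wav_input, features_only = True, mask = False, output_layer = 9)['x']

# assign each frame to its nearest centroid to get the code units
codes = kmeans.predict(embed.squeeze(0).numpy())
```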

lucidrains commented 1 year ago

@smdrnks hey, no problem

yea, that sounds like a good idea, and i've made the output layer configurable at init
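
For anyone finding this later, the gist of the fix might look something like the sketch below (class and argument names are illustrative, not the exact code in the repo): the extraction layer is taken as a constructor argument and passed through to the fairseq model on every forward pass.

```python
import torch
from torch import nn

class HubertWithKmeansSketch(nn.Module):
    # illustrative wrapper: the feature extraction layer is fixed at init
    def __init__(self, model, kmeans, output_layer = 9):
        super().__init__()
        self.model = model
        self.kmeans = kmeans
        self.output_layer = output_layer

    @torch.no_grad()
    def forward(self, wav_input):
        # pull features from the configured transformer layer
        embed = self.model(
            wav_input,
            features_only = True,
            mask = False,
            output_layer = self.output_layer
        )['x']
        # quantize each frame to its nearest kmeans centroid
        codes = self.kmeans.predict(embed.squeeze(0).cpu().numpy())
        return torch.from_numpy(codes).long()
```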

smdrnks commented 1 year ago

Awesome, thank you!