Hi! Thank you for the great work on this repo. I have a question regarding the Hubert feature extraction. In your `HubertWithKmeans` class you extract the features from the Hubert transformer layers using:
```python
embed = self.model(wav_input, features_only = True)
```
If I understand correctly, this extracts features from the last transformer layer (the 12th). However, the kmeans quantizer from fairseq that you use in your examples (`hubert_base_ls960_L9_km500.bin`) seems to have been trained on features from the 9th layer. Digging into the fairseq code, I found that the output layer for feature extraction can be specified with:
```python
embed = self.model(wav_input, features_only = True, output_layer=9)
```
If I compute the code units with the output layer specified, I get different codes than with your current code. I think there should be an option to specify the output layer. Or am I missing something? Thank you for the clarification.
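For context on why the layer matters: the kmeans step itself is just nearest-centroid assignment over whatever features you feed it, so extracting from a different layer than the one the centroids were trained on directly changes the resulting codes. A minimal self-contained sketch of that assignment step (random data standing in for HuBERT features and the trained centroids; the `quantize` helper is hypothetical, not from the repo):

```python
import numpy as np

def quantize(features, centroids):
    # features: [T, D] frame-level embeddings, centroids: [K, D] kmeans centers
    # squared Euclidean distance from every frame to every centroid, then argmin
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1)  # [T] code units

rng = np.random.default_rng(0)
# e.g. 500 clusters over 768-dim features, as in hubert_base_ls960_L9_km500
centroids = rng.normal(size=(500, 768))
# three frames, each a small perturbation of a known centroid
features = centroids[[3, 41, 7]] + 0.01 * rng.normal(size=(3, 768))
codes = quantize(features, centroids)
# each frame is assigned back to its source centroid
```

With features from the wrong layer, the distances in `dists` are computed against centroids fit on a different distribution, which is why the codes come out different.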