speechbrain / benchmarks

This repository contains the SpeechBrain Benchmarks
Apache License 2.0

DASB - Discrete Audio and Speech Benchmark #8

Closed — poonehmousavi closed this issue 5 months ago

mravanelli commented 10 months ago

Hi @poonehmousavi, thank you for this PR! Here are my comments:

  1. I propose to move the code in this PR to the main SpeechBrain repo (unstable v0.6 branch). In the benchmark, we should only import the pretrained models that provide the discrete representations and write the code for training the probing heads of the downstream tasks. In the main repo, we can include the interfaces to these models and all the steps needed for clustering them (in the case of models like HuBERT, WavLM, wav2vec 2.0, etc.).
  2. Specifically, the HuBERT discrete interface should be placed in the main repo's unstable-v0.6 branch under https://github.com/speechbrain/speechbrain/tree/unstable-v0.6/speechbrain/lobes/models/huggingface_transformers.
  3. I think the HuBERT discrete interface has to return both the indices of the representations and their corresponding centroids. When we use the discrete representations for our downstream tasks, it seems important to use the centroid vectors corresponding to each index, right? (See the sketch after this list.)
  4. The DiscreteHuBERT docstring currently resembles that of HuBERT. It's crucial to update it to reflect DiscreteHuBERT, including describing important arguments like kmeans and providing a proper example.
  5. I would suggest putting the model on HuggingFace, making it easy for us to import. We can mark it in the README as a work in progress.
  6. If we decide to move this to the main repo, "LibriSpeech_prepare.py" should be a symbolic link.
  7. I suggest changing the name from "cluster" to "clustering" for clarity.
  8. Regarding train_splits: ["train-clean-100"]: is there a specific reason for limiting training to "train-clean-100"? Have you seen this limitation mentioned in any research papers?
  9. The value of "n_clusters: 10" might not be sufficient. Have you observed this number being used in the literature? Some research papers use larger vocabulary sizes for discrete representations, such as 1024.
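
To make point 3 concrete, here is a minimal sketch, assuming a k-means-based quantizer; it is not the actual SpeechBrain interface. The function name `quantize_features` and the use of scikit-learn's `KMeans` as a stand-in for the clustering module are hypothetical and for illustration only:

```python
# Sketch (assumption, not the actual SpeechBrain API): a quantization step that
# returns both the discrete indices and the centroid vectors they map to, so
# downstream probing heads can consume real vectors rather than bare integers.
import torch
from sklearn.cluster import KMeans


def quantize_features(feats: torch.Tensor, kmeans: KMeans):
    """Map continuous SSL features to (indices, centroid vectors).

    feats  : (time, feat_dim) tensor, e.g. HuBERT hidden states.
    kmeans : a fitted sklearn KMeans model, standing in here for whatever
             clustering module ships with the pretrained model.
    """
    indices = kmeans.predict(feats.numpy())                         # (time,)
    centroids = torch.from_numpy(kmeans.cluster_centers_[indices])  # (time, feat_dim)
    return torch.from_numpy(indices), centroids


if __name__ == "__main__":
    # Toy usage: fit k-means on random "features" and quantize them.
    feats = torch.randn(50, 768)
    km = KMeans(n_clusters=10, n_init=10).fit(feats.numpy())
    idx, cents = quantize_features(feats, km)
    print(idx.shape, cents.shape)  # torch.Size([50]) torch.Size([50, 768])
```

Returning the centroids alongside the indices would let the downstream recipes choose between embedding-lookup-style inputs and raw index sequences without re-loading the clustering model.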