jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License

Some notes on HF integration #7

Open NielsRogge opened 2 months ago

NielsRogge commented 2 months ago

Hi,

Niels here from the open-source team at Hugging Face. I discovered your work through the paper page: https://huggingface.co/papers/2408.16532 (featured in daily papers). Congrats on this work!

Great to see you're making the models available on the 🤗 hub. Some small suggestions on how to improve this:

  • We usually recommend pushing each checkpoint to a separate model repository, so in this case we could have "novateur/wavtokenizer-small-320" and "novateur/wavtokenizer-small-600" repos. This ensures download stats work for your models (assuming you also push a config.yaml to each model repo); see the first sketch below.
  • Great to see you've added the "text-to-speech" tag; just wondering whether "audio-feature-extraction" would be more appropriate? cc @Vaibhavs10
  • I see you've implemented a from_pretrained method yourself, great! Note that we now also have the PyTorchModelHubMixin class, which implements this logic (along with more, like push_to_hub and safetensors serialization); the second sketch below shows the pattern. Happy to send a PR.
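A minimal sketch of the per-checkpoint layout, assuming the huggingface_hub client; the repo ids match the suggestion above, while the local folder paths are placeholders for illustration:

```python
# Sketch only: one repo per checkpoint, each with its own config.yaml.
from huggingface_hub import HfApi

api = HfApi()

# Placeholder local paths; repo ids follow the naming suggested above.
checkpoints = {
    "novateur/wavtokenizer-small-320": "checkpoints/small_320",
    "novateur/wavtokenizer-small-600": "checkpoints/small_600",
}

for repo_id, local_dir in checkpoints.items():
    # One repo per checkpoint so download stats are tracked separately.
    api.create_repo(repo_id, repo_type="model", exist_ok=True)
    # Upload the weights together with that checkpoint's config.yaml.
    api.upload_folder(repo_id=repo_id, folder_path=local_dir)
```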

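And a minimal sketch of the mixin pattern, using a toy module rather than WavTokenizer's actual architecture (the class name, layer sizes, and repo id are all placeholders):

```python
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

# Toy stand-in for the real model; inheriting from PyTorchModelHubMixin
# adds from_pretrained, save_pretrained, and push_to_hub for free.
class ToyCodec(nn.Module, PyTorchModelHubMixin):
    def __init__(self, dim: int = 512, codebook_size: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):
        return self.encoder(x)

model = ToyCodec(dim=512, codebook_size=4096)
# push_to_hub serializes the weights (safetensors) plus the init kwargs:
# model.push_to_hub("novateur/wavtokenizer-small-320")
# reloaded = ToyCodec.from_pretrained("novateur/wavtokenizer-small-320")
```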
Let me know if you need any help regarding this!

Cheers,

Niels ML Engineer @ HF 🤗

jishengpeng commented 2 months ago

We appreciate the attention from the official Hugging Face team ❤!

  1. Thank you for the suggestion! After reviewing other model repositories, we found that checkpoints are indeed typically maintained in separate repositories. When we open-source WavTokenizer-Small and WavTokenizer-Large in the future, we plan to create a separate repository for each checkpoint and a collection to aggregate them; see the sketch after this list. For now, the WavTokenizer-Small repository contains only the corresponding configuration files, and we are still uncertain whether hosting multiple checkpoints and configurations in one repository affects the download statistics.
  2. You are correct that WavTokenizer is a representation model rather than a specialized text-to-speech model. However, we could not find an "audio-feature-extraction" tag among the audio tasks on the Hugging Face homepage, so we have temporarily included all relevant tags.
  3. Thank you for the reminder! We will prioritize building on the Hugging Face framework in our future work.
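For reference, a minimal sketch of the collection workflow mentioned in point 1, using huggingface_hub's collection helpers; the collection title and repo ids are placeholders:

```python
# Sketch only: create a collection and add the per-checkpoint repos to it.
from huggingface_hub import add_collection_item, create_collection

collection = create_collection(title="WavTokenizer")
for repo_id in (
    "novateur/wavtokenizer-small-320",
    "novateur/wavtokenizer-small-600",
):
    add_collection_item(collection.slug, item_id=repo_id, item_type="model")
```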

Best regards!