huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
134.67k stars 26.93k forks

Documentation for HuBERT is Incomplete #33536

Open majauhar opened 1 month ago

majauhar commented 1 month ago

System Info

Who can help?

@ylacombe @muellerzr

There are two issues.

Information

Tasks

Reproduction

This issue can be reproduced by following Patrick von Platen's tutorial on fine-tuning wav2vec 2.0, but using hubert-base instead of wav2vec 2.0. Please let me know if you need any further information.
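For context, the failure mode can be sketched without running the tutorial at all. This is a minimal, library-free illustration; the default of 0 and the toy vocab below are assumptions for the sketch, not the actual HubertConfig or tutorial values:

```python
# Illustrative sketch (assumed values): a model config ships a default
# pad_token_id, while a tokenizer built from a custom vocab may assign
# "[PAD]" a different id. If the two are never reconciled, any code that
# keys on the config's pad id silently disagrees with the tokenizer.
config_pad_token_id = 0                       # assumed config default
vocab = {"a": 0, "b": 1, "c": 2, "[PAD]": 3}  # toy custom vocab
tokenizer_pad_token_id = vocab["[PAD]"]

mismatch = config_pad_token_id != tokenizer_pad_token_id
```

Here `mismatch` is true, which is exactly the situation a reader of the current docs has no warning about.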

Expected behavior

The documentation should clearly state the default values of the special token ids, in particular pad_token_id, and the downstream issues that any other value can cause. And if the behaviour of compute_metrics is not actually intended, the code could be made invariant to the token id by accepting an arbitrary pad_token_id value.

LysandreJik commented 1 month ago

cc @stevhliu

stevhliu commented 1 month ago

Thanks, would you like to open a PR to add the missing parameters to the docstrings?

majauhar commented 1 month ago

Hey Steven! Sure. I could do that.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

majauhar commented 1 day ago

Hey @stevhliu, this issue remained stale for a long time. I've added the config details about the pad, eos, and bos tokens and opened a PR. Let me know if it works. Also, I only made the changes in the docstring of configuration_hubert.py. These will be reflected in the documentation automatically, won't they?

stevhliu commented 1 day ago

Thanks @majauhar! Yes, the docstrings in the documentation are updated from a model's .py file.