Documentation for HuBERT is Incomplete

majauhar commented 1 month ago

System Info

transformers version: 4.44.2
Platform: Linux-4.18.0-477.27.1.el8_8.x86_64-x86_64-with-glibc2.28
Python version: 3.11.9
Huggingface_hub version: 0.23.4
Safetensors version: 0.4.3
Accelerate version: 0.32.1
Accelerate config: not found
PyTorch version (GPU?): 2.2.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:
Using GPU in script?:
GPU type: Tesla V100-SXM2-32GB

Who can help?

@ylacombe @muellerzr

There are two issues.

The first is missing information in the documentation about the parameters of the HuBERT model. The __init__ of HubertConfig defaults to pad_token_id=0, bos_token_id=1, and eos_token_id=2, but none of these is described in the docstring.
- This is concerning because if someone follows the ASR tutorial by Von Platen (https://huggingface.co/blog/fine-tune-wav2vec2-english), the token ids for padding, bos, and eos will not correspond to 0, 1, and 2, respectively.
The second issue results from this mismatch between the padding token ids. In the HF Trainer, when compute_metrics is called during evaluation, the whole dataset is bundled together by padding pred_ids with the value 0 up to the length of the longest sample in the dataset. During decoding, if the pad token id is not 0, the transcription carries one extra letter at the end, the one corresponding to token id 0. This produces an incorrect transcription and hence an incorrect CER/WER.
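The effect described above can be illustrated without any model. The following is a minimal sketch in plain Python, not the transformers implementation: the vocabulary, its ids, and the ctc_greedy_decode helper are all hypothetical and only mimic a tutorial-style vocabulary in which [PAD] is appended last, so its id is not 0.

```python
# Hypothetical tutorial-style vocabulary: [PAD] (also the CTC blank) comes last,
# so its id is 5 rather than 0.
vocab = {"a": 0, "b": 1, "c": 2, "|": 3, "[UNK]": 4, "[PAD]": 5}
id_to_token = {i: t for t, i in vocab.items()}


def ctc_greedy_decode(ids, blank_id):
    """Collapse repeated ids, then drop the blank token (greedy CTC decoding)."""
    tokens = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            tokens.append(id_to_token[i])
        prev = i
    return "".join(tokens).replace("|", " ")


pad_id = vocab["[PAD]"]

# Frame-level predictions for the word "cab".
pred_ids = [2, 2, pad_id, 0, pad_id, 1, 1]

# Padding to a longer length with 0 versus with the real pad id.
padded_with_zero = pred_ids + [0, 0, 0]
padded_with_pad = pred_ids + [pad_id] * 3

print(ctc_greedy_decode(padded_with_zero, pad_id))  # caba
print(ctc_greedy_decode(padded_with_pad, pad_id))   # cab
```

With zero padding, the trailing 0s collapse into a single extra "a" (the token with id 0), which matches the one extra letter at the end of the transcription.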

Information

[X] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

This issue can be reproduced by following Von Platen's tutorial on fine-tuning wav2vec 2.0, but using hubert-base instead of wav2vec 2.0. Please let me know if you require any further information.

Expected behavior

The documentation should clearly state the default values of the special token ids, in particular pad_token_id, and the potential downstream issues with any other value. And if the behaviour of compute_metrics is not actually intended, accepting an arbitrary pad_token_id could be considered, to make the code independent of the token id.
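One way the token-id-invariant behaviour could look on the user side is sketched below. This is an assumption-laden illustration, not Trainer code: the helper name frame_ids_with_pad is hypothetical, and it assumes that padded frames arrive in compute_metrics as logit rows filled entirely with one sentinel value (-100.0 here).

```python
def frame_ids_with_pad(logits, pad_token_id, fill_value=-100.0):
    """Greedy argmax per frame; frames made only of fill_value map to pad_token_id.

    `logits` is a list of frames, each a list of per-token scores.
    `fill_value` is the assumed sentinel used when frames were padded.
    """
    ids = []
    for frame in logits:
        if all(score == fill_value for score in frame):
            # A padded frame: use the tokenizer's pad id instead of argmax,
            # which would otherwise return 0 for an all-equal row.
            ids.append(pad_token_id)
        else:
            ids.append(max(range(len(frame)), key=frame.__getitem__))
    return ids


logits = [
    [0.1, 0.9, 0.0],           # real frame, argmax is 1
    [2.0, 0.5, 0.3],           # real frame, argmax is 0
    [-100.0, -100.0, -100.0],  # padded frame
]
print(frame_ids_with_pad(logits, pad_token_id=2))  # [1, 0, 2]
```

The resulting ids can then be decoded as usual, since padded positions now carry the pad id the tokenizer actually uses.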

LysandreJik commented 1 month ago

cc @stevhliu

stevhliu commented 1 month ago

Thanks, would you like to open a PR to add the missing parameters to the docstrings?

majauhar commented 1 month ago

Hey Steven! Sure. I could do that.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

majauhar commented 1 day ago

Hey @stevhliu, this issue remained stale for a long time. I just added the config details about the pad, eos, and bos tokens and have opened a PR. Let me know if it works. Also, I have only changed the docstring in configuration_hubert.py. This will be reflected in the documentation automatically, won't it?

stevhliu commented 1 day ago

Thanks @majauhar! Yes, the docstrings in the documentation are updated from a model's .py file.
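Since the documentation follows the docstring, a quick check for parameters that __init__ accepts but the docstring never mentions can catch gaps like this one. The snippet below is only a sketch: ToyConfig and undocumented_params are hypothetical names, and the check assumes each documented parameter appears in the docstring as "name (" in the Args section.

```python
import inspect


class ToyConfig:
    """Hypothetical config used only for this illustration.

    Args:
        vocab_size (`int`, *optional*, defaults to 32):
            Size of the vocabulary.
        pad_token_id (`int`, *optional*, defaults to 0):
            Id of the padding token.
    """

    def __init__(self, vocab_size=32, pad_token_id=0, bos_token_id=1, eos_token_id=2):
        self.vocab_size = vocab_size
        self.pad_token_id = pad_token_id
        self.bos_token_id = bos_token_id
        self.eos_token_id = eos_token_id


def undocumented_params(cls):
    """Return __init__ parameter names that the class docstring does not describe."""
    doc = inspect.getdoc(cls) or ""
    params = inspect.signature(cls.__init__).parameters
    return [name for name in params if name != "self" and f"{name} (" not in doc]


print(undocumented_params(ToyConfig))  # ['bos_token_id', 'eos_token_id']
```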

huggingface / transformers