Open majauhar opened 1 month ago
cc @stevhliu
Thanks, would you like to open a PR to add the missing parameters to the docstrings?
Hey Steven! Sure. I could do that.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey @stevhliu, this issue remained stale for a long time. I just added the config details about the pad, eos, and bos tokens and have opened a PR. Let me know if it works. Also, I have only changed the docstring in `configuration_hubert.py`. This would be reflected in the documentation automatically, wouldn't it?
Thanks @majauhar! Yes, the docstrings in the documentation are updated from a model's `.py` file.
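For illustration, the docstring entries discussed above might look roughly like the sketch below. `ToyHubertConfig` and `undocumented_params` are hypothetical names invented for this example, not the real `HubertConfig` or any `transformers` API. Only the three default values come from the issue, and the wording of each entry is an assumption.

```python
import inspect


class ToyHubertConfig:
    """Hypothetical stand-in for a model configuration class (not the real HubertConfig).

    Args:
        pad_token_id (`int`, *optional*, defaults to 0):
            Id of the padding token.
        bos_token_id (`int`, *optional*, defaults to 1):
            Id of the beginning-of-sequence token.
        eos_token_id (`int`, *optional*, defaults to 2):
            Id of the end-of-sequence token.
    """

    def __init__(self, pad_token_id=0, bos_token_id=1, eos_token_id=2):
        self.pad_token_id = pad_token_id
        self.bos_token_id = bos_token_id
        self.eos_token_id = eos_token_id


def undocumented_params(cls):
    """Return the __init__ parameters that the class docstring never mentions."""
    doc = cls.__doc__ or ""
    params = inspect.signature(cls.__init__).parameters
    return [name for name in params if name != "self" and name not in doc]


print(undocumented_params(ToyHubertConfig))
```

A check like `undocumented_params` is one simple way to spot parameters that exist in `__init__` but are absent from the docstring, which is the gap this issue reports.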
System Info
`transformers` version: 4.44.2
Who can help?
@ylacombe @muellerzr
There are two issues.
The first is missing documentation for some parameters of the HuBERT model. The `__init__` function of `HubertConfig` has `pad_token_id=0, bos_token_id=1, eos_token_id=2`, but this information is missing from the docstring.
The second results from a mismatch between padding token ids. In the HF `Trainer`, when `compute_metrics` is called during evaluation, the whole dataset is bundled together by padding `pred_ids` with a value of 0 to the length of the longest sample in the dataset. However, during decoding, if the pad `token_id` doesn't correspond to 0, the decoded text carries one extra letter at the end of the transcription, namely the token with id 0. This produces an incorrect transcription and hence an incorrect CER/WER.
Information
Tasks
`examples` folder (such as GLUE/SQuAD, ...)
Reproduction
This issue can be reproduced by following Von Platen's tutorial on fine-tuning `wav2vec 2.0`, but using `hubert-base` instead of `wav2vec 2.0`. Please let me know if you require any further information.
Expected behavior
The documentation should clearly state the default values of the special `token_ids`, in particular the `pad_token`, and the potential downstream issues with any other value. And if the behaviour of `compute_metrics` is not actually intended, supporting an arbitrary value of `pad_token_id` could be considered to make the code token_id invariant.