Space token problem - GitHub Issues

Hello, first of all thanks for the lovely code!

I'm trying to fine-tune XLSR-53 on some French data. The code is taken straight from the examples directory:

from huggingsound import TrainingArguments, ModelArguments, SpeechRecognitionModel, TokenSet

model = SpeechRecognitionModel("facebook/wav2vec2-large-xlsr-53", device=device)
output_dir = "wav2vec_finetuned_fr"

alphabet = ["a", "b", "c", "d", "e", "f", "g", "h", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'", "-", "é", "à", "è", "ù", "ç", "â", "ê", "î", "ô", "û", "ë", "ï", "ü", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
token_set = TokenSet(alphabet)

training_args = TrainingArguments(
    learning_rate=3e-4,
    max_steps=1000,
    eval_steps=200,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
)
model_args = ModelArguments(
    activation_dropout=0.1,
    hidden_dropout=0.1,
)

# and finally, fine-tune your model
model.finetune(
    output_dir,
    train_data=train_data,
    eval_data=eval_data, # the eval_data is optional
    token_set=token_set,
    training_args=training_args,
    model_args=model_args,
)

However, I get a training error:

  File "/usr/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/usr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1702, in forward
    raise ValueError(f"Label values must be <= vocab_size: {self.config.vocab_size}")
ValueError: Label values must be <= vocab_size: 56

The spaces are the problem: id 56 corresponds to whitespace. Here is an example sentence, tokenized:

[15, 8, 56, 16, 8, 21, 2, 21, 8, 3, 12, 56, 15, 30, 56, 0, 22, 31, 19, 18, 15, 12, 2, 8, 56, 0, 56, 16, 0, 21, 20, 24, 32, 56, 48, 56, 1, 24, 23, 22, 56, 2, 18, 17, 23, 21, 8, 56, 21, 0, 2, 12, 17, 10, 56]
le mercredi l' as-police a marqué 3 buts contre racing
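To make the failure concrete, here is a minimal sketch that scans the label ids above for values the model would reject. The helper name `out_of_range_ids` is hypothetical and not part of huggingsound or transformers. It assumes valid ids run from 0 to `vocab_size - 1`, which is what the traceback suggests, since id 56 is refused with a vocab size of 56 even though the message says `<=`.

```python
def out_of_range_ids(label_ids, vocab_size):
    """Return the distinct ids that are >= vocab_size (hypothetical helper)."""
    return sorted({i for i in label_ids if i >= vocab_size})


# Label ids copied from the tokenized sentence in this post.
labels = [15, 8, 56, 16, 8, 21, 2, 21, 8, 3, 12, 56, 15, 30, 56, 0, 22, 31,
          19, 18, 15, 12, 2, 8, 56, 0, 56, 16, 0, 21, 20, 24, 32, 56, 48, 56,
          1, 24, 23, 22, 56, 2, 18, 17, 23, 21, 8, 56, 21, 0, 2, 12, 17, 10, 56]

bad = out_of_range_ids(labels, vocab_size=56)
print(bad)  # [56]
```

Only the whitespace id falls outside the range; every other label fits within the 56-entry vocabulary.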

AFAICS from the code, special tokens and spaces are added by the TokenSet code. What am I doing wrong? :blush:
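One thing worth checking is the alphabet itself: the list in this post repeats "e", "f", "g", "h". The sketch below counts the duplicates and then builds a stand-in token-to-id mapping. The mapping is an assumption, not huggingsound's confirmed implementation: it supposes five special tokens (with hypothetical names) are appended after the alphabet and that ids come from a plain dict. Under that assumption the numbers reproduce what the post shows (a=0, e=8, l=15, whitespace=56, vocab size 56), so duplicated letters are a plausible cause.

```python
from collections import Counter

# Alphabet copied verbatim from the post, including the repeated letters.
alphabet = ["a", "b", "c", "d", "e", "f", "g", "h", "e", "f", "g", "h", "i",
            "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v",
            "w", "x", "y", "z", "'", "-", "é", "à", "è", "ù", "ç", "â", "ê",
            "î", "ô", "û", "ë", "ï", "ü", "0", "1", "2", "3", "4", "5", "6",
            "7", "8", "9"]

duplicates = sorted(t for t, n in Counter(alphabet).items() if n > 1)
print(len(alphabet), len(set(alphabet)), duplicates)  # 55 51 ['e', 'f', 'g', 'h']

# ASSUMPTION: special tokens are appended after the alphabet and ids are
# assigned by position in a dict. The token names here are illustrative.
special_tokens = ["<pad>", "|", "<unk>", "<s>", "</s>"]
id_by_token = {tok: i for i, tok in enumerate(alphabet + special_tokens)}

print(len(id_by_token))   # 56: duplicate keys collapse, so the size shrinks
print(id_by_token["|"])   # 56: the word delimiter id is now out of range

# Dropping duplicates while keeping order makes ids and size agree again.
unique_alphabet = list(dict.fromkeys(alphabet))
fixed = {tok: i for i, tok in enumerate(unique_alphabet + special_tokens)}
print(len(fixed), max(fixed.values()))  # 56 55
```

If the real mapping is built differently the exact ids will differ, but passing a de-duplicated alphabet to `TokenSet` is a cheap thing to try first.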

jonatasgrosman/huggingsound, issue #86