Dynamic padding did not work as expected for custom audio dataset

changyeli commented 2 years ago

Describe the bug

Hello everyone, this is a follow-up to this post and the official blog post on fine-tuning a Wav2Vec2 model. It turns out that the padding does not work correctly with my input features.

Steps to reproduce the bug


import soundfile as sf  # assumption: `sf` in the original refers to soundfile
from datasets import load_dataset


def map_speech_to_array(batch):
    """
    Read the wav file of one example into an audio signal.

    :param batch: a single example, with the audio file location in the "audio_loc" column
    :type batch: dict
    """
    speech_array, sampling_rate = sf.read(batch["audio_loc"])
    batch["speech"] = speech_array
    batch["sampling_rate"] = sampling_rate
    return batch

def prepare_dataset(batch):
    """
    Preprocess one example with the customized Wav2Vec2 processor.

    Relies on the module-level ``processor``
    (transformers.models.wav2vec2.processing_wav2vec2.Wav2Vec2Processor).

    :param batch: a single example
    :type batch: dict
    """
    batch["input_values"] = processor(batch["speech"], sampling_rate=batch["sampling_rate"]).input_values[0]
    with processor.as_target_processor():
        labels = processor(batch["text"]).input_ids
        batch["labels"] = labels
    return batch


loaded_dt = load_dataset(
    'csv',
    data_files={"train": "../manifest/train.csv",
                "test": "../manifest/test.csv"})

loaded_dt = loaded_dt.map(
    map_speech_to_array)
loaded_dt = loaded_dt.map(prepare_dataset)
train_dt = loaded_dt["train"]

I tried the following code to investigate the padding:

tokenizer = Wav2Vec2CTCTokenizer(
    "../vocab/vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|",
    padding=True, truncation=True)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=SAMPLE_RATE, padding_value=0.0,
    do_normalize=True, return_attention_mask=False,
    padding=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
train_dt = train_dt.map(map_speech_to_array)
train_dt = train_dt.map(prepare_dataset)
input_features = []
label_features = []
for item in train_dt:
    input_features.append({"input_values": item["input_values"]})
    label_features.append({"input_ids": item["labels"]})
print(len(label_features[0]["input_ids"]))
batch = processor.pad(
    input_features,
    padding=True,
    return_tensors="pt")
print(batch)
with processor.as_target_processor():
    labels_batch = processor.pad(
        label_features,
        padding=True,
        return_tensors="pt")

Expected results

The code above is a line-by-line breakdown of the DataCollatorCTCWithPadding class provided in the official blog post. Padding should succeed, so that fine-tuning of the Wav2Vec2 model can start.
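
For reference, the dynamic padding that the collator is expected to perform can be illustrated without transformers. The following is only a minimal sketch, not the blog's implementation: `pad_batch` and the toy data are hypothetical names introduced here. It pads every sequence to the longest one in the batch and, as the blog's collator does for labels, uses -100 as the label padding value so that the loss ignores those positions.

```python
def pad_batch(sequences, padding_value):
    """Right-pad variable-length sequences to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]


# Hypothetical toy data standing in for "input_values" and label ids
input_values = [[0.1, 0.2, 0.3], [0.4]]
labels = [[5, 6], [7, 8, 9, 10]]

# Inputs and labels are padded separately because their lengths differ
padded_inputs = pad_batch(input_values, padding_value=0.0)
padded_labels = pad_batch(labels, padding_value=-100)

print(padded_inputs)  # [[0.1, 0.2, 0.3], [0.4, 0.0, 0.0]]
print(padded_labels)  # [[5, 6, -100, -100], [7, 8, 9, 10]]
```

In the real collator the two padding calls go through `processor.pad`, and the second one is where the error below is raised.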

Actual results

    labels_batch = processor.pad(
  File "/home/lixx3013/anaconda3/envs/toolkit/lib/python3.8/site-packages/transformers/models/wav2vec2/processing_wav2vec2.py", line 150, in pad
    return self.current_processor.pad(*args, **kwargs)
  File "/home/lixx3013/anaconda3/envs/toolkit/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2776, in pad
    padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
  File "/home/lixx3013/anaconda3/envs/toolkit/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2312, in _get_padding_truncation_strategies
    if padding_strategy != PaddingStrategy.DO_NOT_PAD and (not self.pad_token or self.pad_token_id < 0):
TypeError: '<' not supported between instances of 'NoneType' and 'int'
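
The failing comparison suggests that `pad_token_id` is None. One possible cause, not confirmed in this thread, is that "[PAD]" (and "[UNK]") are missing from vocab.json, so the tokenizer has no id to return for them. A minimal sketch of a check, assuming the vocab file is a flat JSON mapping from token to id; `missing_special_tokens` is a hypothetical helper and the demo vocab is made up:

```python
import json
import tempfile
from pathlib import Path


def missing_special_tokens(vocab_path, special_tokens=("[UNK]", "[PAD]", "|")):
    """Return the special tokens that have no entry in a flat token->id JSON vocab."""
    vocab = json.loads(Path(vocab_path).read_text(encoding="utf-8"))
    return [tok for tok in special_tokens if tok not in vocab]


# Hypothetical vocab that lacks "[PAD]", written to a temporary file for the demo
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "vocab.json"
    demo_path.write_text(
        json.dumps({"a": 0, "b": 1, "|": 2, "[UNK]": 3}), encoding="utf-8"
    )
    missing = missing_special_tokens(demo_path)

print(missing)  # ['[PAD]']
```

Pointing the helper at the real "../vocab/vocab.json" would show whether the tokens passed to Wav2Vec2CTCTokenizer are actually present in the file.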

I got the same TypeError when I used Trainer:

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id)
model.to(DEVICE)
model.freeze_feature_extractor()
data_collator = DataCollatorCTCWithPadding(
    processor=processor, padding=True)
training_args = TrainingArguments(
    output_dir="../fine-tuned/wav2vec",
    group_by_length=True,
    per_device_train_batch_size=32,
    evaluation_strategy="steps",
    num_train_epochs=30,
    fp16=True,
    gradient_checkpointing=True, 
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4,
    weight_decay=0.005,
    warmup_steps=1000,
    save_total_limit=2,
    push_to_hub=False)
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=loaded_dt["train"],
    eval_dataset=loaded_dt["test"],
    tokenizer=processor.feature_extractor)
trainer.train()

This raises:

trainer.train()
  File "/home/lixx3013/anaconda3/envs/toolkit/lib/python3.8/site-packages/transformers/trainer.py", line 1306, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/home/lixx3013/anaconda3/envs/toolkit/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/lixx3013/anaconda3/envs/toolkit/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/lixx3013/anaconda3/envs/toolkit/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "wav2vec_test.py", line 73, in __call__
    labels_batch = self.processor.pad(
  File "/home/lixx3013/anaconda3/envs/toolkit/lib/python3.8/site-packages/transformers/models/wav2vec2/processing_wav2vec2.py", line 150, in pad
    return self.current_processor.pad(*args, **kwargs)
  File "/home/lixx3013/anaconda3/envs/toolkit/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2776, in pad
    padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
  File "/home/lixx3013/anaconda3/envs/toolkit/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2312, in _get_padding_truncation_strategies
    if padding_strategy != PaddingStrategy.DO_NOT_PAD and (not self.pad_token or self.pad_token_id < 0):
TypeError: '<' not supported between instances of 'NoneType' and 'int'

Environment info

datasets version: 1.18.3
Platform: Linux-4.15.0-166-generic-x86_64-with-glibc2.17
Python version: 3.8.12
PyArrow version: 6.0.1

Any suggestions? Thanks in advance.

albertvillanova commented 2 years ago

Hi @changyeli, thanks for reporting.

According to the TypeError message you get, self.pad_token_id is None, and the following line

  File "/home/lixx3013/anaconda3/envs/toolkit/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2312, in _get_padding_truncation_strategies
    if padding_strategy != PaddingStrategy.DO_NOT_PAD and (not self.pad_token or self.pad_token_id < 0):

tries to compare it with the integer 0.
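
The mechanism can be reproduced in isolation. This is only a sketch: `StubTokenizer` is a hypothetical stand-in, not the transformers class, and it mimics nothing but the condition shown in the traceback.

```python
class StubTokenizer:
    """Hypothetical stand-in that mimics only the check from the traceback."""

    def __init__(self, pad_token, pad_token_id):
        self.pad_token = pad_token
        self.pad_token_id = pad_token_id

    def can_pad(self):
        # Same shape as the condition in _get_padding_truncation_strategies
        return not (not self.pad_token or self.pad_token_id < 0)


def try_check(tokenizer):
    """Run the check and return either its result or the error message."""
    try:
        return tokenizer.can_pad()
    except TypeError as err:
        return f"TypeError: {err}"


ok = try_check(StubTokenizer("[PAD]", 31))        # pad token with a valid id
broken = try_check(StubTokenizer("[PAD]", None))  # pad token set, but id is None

print(ok)      # True
print(broken)  # TypeError: '<' not supported between instances of 'NoneType' and 'int'
```

Because `pad_token` is a non-empty string, the first half of the `or` is False and Python goes on to evaluate `None < 0`, which raises the TypeError.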

I think this is an issue with transformers instead of datasets. I'm transferring this issue to the transformers team.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
