huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0

Error on fine tuning #100

Closed xellDart closed 2 months ago

xellDart commented 3 months ago

Hi, I get an error when I try to fine-tune on my dataset.

My datasets are: https://huggingface.co/datasets/xellDart13/audio_transcrib_jenna and https://huggingface.co/datasets/xellDart13/audio_transcrib_jenna_test-tagged

audios = [audio[: min(l, self.max_length)] for audio, l in zip(audios, len_audio)]
TypeError: slice indices must be integers or None or have an __index__ method

I run training with:

!accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "parler-tts/parler-tts-mini-v1" \
    --feature_extractor_name "parler-tts/dac_44khZ_8kbps" \
    --description_tokenizer_name "parler-tts/parler-tts-mini-v1" \
    --prompt_tokenizer_name "parler-tts/parler-tts-mini-v1" \
    --report_to "tensorboard" \
    --overwrite_output_dir true \
    --train_dataset_name "xellDart13/audio_transcrib_jenna" \
    --train_metadata_dataset_name "xellDart13/audio_transcrib_jenna_test-tagged" \
    --train_dataset_config_name "default" \
    --train_split_name "train" \
    --eval_dataset_name "xellDart13/audio_transcrib_jenna" \
    --eval_metadata_dataset_name "xellDart13/audio_transcrib_jenna_test-tagged" \
    --eval_dataset_config_name "default" \
    --eval_split_name "train" \
    --max_eval_samples 8 \
    --per_device_eval_batch_size 6 \
    --target_audio_column_name "audio" \
    --description_column_name "text_description" \
    --prompt_column_name "text" \
    --max_duration_in_seconds 25 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 400 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 3 \
    --eval_steps 10 \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 6 \
    --learning_rate 0.00008 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --lr_scheduler_type "constant_with_warmup" \
    --warmup_steps 50 \
    --logging_steps 2 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 4 \
    --dtype "float16" \
    --seed 456 \
    --output_dir "./output_dir_training/" \
    --temporary_save_to_disk "./audio_code_tmp/" \
    --save_to_disk "./tmp_dataset_audio/" \
    --dataloader_num_workers 2 \
    --predict_with_generate \
    --do_eval \
    --include_inputs_for_metrics \
    --group_by_length true

agronholm commented 3 months ago

What's the return value from min(l, self.max_length)?
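
For context, a minimal sketch of why `min(l, self.max_length)` can return a non-integer: Python's `min` returns one of its arguments unchanged, so if `self.max_length` is a float, any clip longer than the limit gets a float slice bound. The values below are hypothetical: the 25-second limit and 44.1 kHz rate mirror the command above, treating `max_length` as a float is an assumption, and a plain list stands in for the audio array.

```python
# Assumption: max_length is computed as seconds * sampling_rate with a float
# number of seconds, so it ends up being a float.
sampling_rate = 44100
max_length = 25.0 * sampling_rate  # 1102500.0

audio = list(range(10))  # stand-in for an audio array


def describe_slice_bound(num_samples, max_length):
    """Return the slice bound and whether it is usable as a slice index."""
    bound = min(num_samples, max_length)
    try:
        audio[:bound]
        return bound, True
    except TypeError:
        # "slice indices must be integers or None or have an __index__ method"
        return bound, False


short_clip = describe_slice_bound(441000, max_length)   # 10 s clip, under the limit
long_clip = describe_slice_bound(1323000, max_length)   # 30 s clip, over the limit
print(short_clip)
print(long_clip)
```

Under this assumption the error only appears once a batch contains a clip longer than the limit, since shorter clips keep their integer length as the bound.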

ylacombe commented 3 months ago

Hey, thanks for opening the issue. Could you send a longer log trace so that I can get a better idea of where it's happening?

Aunali321 commented 3 months ago

Same issue. Logs:

The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `1`
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2024-08-13 19:03:46.003111: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-13 19:03:46.023949: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-13 19:03:46.030655: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-13 19:03:46.046192: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-13 19:03:47.216399: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Flash attention 2 is not installed
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.16.6
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
08/13/2024 19:04:00 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
08/13/2024 19:04:00 - INFO - __main__ - Training/evaluation parameters ParlerTTSTrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.99,
adam_epsilon=1e-08,
audio_encoder_per_device_batch_size=4,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
compute_clap_similarity_metric=True,
compute_noise_level_metric=True,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=2,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=True,
dtype=float16,
eval_accumulation_steps=None,
eval_dataloader_num_workers=0,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=no,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=18,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=True,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=8e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./output_dir_training/runs/Aug13_19-03-50_0568c1b5750a,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=2,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=constant_with_warmup,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
noise_level_to_compute_clean_wer=25,
num_train_epochs=2.0,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=./output_dir_training/,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=2,
predict_with_generate=True,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=./output_dir_training/,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=steps,
save_total_limit=None,
seed=456,
skip_memory_metrics=True,
sortish_sampler=False,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=50,
weight_decay=0.01,
)
loading configuration file preprocessor_config.json from cache at /root/.cache/huggingface/hub/models--parler-tts--dac_44khZ_8kbps/snapshots/db52bea859d9411e0beb44a3ea923a8731ee4197/preprocessor_config.json
Feature extractor EncodecFeatureExtractor {
  "chunk_length_s": null,
  "feature_extractor_type": "EncodecFeatureExtractor",
  "feature_size": 1,
  "overlap": null,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 44100
}

loading file spiece.model from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/spiece.model
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/tokenizer_config.json
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
loading file spiece.model from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/spiece.model
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/tokenizer_config.json
08/13/2024 19:04:02 - WARNING - __main__ - Disabling fast tokenizer warning: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L3231-L3235
Combining datasets...:   0% 0/1 [00:00<?, ?it/s]08/13/2024 19:04:07 - INFO - __main__ - Merging 1rsh/gujarati-f-openslr - train with Cossale/gujarati-f-openslr-tags-2k-tagged - train
08/13/2024 19:04:12 - INFO - __main__ - REMOVE text from dataset 1rsh/gujarati-f-openslr - dataset_dict['split']
Combining datasets...: 100% 1/1 [00:09<00:00,  9.97s/it]
Combining datasets...:   0% 0/1 [00:00<?, ?it/s]08/13/2024 19:04:15 - INFO - __main__ - Merging 1rsh/gujarati-f-openslr - train with Cossale/gujarati-f-openslr-tags-2k-tagged - train
08/13/2024 19:04:18 - INFO - __main__ - REMOVE text from dataset 1rsh/gujarati-f-openslr - dataset_dict['split']
Combining datasets...: 100% 1/1 [00:06<00:00,  6.66s/it]
config.json: 100% 7.42k/7.42k [00:00<00:00, 9.96MB/s]
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/config.json
Model config ParlerTTSConfig {
  "_name_or_path": "/fsx/yoach/tmp/artefacts/training-400M-punctuated-v2/",
  "architectures": [
    "ParlerTTSForConditionalGeneration"
  ],
  "audio_encoder": {
    "_name_or_path": "ylacombe/dac_44khZ_8kbps",
    "add_cross_attention": false,
    "architectures": [
      "DACModel"
    ],
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "codebook_size": 1024,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "frame_rate": 86,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "latent_dim": 1024,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_bitrate": 8,
    "model_type": "dac",
    "no_repeat_ngram_size": 0,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_codebooks": 9,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sampling_rate": 44100,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "float32",
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false
  },
  "decoder": {
    "_name_or_path": "/fsx/yoach/tmp/artefacts/decoder_400M/",
    "activation_dropout": 0.0,
    "activation_function": "gelu",
    "add_cross_attention": true,
    "architectures": [
      "ParlerTTSForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 1025,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "cross_attention_implementation_strategy": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.1,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 1024,
    "exponential_decay_length_penalty": null,
    "ffn_dim": 4096,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_factor": 0.02,
    "is_decoder": true,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layerdrop": 0.0,
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 4096,
    "min_length": 0,
    "model_type": "parler_tts_decoder",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_codebooks": 9,
    "num_cross_attention_key_value_heads": 16,
    "num_hidden_layers": 24,
    "num_key_value_heads": 16,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 1024,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rope_embeddings": false,
    "rope_theta": 10000.0,
    "scale_embedding": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "float32",
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false,
    "use_cache": true,
    "vocab_size": 1088
  },
  "decoder_start_token_id": 1025,
  "is_encoder_decoder": true,
  "model_type": "parler_tts",
  "pad_token_id": 1024,
  "prompt_cross_attention": false,
  "text_encoder": {
    "_name_or_path": "google/flan-t5-base",
    "add_cross_attention": false,
    "architectures": [
      "T5ForConditionalGeneration"
    ],
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": 0.0,
    "cross_attention_hidden_size": null,
    "d_ff": 2048,
    "d_kv": 64,
    "d_model": 768,
    "decoder_start_token_id": 0,
    "dense_act_fn": "gelu_new",
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout_rate": 0.1,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 1,
    "exponential_decay_length_penalty": null,
    "feed_forward_proj": "gated-gelu",
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_factor": 1.0,
    "is_decoder": false,
    "is_encoder_decoder": true,
    "is_gated_act": true,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_epsilon": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "t5",
    "n_positions": 512,
    "no_repeat_ngram_size": 0,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_decoder_layers": 12,
    "num_heads": 12,
    "num_layers": 12,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_past": true,
    "output_scores": false,
    "pad_token_id": 0,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "relative_attention_max_distance": 128,
    "relative_attention_num_buckets": 32,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": {
      "summarization": {
        "early_stopping": true,
        "length_penalty": 2.0,
        "max_length": 200,
        "min_length": 30,
        "no_repeat_ngram_size": 3,
        "num_beams": 4,
        "prefix": "summarize: "
      },
      "translation_en_to_de": {
        "early_stopping": true,
        "max_length": 300,
        "num_beams": 4,
        "prefix": "translate English to German: "
      },
      "translation_en_to_fr": {
        "early_stopping": true,
        "max_length": 300,
        "num_beams": 4,
        "prefix": "translate English to French: "
      },
      "translation_en_to_ro": {
        "early_stopping": true,
        "max_length": 300,
        "num_beams": 4,
        "prefix": "translate English to Romanian: "
      }
    },
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false,
    "use_cache": true,
    "vocab_size": 32128
  },
  "torch_dtype": "float32",
  "transformers_version": "4.43.3",
  "vocab_size": 32128
}

model.safetensors: 100% 2.59G/2.59G [02:08<00:00, 20.1MB/s]
loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/model.safetensors
Generate config GenerationConfig {
  "decoder_start_token_id": 1025,
  "pad_token_id": 1024
}

/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Generate config GenerationConfig {
  "bos_token_id": 1025,
  "eos_token_id": 1024,
  "pad_token_id": 1024
}

All model checkpoint weights were used when initializing ParlerTTSForConditionalGeneration.

All the weights of ParlerTTSForConditionalGeneration were initialized from the model checkpoint at parler-tts/parler_tts_mini_v0.1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ParlerTTSForConditionalGeneration for predictions without further training.
generation_config.json: 100% 267/267 [00:00<00:00, 794kB/s]
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1025,
  "decoder_start_token_id": 1025,
  "do_sample": true,
  "eos_token_id": 1024,
  "guidance_scale": 1.0,
  "max_length": 2580,
  "min_new_tokens": 50,
  "pad_token_id": 1024
}

gathered_tensor tensor([0], device='cuda:0')
/usr/local/lib/python3.10/dist-packages/multiprocess/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
Filter (num_proc=2): 100% 1997/1997 [00:00<00:00, 9067.08 examples/s]
Filter (num_proc=2): 100% 8/8 [00:00<00:00, 41.61 examples/s]
preprocess datasets (num_proc=2): 100% 1997/1997 [00:01<00:00, 1841.35 examples/s]
preprocess datasets (num_proc=2): 100% 8/8 [00:00<00:00, 16.95 examples/s]
08/13/2024 19:06:45 - INFO - __main__ - *** Encode target audio with encodec ***
  0% 0/500 [00:00<?, ?it/s]/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
 39% 195/500 [02:27<03:50,  1.32it/s]
Traceback (most recent call last):
  File "/content/parler-tts/./training/run_parler_tts_training.py", line 1186, in <module>
    main()
  File "/content/parler-tts/./training/run_parler_tts_training.py", line 494, in main
    for i, batch in enumerate(tqdm(data_loader, disable=not accelerator.is_local_main_process)):
  File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 464, in __iter__
    next_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 705, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/content/parler-tts/training/data.py", line 33, in __call__
    audios = [audio[: min(l, self.max_length)] for audio, l in zip(audios, len_audio)]
  File "/content/parler-tts/training/data.py", line 33, in <listcomp>
    audios = [audio[: min(l, self.max_length)] for audio, l in zip(audios, len_audio)]
TypeError: slice indices must be integers or None or have an __index__ method

wandb: You can sync this run to the cloud by running:
wandb: wandb sync /content/parler-tts/wandb/offline-run-20240813_190359-qifq19ti
wandb: Find logs at: ./wandb/offline-run-20240813_190359-qifq19ti/logs
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', './training/run_parler_tts_training.py', '--model_name_or_path', 'parler-tts/parler_tts_mini_v0.1', '--feature_extractor_name', 'parler-tts/dac_44khZ_8kbps', '--description_tokenizer_name', 'parler-tts/parler_tts_mini_v0.1', '--prompt_tokenizer_name', 'parler-tts/parler_tts_mini_v0.1', '--report_to', 'wandb', '--overwrite_output_dir', 'true', '--train_dataset_name', '1rsh/gujarati-f-openslr', '--train_metadata_dataset_name', 'Cossale/gujarati-f-openslr-tags-2k-tagged', '--train_dataset_config_name', 'default', '--train_split_name', 'train', '--eval_dataset_name', '1rsh/gujarati-f-openslr', '--eval_metadata_dataset_name', 'Cossale/gujarati-f-openslr-tags-2k-tagged', '--eval_dataset_config_name', 'default', '--eval_split_name', 'train', '--max_eval_samples', '8', '--per_device_eval_batch_size', '8', '--target_audio_column_name', 'audio', '--description_column_name', 'text_description', '--prompt_column_name', 'text', '--max_duration_in_seconds', '20', '--min_duration_in_seconds', '2.0', '--max_text_length', '400', '--preprocessing_num_workers', '2', '--do_train', 'true', '--num_train_epochs', '2', '--gradient_accumulation_steps', '18', '--gradient_checkpointing', 'true', '--per_device_train_batch_size', '2', '--learning_rate', '0.00008', '--adam_beta1', '0.9', '--adam_beta2', '0.99', '--weight_decay', '0.01', '--lr_scheduler_type', 'constant_with_warmup', '--warmup_steps', '50', '--logging_steps', '2', '--freeze_text_encoder', 'true', '--audio_encoder_per_device_batch_size', '4', '--dtype', 'float16', '--seed', '456', '--output_dir', './output_dir_training/', '--temporary_save_to_disk', './audio_code_tmp/', '--save_to_disk', './tmp_dataset_audio/', '--dataloader_num_workers', '2', '--do_eval', '--predict_with_generate', '--include_inputs_for_metrics', '--group_by_length', 'true']' returned non-zero exit status 1.

Training command (same as the Colab):

!accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "parler-tts/parler_tts_mini_v0.1" \
    --feature_extractor_name "parler-tts/dac_44khZ_8kbps" \
    --description_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \
    --prompt_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \
    --report_to "wandb" \
    --overwrite_output_dir true \
    --train_dataset_name "1rsh/gujarati-f-openslr" \
    --train_metadata_dataset_name "Cossale/gujarati-f-openslr-tags-2k-tagged" \
    --train_dataset_config_name "default" \
    --train_split_name "train" \
    --eval_dataset_name "1rsh/gujarati-f-openslr" \
    --eval_metadata_dataset_name "Cossale/gujarati-f-openslr-tags-2k-tagged" \
    --eval_dataset_config_name "default" \
    --eval_split_name "train" \
    --max_eval_samples 8 \
    --per_device_eval_batch_size 8 \
    --target_audio_column_name "audio" \
    --description_column_name "text_description" \
    --prompt_column_name "text" \
    --max_duration_in_seconds 20 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 400 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 2 \
    --gradient_accumulation_steps 18 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 2 \
    --learning_rate 0.00008 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --lr_scheduler_type "constant_with_warmup" \
    --warmup_steps 50 \
    --logging_steps 2 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 4 \
    --dtype "float16" \
    --seed 456 \
    --output_dir "./output_dir_training/" \
    --temporary_save_to_disk "./audio_code_tmp/" \
    --save_to_disk "./tmp_dataset_audio/" \
    --dataloader_num_workers 2 \
    --do_eval \
    --predict_with_generate \
    --include_inputs_for_metrics \
    --group_by_length true

Let me know if you need any other details. I just used the Colab and only changed the dataset; there were no other changes.

cesinsingapore commented 3 months ago

Same here, but my error occurs during evaluation:

Train steps ... : 100%|██████████| 2240/2240 [19:42:52<00:00, 29.61s/it]
Configuration saved in /output_dir_training\config.json
Configuration saved in /output_dir_training\generation_config.json
Model weights saved in /output_dir_training\model.safetensors

Evaluating - Inference ...: 100%|██████████| 8/8 [00:03<00:00, 2.89it/s]
██████████| 8/8 [01:19<00:00, 8.68s/it]

  File "C:\Users\ces-ai\Documents\parler-tts\training\run_parler_tts_training.py", line 1186, in <module>
    main()
  File "C:\Users\ces-ai\Documents\parler-tts\training\run_parler_tts_training.py", line 1106, in main
    key: torch.mean(torch.cat([d[key] for d in eval_metrics])).to("cpu") for key in eval_metrics[0]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
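
A minimal sketch of how this `RuntimeError` can arise, assuming each entry of `eval_metrics` holds zero-dimensional (scalar) tensors. The metric name and values are hypothetical. `torch.cat` needs tensors with at least one dimension, whereas `torch.stack` creates a new dimension and so accepts scalars; this illustrates the error and is not necessarily the change made in the repository.

```python
import torch

# Hypothetical per-batch metrics: each value is a 0-d (scalar) tensor.
eval_metrics = [{"loss": torch.tensor(0.5)}, {"loss": torch.tensor(1.5)}]

try:
    torch.cat([d["loss"] for d in eval_metrics])
    cat_failed = False
except RuntimeError:
    # zero-dimensional tensor (at position 0) cannot be concatenated
    cat_failed = True

# torch.stack adds a new leading dimension, so 0-d inputs are fine.
mean_loss = torch.mean(torch.stack([d["loss"] for d in eval_metrics])).to("cpu")
print(cat_failed, mean_loss.item())
```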

Below is the training command:

accelerate launch ./training/run_parler_tts_training.py --model_name_or_path "parler-tts/parler_tts_mini_v0.1" --feature_extractor_name "parler-tts/dac_44khZ_8kbps" --description_tokenizer_name "google/flan-t5-base" --prompt_tokenizer_name "google/flan-t5-base" --report_to "wandb" --overwrite_output_dir true --train_dataset_name "cesinsingapore/singlish" --train_metadata_dataset_name "cesinsingapore/jenny-singlish-1k-tagged" --train_dataset_config_name "default" --train_split_name "train" --eval_dataset_name "cesinsingapore/singlish" --eval_metadata_dataset_name "cesinsingapore/jenny-singlish-1k-tagged" --eval_dataset_config_name "default" --eval_split_name "train" --target_audio_column_name "audio" --description_column_name "text_description" --prompt_column_name "text" --max_duration_in_seconds 30 --min_duration_in_seconds 1.0 --max_text_length 400 --add_audio_samples_to_wandb true --preprocessing_num_workers 8 --do_train true --num_train_epochs 40 --gradient_accumulation_steps 8 --gradient_checkpointing false --per_device_train_batch_size 3 --learning_rate 0.00095 --adam_beta1 0.9 --adam_beta2 0.99 --weight_decay 0.01 --lr_scheduler_type "constant_with_warmup" --warmup_steps 20000 --logging_steps 1000 --freeze_text_encoder true --do_eval true --predict_with_generate true --include_inputs_for_metrics true --evaluation_strategy steps --eval_steps 10000 --save_steps 10000 --per_device_eval_batch_size 12 --audio_encoder_per_device_batch_size 20 --dtype "bfloat16" --seed 456 --output_dir "/output_dir_training" --temporary_save_to_disk "./audio_code_tmp/" --save_to_disk "./tmp_dataset_audio/" --max_eval_samples 96 --dataloader_num_workers 8 --group_by_length true

ylacombe commented 3 months ago

Hey @xellDart and @Aunali321, I've tried to reproduce your issue with the exact same config, but can't seem to reproduce it. Do you have the value of min(l, self.max_length)?

@cesinsingapore, thanks for the comment. I'll look into it later.

Aunali321 commented 3 months ago

@ylacombe Here is the result: logs. You can find the datasets on my profile: huggingface.co/Cossale

Aunali321 commented 3 months ago

A bit more detail: https://bin.auna.li/doc/kauujdwg

ajd12342 commented 3 months ago

I had the same issue. I fixed it by changing these lines https://github.com/huggingface/parler-tts/blob/9f34c1b8730efc9ed0337d96fd89e2ee6f1735b0/training/run_parler_tts_training.py#L347-L348 to:

    max_target_length = int(data_args.max_duration_in_seconds * sampling_rate)
    min_target_length = int(data_args.min_duration_in_seconds * sampling_rate)
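
As a rough illustration of why the `int(...)` cast helps, here is a toy collator that truncates clips the same way as the failing line. `TruncatingCollator` is a hypothetical stand-in, not the class from `training/data.py`, and the sampling rate and duration are toy values chosen for a quick check.

```python
from dataclasses import dataclass


@dataclass
class TruncatingCollator:
    """Hypothetical stand-in that truncates each clip to max_length samples."""

    max_length: int

    def __call__(self, audios):
        len_audio = [len(audio) for audio in audios]
        return [audio[: min(l, self.max_length)] for audio, l in zip(audios, len_audio)]


sampling_rate = 10             # toy value
max_duration_in_seconds = 0.5  # assumed to be a float, as with a float-typed argument

# Casting to int keeps the slice bound an integer even for clips over the limit.
max_target_length = int(max_duration_in_seconds * sampling_rate)

collator = TruncatingCollator(max_length=max_target_length)
batch = collator([list(range(3)), list(range(8))])
print([len(a) for a in batch])
```

Note that `int()` truncates toward zero, so a fractional sample count is rounded down.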

ylacombe commented 3 months ago

Hey @ajd12342, thanks for letting us know! It allowed me to work out how to reproduce the issue and to fix it in #102!

@cesinsingapore, your issue should also be fixed.

Don't hesitate to let us know if you face any further issues!

ylacombe commented 2 months ago

Closing for now, feel free to open another issue if necessary, or re-open this one!