Closed xellDart closed 2 months ago
What's the return value from min(l, self.max_length)?
Hey, thanks for opening the issue. Could you send a longer log trace so that I can get a better idea of where it's happening?
Same issue. Logs
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `1`
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2024-08-13 19:03:46.003111: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-13 19:03:46.023949: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-13 19:03:46.030655: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-13 19:03:46.046192: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-13 19:03:47.216399: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Flash attention 2 is not installed
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.16.6
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
08/13/2024 19:04:00 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
08/13/2024 19:04:00 - INFO - __main__ - Training/evaluation parameters ParlerTTSTrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.99,
adam_epsilon=1e-08,
audio_encoder_per_device_batch_size=4,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
compute_clap_similarity_metric=True,
compute_noise_level_metric=True,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=2,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=True,
dtype=float16,
eval_accumulation_steps=None,
eval_dataloader_num_workers=0,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=no,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=18,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=True,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=8e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./output_dir_training/runs/Aug13_19-03-50_0568c1b5750a,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=2,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=constant_with_warmup,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
noise_level_to_compute_clean_wer=25,
num_train_epochs=2.0,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=./output_dir_training/,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=2,
predict_with_generate=True,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=./output_dir_training/,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=steps,
save_total_limit=None,
seed=456,
skip_memory_metrics=True,
sortish_sampler=False,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=50,
weight_decay=0.01,
)
loading configuration file preprocessor_config.json from cache at /root/.cache/huggingface/hub/models--parler-tts--dac_44khZ_8kbps/snapshots/db52bea859d9411e0beb44a3ea923a8731ee4197/preprocessor_config.json
Feature extractor EncodecFeatureExtractor {
"chunk_length_s": null,
"feature_extractor_type": "EncodecFeatureExtractor",
"feature_size": 1,
"overlap": null,
"padding_side": "right",
"padding_value": 0.0,
"return_attention_mask": true,
"sampling_rate": 44100
}
loading file spiece.model from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/spiece.model
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/tokenizer_config.json
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
loading file spiece.model from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/spiece.model
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/tokenizer_config.json
08/13/2024 19:04:02 - WARNING - __main__ - Disabling fast tokenizer warning: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L3231-L3235
Combining datasets...: 0% 0/1 [00:00<?, ?it/s]08/13/2024 19:04:07 - INFO - __main__ - Merging 1rsh/gujarati-f-openslr - train with Cossale/gujarati-f-openslr-tags-2k-tagged - train
08/13/2024 19:04:12 - INFO - __main__ - REMOVE text from dataset 1rsh/gujarati-f-openslr - dataset_dict['split']
Combining datasets...: 100% 1/1 [00:09<00:00, 9.97s/it]
Combining datasets...: 0% 0/1 [00:00<?, ?it/s]08/13/2024 19:04:15 - INFO - __main__ - Merging 1rsh/gujarati-f-openslr - train with Cossale/gujarati-f-openslr-tags-2k-tagged - train
08/13/2024 19:04:18 - INFO - __main__ - REMOVE text from dataset 1rsh/gujarati-f-openslr - dataset_dict['split']
Combining datasets...: 100% 1/1 [00:06<00:00, 6.66s/it]
config.json: 100% 7.42k/7.42k [00:00<00:00, 9.96MB/s]
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/config.json
Model config ParlerTTSConfig {
"_name_or_path": "/fsx/yoach/tmp/artefacts/training-400M-punctuated-v2/",
"architectures": [
"ParlerTTSForConditionalGeneration"
],
"audio_encoder": {
"_name_or_path": "ylacombe/dac_44khZ_8kbps",
"add_cross_attention": false,
"architectures": [
"DACModel"
],
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"codebook_size": 1024,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": null,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"frame_rate": 86,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"latent_dim": 1024,
"length_penalty": 1.0,
"max_length": 20,
"min_length": 0,
"model_bitrate": 8,
"model_type": "dac",
"no_repeat_ngram_size": 0,
"num_beam_groups": 1,
"num_beams": 1,
"num_codebooks": 9,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"prefix": null,
"problem_type": null,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"sampling_rate": 44100,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": "float32",
"torchscript": false,
"typical_p": 1.0,
"use_bfloat16": false
},
"decoder": {
"_name_or_path": "/fsx/yoach/tmp/artefacts/decoder_400M/",
"activation_dropout": 0.0,
"activation_function": "gelu",
"add_cross_attention": true,
"architectures": [
"ParlerTTSForCausalLM"
],
"attention_dropout": 0.0,
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": 1025,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"cross_attention_implementation_strategy": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"dropout": 0.1,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": 1024,
"exponential_decay_length_penalty": null,
"ffn_dim": 4096,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_size": 1024,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_factor": 0.02,
"is_decoder": true,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layerdrop": 0.0,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 4096,
"min_length": 0,
"model_type": "parler_tts_decoder",
"no_repeat_ngram_size": 0,
"num_attention_heads": 16,
"num_beam_groups": 1,
"num_beams": 1,
"num_codebooks": 9,
"num_cross_attention_key_value_heads": 16,
"num_hidden_layers": 24,
"num_key_value_heads": 16,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": 1024,
"prefix": null,
"problem_type": null,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"rope_embeddings": false,
"rope_theta": 10000.0,
"scale_embedding": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": false,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": "float32",
"torchscript": false,
"typical_p": 1.0,
"use_bfloat16": false,
"use_cache": true,
"vocab_size": 1088
},
"decoder_start_token_id": 1025,
"is_encoder_decoder": true,
"model_type": "parler_tts",
"pad_token_id": 1024,
"prompt_cross_attention": false,
"text_encoder": {
"_name_or_path": "google/flan-t5-base",
"add_cross_attention": false,
"architectures": [
"T5ForConditionalGeneration"
],
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"classifier_dropout": 0.0,
"cross_attention_hidden_size": null,
"d_ff": 2048,
"d_kv": 64,
"d_model": 768,
"decoder_start_token_id": 0,
"dense_act_fn": "gelu_new",
"diversity_penalty": 0.0,
"do_sample": false,
"dropout_rate": 0.1,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": 1,
"exponential_decay_length_penalty": null,
"feed_forward_proj": "gated-gelu",
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_factor": 1.0,
"is_decoder": false,
"is_encoder_decoder": true,
"is_gated_act": true,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_epsilon": 1e-06,
"length_penalty": 1.0,
"max_length": 20,
"min_length": 0,
"model_type": "t5",
"n_positions": 512,
"no_repeat_ngram_size": 0,
"num_beam_groups": 1,
"num_beams": 1,
"num_decoder_layers": 12,
"num_heads": 12,
"num_layers": 12,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"output_scores": false,
"pad_token_id": 0,
"prefix": null,
"problem_type": null,
"pruned_heads": {},
"relative_attention_max_distance": 128,
"relative_attention_num_buckets": 32,
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": {
"summarization": {
"early_stopping": true,
"length_penalty": 2.0,
"max_length": 200,
"min_length": 30,
"no_repeat_ngram_size": 3,
"num_beams": 4,
"prefix": "summarize: "
},
"translation_en_to_de": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to German: "
},
"translation_en_to_fr": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to French: "
},
"translation_en_to_ro": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to Romanian: "
}
},
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": false,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": null,
"torchscript": false,
"typical_p": 1.0,
"use_bfloat16": false,
"use_cache": true,
"vocab_size": 32128
},
"torch_dtype": "float32",
"transformers_version": "4.43.3",
"vocab_size": 32128
}
model.safetensors: 100% 2.59G/2.59G [02:08<00:00, 20.1MB/s]
loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/model.safetensors
Generate config GenerationConfig {
"decoder_start_token_id": 1025,
"pad_token_id": 1024
}
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Generate config GenerationConfig {
"bos_token_id": 1025,
"eos_token_id": 1024,
"pad_token_id": 1024
}
All model checkpoint weights were used when initializing ParlerTTSForConditionalGeneration.
All the weights of ParlerTTSForConditionalGeneration were initialized from the model checkpoint at parler-tts/parler_tts_mini_v0.1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ParlerTTSForConditionalGeneration for predictions without further training.
generation_config.json: 100% 267/267 [00:00<00:00, 794kB/s]
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--parler-tts--parler_tts_mini_v0.1/snapshots/e02fd18e77d38b49a85c7a9a85189a64b8472544/generation_config.json
Generate config GenerationConfig {
"bos_token_id": 1025,
"decoder_start_token_id": 1025,
"do_sample": true,
"eos_token_id": 1024,
"guidance_scale": 1.0,
"max_length": 2580,
"min_new_tokens": 50,
"pad_token_id": 1024
}
gathered_tensor tensor([0], device='cuda:0')
/usr/local/lib/python3.10/dist-packages/multiprocess/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
Filter (num_proc=2): 100% 1997/1997 [00:00<00:00, 9067.08 examples/s]
Filter (num_proc=2): 100% 8/8 [00:00<00:00, 41.61 examples/s]
preprocess datasets (num_proc=2): 100% 1997/1997 [00:01<00:00, 1841.35 examples/s]
preprocess datasets (num_proc=2): 100% 8/8 [00:00<00:00, 16.95 examples/s]
08/13/2024 19:06:45 - INFO - __main__ - *** Encode target audio with encodec ***
0% 0/500 [00:00<?, ?it/s]/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
39% 195/500 [02:27<03:50, 1.32it/s]
Traceback (most recent call last):
File "/content/parler-tts/./training/run_parler_tts_training.py", line 1186, in <module>
main()
File "/content/parler-tts/./training/run_parler_tts_training.py", line 494, in main
for i, batch in enumerate(tqdm(data_loader, disable=not accelerator.is_local_main_process)):
File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 464, in __iter__
next_batch = next(dataloader_iter)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
data = self._next_data()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 705, in reraise
raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/content/parler-tts/training/data.py", line 33, in __call__
audios = [audio[: min(l, self.max_length)] for audio, l in zip(audios, len_audio)]
File "/content/parler-tts/training/data.py", line 33, in <listcomp>
audios = [audio[: min(l, self.max_length)] for audio, l in zip(audios, len_audio)]
TypeError: slice indices must be integers or None or have an __index__ method
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /content/parler-tts/wandb/offline-run-20240813_190359-qifq19ti
wandb: Find logs at: ./wandb/offline-run-20240813_190359-qifq19ti/logs
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', './training/run_parler_tts_training.py', '--model_name_or_path', 'parler-tts/parler_tts_mini_v0.1', '--feature_extractor_name', 'parler-tts/dac_44khZ_8kbps', '--description_tokenizer_name', 'parler-tts/parler_tts_mini_v0.1', '--prompt_tokenizer_name', 'parler-tts/parler_tts_mini_v0.1', '--report_to', 'wandb', '--overwrite_output_dir', 'true', '--train_dataset_name', '1rsh/gujarati-f-openslr', '--train_metadata_dataset_name', 'Cossale/gujarati-f-openslr-tags-2k-tagged', '--train_dataset_config_name', 'default', '--train_split_name', 'train', '--eval_dataset_name', '1rsh/gujarati-f-openslr', '--eval_metadata_dataset_name', 'Cossale/gujarati-f-openslr-tags-2k-tagged', '--eval_dataset_config_name', 'default', '--eval_split_name', 'train', '--max_eval_samples', '8', '--per_device_eval_batch_size', '8', '--target_audio_column_name', 'audio', '--description_column_name', 'text_description', '--prompt_column_name', 'text', '--max_duration_in_seconds', '20', '--min_duration_in_seconds', '2.0', '--max_text_length', '400', '--preprocessing_num_workers', '2', '--do_train', 'true', '--num_train_epochs', '2', '--gradient_accumulation_steps', '18', '--gradient_checkpointing', 'true', '--per_device_train_batch_size', '2', '--learning_rate', '0.00008', '--adam_beta1', '0.9', '--adam_beta2', '0.99', '--weight_decay', '0.01', '--lr_scheduler_type', 'constant_with_warmup', '--warmup_steps', '50', '--logging_steps', '2', '--freeze_text_encoder', 'true', '--audio_encoder_per_device_batch_size', '4', '--dtype', 'float16', '--seed', '456', '--output_dir', './output_dir_training/', '--temporary_save_to_disk', './audio_code_tmp/', '--save_to_disk', './tmp_dataset_audio/', '--dataloader_num_workers', '2', '--do_eval', '--predict_with_generate', '--include_inputs_for_metrics', '--group_by_length', 'true']' returned non-zero exit status 1.
Training command (same as the Colab):
!accelerate launch ./training/run_parler_tts_training.py \
--model_name_or_path "parler-tts/parler_tts_mini_v0.1" \
--feature_extractor_name "parler-tts/dac_44khZ_8kbps" \
--description_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \
--prompt_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \
--report_to "wandb" \
--overwrite_output_dir true \
--train_dataset_name "1rsh/gujarati-f-openslr" \
--train_metadata_dataset_name "Cossale/gujarati-f-openslr-tags-2k-tagged" \
--train_dataset_config_name "default" \
--train_split_name "train" \
--eval_dataset_name "1rsh/gujarati-f-openslr" \
--eval_metadata_dataset_name "Cossale/gujarati-f-openslr-tags-2k-tagged" \
--eval_dataset_config_name "default" \
--eval_split_name "train" \
--max_eval_samples 8 \
--per_device_eval_batch_size 8 \
--target_audio_column_name "audio" \
--description_column_name "text_description" \
--prompt_column_name "text" \
--max_duration_in_seconds 20 \
--min_duration_in_seconds 2.0 \
--max_text_length 400 \
--preprocessing_num_workers 2 \
--do_train true \
--num_train_epochs 2 \
--gradient_accumulation_steps 18 \
--gradient_checkpointing true \
--per_device_train_batch_size 2 \
--learning_rate 0.00008 \
--adam_beta1 0.9 \
--adam_beta2 0.99 \
--weight_decay 0.01 \
--lr_scheduler_type "constant_with_warmup" \
--warmup_steps 50 \
--logging_steps 2 \
--freeze_text_encoder true \
--audio_encoder_per_device_batch_size 4 \
--dtype "float16" \
--seed 456 \
--output_dir "./output_dir_training/" \
--temporary_save_to_disk "./audio_code_tmp/" \
--save_to_disk "./tmp_dataset_audio/" \
--dataloader_num_workers 2 \
--do_eval \
--predict_with_generate \
--include_inputs_for_metrics \
--group_by_length true
Let me know if you need any other details. I used the Colab as-is and only changed the dataset; there were no other changes.
Same here, but my error occurs during evaluation:
Train steps ... : 100%|██████████| 2240/2240 [19:42:52<00:00, 29.61s/it]
Configuration saved in /output_dir_training\config.json
Configuration saved in /output_dir_training\generation_config.json
Model weights saved in /output_dir_training\model.safetensors
Evaluating - Inference ...: 100%|██████████| 8/8 [00:03<00:00, 2.89it/s]
File "C:\Users\ces-ai\Documents\parler-tts\training\run_parler_tts_training.py", line 1186, in
And below is the training command:
accelerate launch ./training/run_parler_tts_training.py --model_name_or_path "parler-tts/parler_tts_mini_v0.1" --feature_extractor_name "parler-tts/dac_44khZ_8kbps" --description_tokenizer_name "google/flan-t5-base" --prompt_tokenizer_name "google/flan-t5-base" --report_to "wandb" --overwrite_output_dir true --train_dataset_name "cesinsingapore/singlish" --train_metadata_dataset_name "cesinsingapore/jenny-singlish-1k-tagged" --train_dataset_config_name "default" --train_split_name "train" --eval_dataset_name "cesinsingapore/singlish" --eval_metadata_dataset_name "cesinsingapore/jenny-singlish-1k-tagged" --eval_dataset_config_name "default" --eval_split_name "train" --target_audio_column_name "audio" --description_column_name "text_description" --prompt_column_name "text" --max_duration_in_seconds 30 --min_duration_in_seconds 1.0 --max_text_length 400 --add_audio_samples_to_wandb true --preprocessing_num_workers 8 --do_train true --num_train_epochs 40 --gradient_accumulation_steps 8 --gradient_checkpointing false --per_device_train_batch_size 3 --learning_rate 0.00095 --adam_beta1 0.9 --adam_beta2 0.99 --weight_decay 0.01 --lr_scheduler_type "constant_with_warmup" --warmup_steps 20000 --logging_steps 1000 --freeze_text_encoder true --do_eval true --predict_with_generate true --include_inputs_for_metrics true --evaluation_strategy steps --eval_steps 10000 --save_steps 10000 --per_device_eval_batch_size 12 --audio_encoder_per_device_batch_size 20 --dtype "bfloat16" --seed 456 --output_dir "/output_dir_training" --temporary_save_to_disk "./audio_code_tmp/" --save_to_disk "./tmp_dataset_audio/" --max_eval_samples 96 --dataloader_num_workers 8 --group_by_length true
Hey @xellDart and @Aunali321, I've tried to reproduce your issue with the exact same config, but can't seem to reproduce it. Do you have the value of min(l, self.max_length)?
@cesinsingapore, thanks for the comment, I'll take a look into it later
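One way to answer the question about min(l, self.max_length) is to look at the type of the value as well as the value itself. The sketch below is only an illustration: `describe_slice_bound` is a hypothetical helper, not part of the training script, and it assumes the maximum length was computed as seconds times sampling rate without an integer cast.

```python
def describe_slice_bound(length, max_length):
    """Return the bound a collator would slice with, plus the name of its type."""
    bound = min(length, max_length)
    return bound, type(bound).__name__


# Assumption: 20 seconds at 44100 Hz, with the duration held as a float.
max_length = 20.0 * 44100

# Clip longer than the limit: min() returns the float max_length.
print(describe_slice_bound(1_000_000, max_length))
# Clip shorter than the limit: min() returns the int length.
print(describe_slice_bound(100_000, max_length))
```

If this is what is happening, the slice would only fail on batches that contain a clip longer than the limit. That could explain a crash partway through encoding rather than on the first batch.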
@ylacombe Here is the result: logs. You can find the datasets on my profile: huggingface.co/Cossale
A bit more detail: https://bin.auna.li/doc/kauujdwg
I had the same issue and fixed it by changing these lines https://github.com/huggingface/parler-tts/blob/9f34c1b8730efc9ed0337d96fd89e2ee6f1735b0/training/run_parler_tts_training.py#L347-L348 to
max_target_length = int(data_args.max_duration_in_seconds * sampling_rate)
min_target_length = int(data_args.min_duration_in_seconds * sampling_rate)
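A minimal sketch of why the int() cast helps, assuming the duration argument is parsed as a float. The variable names below are stand-ins for illustration and a plain list replaces the real waveform. Python refuses to slice with a float bound, which matches the TypeError in the traceback above.

```python
sampling_rate = 44100            # value reported by the feature extractor config in the log
max_duration_in_seconds = 20.0   # assumption: the CLI value is parsed as a float

audio = list(range(1_000_000))   # stand-in for a waveform longer than the limit

# Without the cast the bound is a float, and slicing with it raises TypeError.
float_bound = max_duration_in_seconds * sampling_rate
try:
    audio[: min(len(audio), float_bound)]
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)

# With the cast the bound is an int, and the slice works.
int_bound = int(max_duration_in_seconds * sampling_rate)
clipped = audio[: min(len(audio), int_bound)]
print(len(clipped))
```

Casting once where the lengths are computed keeps the collator unchanged, which is the approach taken in the lines above.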
Hey @ajd12342, thanks for letting us know! That let me reproduce the issue and fix it in #102!
@cesinsingapore, your issue should also be fixed.
Don't hesitate to let us know if you face any further issues!
Closing for now, feel free to open another issue if necessary, or re-open this one!
Hi, I get an error when I try to fine-tune on my dataset.
My dataset is: https://huggingface.co/datasets/xellDart13/audio_transcrib_jenna https://huggingface.co/datasets/xellDart13/audio_transcrib_jenna_test-tagged
And I run training with