FurkanGozukara opened 1 year ago
Here is my config and the full training log, but I have no idea why it is creating checkpoints so quickly.
```
Environment name is set as "DLAS" as per environment.yaml
anaconda3/miniconda3 detected in C:\Users\King\miniconda3
Starting conda environment "DLAS" from C:\Users\King\miniconda3
Latest git hash: 5ab4d9e
Disabled distributed training.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form:
https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA SETUP: Loading binary C:\Users\King\miniconda3\envs\DLAS\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
23-05-01 03:03:10.550 - INFO:
  name: test1
  model: extensibletrainer
  scale: 1
  gpu_ids: [0]
  start_step: -1
  checkpointing_enabled: True
  fp16: False
  use_8bit: True
  wandb: False
  use_tb_logger: True
  datasets:[
    train:[
      name: test1
      n_workers: 8
      batch_size: 138
      mode: paired_voice_audio
      path: F:/ozen-toolkit/output/test\train.txt
      fetcher_mode: ['lj']
      phase: train
      max_wav_length: 255995
      max_text_length: 200
      sample_rate: 22050
      load_conditioning: True
      num_conditioning_candidates: 2
      conditioning_length: 44000
      use_bpe_tokenizer: True
      load_aligned_codes: False
      data_type: img
    ]
    val:[
      name: test1
      n_workers: 1
      batch_size: 139
      mode: paired_voice_audio
      path: F:/ozen-toolkit/output/test\valid.txt
      fetcher_mode: ['lj']
      phase: val
      max_wav_length: 255995
      max_text_length: 200
      sample_rate: 22050
      load_conditioning: True
      num_conditioning_candidates: 2
      conditioning_length: 44000
      use_bpe_tokenizer: True
      load_aligned_codes: False
      data_type: img
    ]
  ]
  steps:[
    gpt_train:[
      training: gpt
      loss_log_buffer: 500
      optimizer: adamw
      optimizer_params:[
        lr: 1e-05
        triton: False
        weight_decay: 0.01
        beta1: 0.9
        beta2: 0.96
      ]
      clip_grad_eps: 4
      injectors:[
        paired_to_mel:[
          type: torch_mel_spectrogram
          mel_norm_file: ../experiments/clips_mel_norms.pth
          in: wav
          out: paired_mel
        ]
        paired_cond_to_mel:[
          type: for_each
          subtype: torch_mel_spectrogram
          mel_norm_file: ../experiments/clips_mel_norms.pth
          in: conditioning
          out: paired_conditioning_mel
        ]
        to_codes:[
          type: discrete_token
          in: paired_mel
          out: paired_mel_codes
          dvae_config: ../experiments/train_diffusion_vocoder_22k_level.yml
        ]
        paired_fwd_text:[
          type: generator
          generator: gpt
          in: ['paired_conditioning_mel', 'padded_text', 'text_lengths', 'paired_mel_codes', 'wav_lengths']
          out: ['loss_text_ce', 'loss_mel_ce', 'logits']
        ]
      ]
      losses:[
        text_ce:[
          type: direct
          weight: 0.01
          key: loss_text_ce
        ]
        mel_ce:[
          type: direct
          weight: 1
          key: loss_mel_ce
        ]
      ]
    ]
  ]
  networks:[
    gpt:[
      type: generator
      which_model_G: unified_voice2
      kwargs:[
        layers: 30
        model_dim: 1024
        heads: 16
        max_text_tokens: 402
        max_mel_tokens: 604
        max_conditioning_inputs: 2
        mel_length_compression: 1024
        number_text_tokens: 256
        number_mel_codes: 8194
        start_mel_token: 8192
        stop_mel_token: 8193
        start_text_token: 255
        train_solo_embeddings: False
        use_mel_codes_as_input: True
        checkpointing: True
      ]
    ]
  ]
  path:[
    pretrain_model_gpt: ../experiments/autoregressive.pth
    strict_load: True
    root: F:\DL-Art-School
    experiments_root: F:\DL-Art-School\experiments\test1
    models: F:\DL-Art-School\experiments\test1\models
    training_state: F:\DL-Art-School\experiments\test1\training_state
    log: F:\DL-Art-School\experiments\test1
    val_images: F:\DL-Art-School\experiments\test1\val_images
  ]
  train:[
    niter: 50000
    warmup_iter: -1
    mega_batch_factor: 4
    val_freq: 500
    default_lr_scheme: MultiStepLR
    gen_lr_steps: [400, 800, 1120, 1440]
    lr_gamma: 0.5
    ema_enabled: False
    manual_seed: 1337
  ]
  eval:[
    output_state: gen
    injectors:[
      gen_inj_eval:[
        type: generator
        generator: generator
        in: hq
        out: ['gen', 'codebook_commitment_loss']
      ]
    ]
  ]
  logger:[
    print_freq: 100
    save_checkpoint_freq: 500
    visuals: ['gen', 'mel']
    visual_debug_rate: 500
    is_mel_spectrogram: True
    disable_state_saving: False
  ]
  upgrades:[
    number_of_checkpoints_to_save: 0
    number_of_states_to_save: 0
  ]
  is_train: True
  dist: False

23-05-01 03:03:10.734 - INFO: Random seed: 1337
23-05-01 03:03:16.202 - INFO: Number of training data elements: 552, iters: 4
23-05-01 03:03:16.203 - INFO: Total epochs needed: 12500 for iters 50,000
23-05-01 03:03:16.206 - INFO: Number of val images in [test1]: 139
C:\Users\King\miniconda3\envs\DLAS\lib\site-packages\transformers\configuration_utils.py:379: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
  warnings.warn(
Loading from ../experiments/dvae.pth
23-05-01 03:03:24.183 - INFO: Network gpt structure: DataParallel, with parameters: 421,526,786
23-05-01 03:03:24.183 - INFO: UnifiedVoice(
  (conditioning_encoder): ConditioningEncoder(
    (init): Conv1d(80, 1024, kernel_size=(1,), stride=(1,))
    (attn): Sequential(
      (0): AttentionBlock(
        (norm): GroupNorm32(32, 1024, eps=1e-05, affine=True)
        (qkv): Conv1d(1024, 3072, kernel_size=(1,), stride=(1,))
        (attention): QKVAttentionLegacy()
        (x_proj): Identity()
        (proj_out): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
      )
      ... (1)-(5): five more AttentionBlock modules identical to (0), omitted ...
    )
  )
  (text_embedding): Embedding(256, 1024)
  (mel_embedding): Embedding(8194, 1024)
  (gpt): GPT2Model(
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      ... (1)-(29): twenty-nine more GPT2Block modules identical to (0), omitted ...
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (mel_pos_embedding): LearnedPositionEmbeddings(
    (emb): Embedding(608, 1024)
  )
  (text_pos_embedding): LearnedPositionEmbeddings(
    (emb): Embedding(404, 1024)
  )
  (final_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (text_head): Linear(in_features=1024, out_features=256, bias=True)
  (mel_head): Linear(in_features=1024, out_features=8194, bias=True)
)
23-05-01 03:03:24.188 - INFO: Loading model for [../experiments/autoregressive.pth]
23-05-01 03:03:25.660 - INFO: Start training from epoch: 0, iter: -1
  0%|          | 0/4 [00:00<?, ?it/s]
C:\Users\King\miniconda3\envs\DLAS\lib\site-packages\torch\optim\lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
23-05-01 03:04:11.737 - INFO: [epoch: 0, iter: 0, lr:(1.000e-05,1.000e-05,)] step: 0.0000e+00 samples: 1.3800e+02 megasamples: 1.3800e-04 iteration_rate: 8.9322e-02 loss_text_ce: 3.8859e+00 loss_mel_ce: 3.5641e+00 loss_gpt_total: 3.6030e+00 grad_scaler_scale: 1.0000e+00 learning_rate_gpt_0: 1.0000e-05 learning_rate_gpt_1: 1.0000e-05 total_samples_loaded: 1.3800e+02 percent_skipped_samples: 0.0000e+00 percent_conditioning_is_self: 1.0000e+00 gpt_conditioning_encoder: 7.4988e+00 gpt_gpt: 4.9755e+00 gpt_heads: 3.7014e+00
23-05-01 03:04:11.737 - INFO: Saving models and training states.
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:18<00:00, 19.70s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:15<00:00, 18.81s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:13<00:00, 18.49s/it]
```
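For reference, here is the scheduling arithmetic I get from this config and log. It is a rough Python sketch, not the trainer's actual code: the assumption that DLAS computes steps per epoch by integer division and saves whenever `step % save_checkpoint_freq == 0` is mine, I have not traced the ExtensibleTrainer logic.

```python
# Schedule implied by the config/log above (values copied from the log).
dataset_size = 552           # "Number of training data elements: 552"
batch_size = 138             # datasets.train.batch_size
niter = 50_000               # train.niter
save_checkpoint_freq = 500   # logger.save_checkpoint_freq
seconds_per_iter = 19        # ~18.5-19.7 s/it from the progress bars

iters_per_epoch = dataset_size // batch_size                      # 4, matches the log
epochs_needed = niter // iters_per_epoch                          # 12,500, matches the log
epochs_per_checkpoint = save_checkpoint_freq // iters_per_epoch   # 125 (under my assumption)
hours_per_checkpoint = save_checkpoint_freq * seconds_per_iter / 3600

print(f"iterations per epoch:  {iters_per_epoch}")
print(f"epochs for {niter} iters: {epochs_needed}")
print(f"epochs per checkpoint: {epochs_per_checkpoint}")
print(f"hours per checkpoint: ~{hours_per_checkpoint:.1f}")
```

So with this batch size an epoch is only 4 iterations (about 75 seconds at the ~19 s/it shown in the progress bars), which is why the per-epoch bars finish so quickly; if saving really is gated on save_checkpoint_freq = 500, regular checkpoints should only land about every 125 epochs, and the "Saving models and training states" line at iteration 0 is presumably just the step-0 case of that check.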