mit-han-lab / hardware-aware-transformers

[ACL'20] HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
https://hat.mit.edu

Error in step 2.3 (Evolutionary search with latency constraint) #6

Closed ihish52 closed 3 years ago

ihish52 commented 4 years ago

Hi,

When following the steps one by one, I get an error when running the evolutionary search: "_th_addmm_out not supported on CPUType for Half"

Do you know what could be causing this and how to fix it? I am currently running this on my i5 CPU. Does the config file need any changes to avoid using the GPU when only the CPU is being tested?

Help with this would be highly appreciated. Thanks.

Hanrui-Wang commented 4 years ago

Hi ihish52,

Thanks for your question! Could you provide more details about your command and which line raised the error?

ihish52 commented 4 years ago

Thanks for the quick reply. Attached is the config file I am using to perform the evolutionary search for my i5 CPU (barely changed from your example). No NVIDIA drivers are installed, so I do not think the GPU is involved.

wmt14ende_i5.zip

Below is the output when I run the command for evo_search.py:

python3 evo_search.py --configs=configs/wmt14.en-de/supertransformer/space0.yml --evo-configs=configs/wmt14.en-de/evo_search/wmt14ende_i5.yml

Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformersuper_wmt_en_de', attention_dropout=0.1, beam=5, best_checkpoint_metric='loss', bucket_cap_mb=25, ckpt_path='./latency_dataset/predictors/wmt14ende_cpu_i5.pt', clip_norm=0.0, configs='configs/wmt14.en-de/supertransformer/space0.yml', cpu=False, criterion='label_smoothed_cross_entropy', crossover_size=50, curriculum=0, data='data/binary/wmt16_en_de', dataset_impl=None, ddp_backend='no_c10d', decoder_arbitrary_ende_attn_all_subtransformer=None, decoder_arbitrary_ende_attn_choice=[-1, 1, 2], decoder_attention_heads=8, decoder_embed_choice=[640, 512], decoder_embed_dim=640, decoder_embed_dim_subtransformer=None, decoder_embed_path=None, decoder_ende_attention_heads_all_subtransformer=None, decoder_ende_attention_heads_choice=[8, 4], decoder_ffn_embed_dim=3072, decoder_ffn_embed_dim_all_subtransformer=None, decoder_ffn_embed_dim_choice=[3072, 2048, 1024], decoder_input_dim=640, decoder_layer_num_choice=[6, 5, 4, 3, 2, 1], decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=640, decoder_self_attention_heads_all_subtransformer=None, decoder_self_attention_heads_choice=[8, 4], device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, diverse_beam_groups=-1, diverse_beam_strength=0.5, dropout=0.3, encoder_attention_heads=8, encoder_embed_choice=[640, 512], encoder_embed_dim=640, encoder_embed_dim_subtransformer=None, encoder_embed_path=None, encoder_ffn_embed_dim=3072, encoder_ffn_embed_dim_all_subtransformer=None, encoder_ffn_embed_dim_choice=[3072, 2048, 1024], encoder_layer_num_choice=[6], encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_self_attention_heads_all_subtransformer=None, encoder_self_attention_heads_choice=[8, 4], evo_configs='configs/wmt14.en-de/evo_search/wmt14ende_i5.yml', evo_iter=30, feature_norm=[640.0, 6.0, 2048.0, 6.0, 640.0, 6.0, 2048.0, 6.0, 6.0, 2.0], find_unused_parameters=False, fix_batches_to_gpus=False, fp16=True, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, get_attn=False, keep_interval_updates=-1, keep_last_epochs=20, label_smoothing=0.1, lat_norm=700.0, latency_constraint=6000.0, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1, log_format=None, log_interval=1000, lr=[1e-07], lr_period_updates=-1, lr_scheduler='cosine', lr_shrink=1.0, match_source_len=False, max_epoch=0, max_len_a=0, max_len_b=200, max_lr=0.001, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4096, max_tokens_valid=4096, max_update=40000, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, min_lr=-1, model_overrides='{}', mutation_prob=0.3, mutation_size=50, nbest=1, no_beamable_mm=False, no_early_stop=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_repeat_ngram_size=0, no_save=False, no_save_optimizer_state=False, no_token_positional_embeddings=False, num_workers=10, optimizer='adam', optimizer_overrides='{}', parent_size=25, path=None, pdb=False, population_size=125, prefix_size=0, print_alignment=False, profile_latency=False, qkv_dim=512, quiet=False, raw_text=False, remove_bpe=None, 
replace_unk=None, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='./downloaded_models/HAT_wmt14ende_super_space0.pt', results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, save_dir='checkpoints/wmt14.en-de/supertransformer/space0', save_interval=10, save_interval_updates=0, score_reference=False, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, t_mult=1, target_lang=None, task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='checkpoints/wmt14.en-de/supertransformer/space0/tensorboard', threshold_loss_scale=None, train_subset='train', unkpen=0, unnormalized=False, update_freq=[16], upsample_primary=1, use_bmuf=False, user_dir=None, valid_cnt_max=1000000000.0, valid_subset='valid', validate_interval=10, vocab_original_scaling=False, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.0, write_config_path='configs/wmt14.en-de/subtransformer/wmt14ende_i5.yml') | [en] dictionary: 32768 types | [de] dictionary: 32768 types | loaded 3000 examples from: data/binary/wmt16_en_de/valid.en-de.en | loaded 3000 examples from: data/binary/wmt16_en_de/valid.en-de.de | data/binary/wmt16_en_de valid en-de 3000 examples | Fallback to xavier initializer TransformerSuperModel( (encoder): TransformerEncoder( (embed_tokens): EmbeddingSuper(32768, 640, padding_idx=1) (embed_positions): SinusoidalPositionalEmbedding() (layers): ModuleList( (0): TransformerEncoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) (1): TransformerEncoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) (2): TransformerEncoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) (3): TransformerEncoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) (4): TransformerEncoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): 
LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) (5): TransformerEncoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) ) ) (decoder): TransformerDecoder( (embed_tokens): EmbeddingSuper(32768, 640, padding_idx=1) (embed_positions): SinusoidalPositionalEmbedding() (layers): ModuleList( (0): TransformerDecoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) (1): TransformerDecoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) (2): TransformerDecoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) (3): TransformerDecoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, 
bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) (4): TransformerDecoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) (5): TransformerDecoderLayer( (self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512 (out_proj): LinearSuper(in_features=512, out_features=640, bias=True) ) (encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) (fc1): LinearSuper(in_features=640, out_features=3072, bias=True) (fc2): LinearSuper(in_features=3072, out_features=640, bias=True) (final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True) ) ) ) ) | loaded checkpoint ./downloaded_models/HAT_wmt14ende_super_space0.pt (epoch 136 @ 0 updates) | loading train data for epoch 136 | loaded 3000 examples from: data/binary/wmt16_en_de/valid.en-de.en | loaded 3000 examples from: data/binary/wmt16_en_de/valid.en-de.de | data/binary/wmt16_en_de valid en-de 3000 examples | Start Iteration 0: Traceback (most recent call last):
File "evo_search.py", line 106, in cli_main() File "evo_search.py", line 102, in cli_main main(args) File "evo_search.py", line 51, in main best_config = evolver.run_evo_search() File "/home/hishan/hardware-aware-transformers/fairseq/evolution.py", line 217, in run_evo_search popu_scores = self.get_scores(popu) File "/home/hishan/hardware-aware-transformers/fairseq/evolution.py", line 281, in get_scores scores = validate_all(self.args, self.trainer, self.task, self.epoch_iter, configs) File "/home/hishan/hardware-aware-transformers/fairseq/evolution.py", line 401, in validate_all trainer.valid_step(sample) File "/home/hishan/hardware-aware-transformers/fairseq/trainer.py", line 451, in valid_step raise e File "/home/hishan/hardware-aware-transformers/fairseq/trainer.py", line 438, in valid_step _loss, sample_size, logging_output = self.task.valid_step( File "/home/hishan/hardware-aware-transformers/fairseq/tasks/fairseq_task.py", line 241, in valid_step loss, sample_size, logging_output = criterion(model, sample) File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl result = self.forward(input, kwargs) File "/home/hishan/hardware-aware-transformers/fairseq/criterions/label_smoothed_cross_entropy.py", line 56, in forward net_output = model(sample['net_input']) File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl result = self.forward(input, kwargs) File "/home/hishan/hardware-aware-transformers/fairseq/models/fairseq_model.py", line 222, in forward encoder_out = self.encoder(src_tokens, src_lengths=src_lengths, kwargs) File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl result = self.forward(*input, kwargs) File "/home/hishan/hardware-aware-transformers/fairseq/models/transformer_super.py", line 401, in forward x = layer(x, encoder_padding_mask) File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl result = self.forward(*input, *kwargs) File "/home/hishan/hardware-aware-transformers/fairseq/models/transformersuper.py", line 900, in forward x, = self.self_attn(query=x, key=x, value=x, key_padding_mask=encoder_padding_mask) File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl result = self.forward(input, kwargs) File "/home/hishan/hardware-aware-transformers/fairseq/modules/multihead_attention_super.py", line 182, in forward q, k, v = self.in_proj_qkv(query) File "/home/hishan/hardware-aware-transformers/fairseq/modules/multihead_attention_super.py", line 314, in in_proj_qkv return self._in_proj(query, sample_dim=self.sample_q_embed_dim).chunk(3, dim=-1) File "/home/hishan/hardware-aware-transformers/fairseq/modules/multihead_attention_super.py", line 351, in _in_proj return F.linear(input, weight, bias) File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1676, in linear output = input.matmul(weight.t()) RuntimeError: _th_addmm_out not supported on CPUType for Half

ihish52 commented 4 years ago

Still not sure what caused this. I created a new environment and reinstalled all dependencies with the exact versions stated in the requirements, and it worked. Closing this issue. Thanks for the response.

Hanrui-Wang commented 4 years ago

Hi ihish52,

Sorry for my late reply; I have been very busy over the past several weeks. The reason for the error is that some float16 operations are not supported by PyTorch on CPU, so I fixed it by using fp32 when performing the evolutionary search on CPU. (commit)
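
For anyone hitting this before updating, here is a minimal sketch of the idea behind the fix, assuming the cpu/fp16 fields shown in the Namespace dump above; the helper name is hypothetical and the actual change is in the linked commit:

import torch

def force_fp32_on_cpu(args):
    # Sketch only, not the actual commit: disable half precision whenever the
    # evolutionary search runs on CPU, since this PyTorch build has no CPU
    # kernels for Half matmul/addmm. 'cpu' and 'fp16' are the argparse fields
    # visible in the Namespace above.
    if args.cpu or not torch.cuda.is_available():
        args.fp16 = False
    return args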

Thanks for your contribution!

Best, Hanrui

Hanrui-Wang commented 3 years ago

Hi Hishan,

I will close the issue for now. Feel free to reopen if you have any further questions!

Best, Hanrui