microsoft / evodiff

Generation of protein sequences and evolutionary alignments via discrete diffusion models
MIT License
526 stars · 73 forks

padding_idx=masking_idx in ByteNetLMTime instantiation arguments #32

Closed cmarkak closed 7 months ago

cmarkak commented 9 months ago

The following code in train.py passes padding_idx=masking_idx when instantiating the model. This conflicts with the definition above, where padding_idx is different from masking_idx. Is this an oversight, or is there a particular reason for this assignment?

```python
padding_idx = tokenizer.pad_id  # PROTEIN_ALPHABET.index(PAD)
masking_idx = tokenizer.mask_id
print('Using {} as padding index'.format(padding_idx))
print('Using {} as masking index'.format(masking_idx))
#if args.model_type == 'ByteNet':
model = ByteNetLMTime(n_tokens, d_embed, d_model, n_layers, kernel_size, r,
                      causal=causal, padding_idx=masking_idx, rank=weight_rank,
                      dropout=args.dropout, tie_weights=args.tie_weights,
                      final_ln=args.final_norm, slim=slim, activation=activation,
                      timesteps=diffusion_timesteps)
```

Thank you in advance

sarahalamdari commented 7 months ago

This is done on purpose: we follow how ESM handles mask tokens. Padding is handled separately with input_mask.
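For context, a minimal sketch of the pattern described above (the token indices here are illustrative assumptions, not EvoDiff's actual vocabulary). In PyTorch, `nn.Embedding(padding_idx=...)` zero-initializes that embedding row and keeps its gradient at zero, so passing the mask index there gives the mask token a fixed all-zero embedding; padding positions are instead zeroed out with a boolean `input_mask`:

```python
import torch
import torch.nn as nn

# Illustrative sizes and indices (assumptions for this sketch).
n_tokens, d_embed = 30, 8
pad_id, mask_id = 28, 29

# padding_idx=mask_id: the MASK row is zero-initialized and receives no
# gradient updates, mirroring how ESM treats its mask token.
embed = nn.Embedding(n_tokens, d_embed, padding_idx=mask_id)
assert torch.all(embed.weight[mask_id] == 0)

# Padding is handled separately: build an input_mask and zero padded positions.
tokens = torch.tensor([[3, 5, mask_id, pad_id]])       # batch of one sequence
input_mask = (tokens != pad_id).unsqueeze(-1).float()  # 1 = real token, 0 = PAD
x = embed(tokens) * input_mask                         # PAD embeddings zeroed out
assert torch.all(x[0, 3] == 0)
```

So the `padding_idx` argument is being reused purely for its "frozen zero embedding" side effect on the mask token, while actual padding never relies on it.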