nomic-ai / contrastors

Train Models Contrastively in PyTorch
Apache License 2.0

Questions about Training Specifics #33

Closed by sukjunhwang 5 months ago

sukjunhwang commented 5 months ago

Hi, thank you so much for the awesome work!

I have some questions about training details to check if I am getting it right.

Thank you so much :)

zanussbaum commented 5 months ago

Hey Sukjun, thanks for checking it out :)

  1. Yes, that's the correct pipeline we follow.
  2. The MLM dataset is pretokenized here. It's different from the data used for the unsupervised contrastive training.
  3. For the MLM pretraining you will need to change the config; I will note below where you need to update it for the 2048 run. If you don't care about that, you can pretrain with a smaller sequence length (see the sketch after this list). For the contrastive steps, yes, those configs should work as is.
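
If you do go the shorter-sequence route, something like the sketch below works for deriving a smaller-context config from the 2048 one further down. The file paths and the plain-YAML load/dump are just placeholders for illustration, not the trainer's own config handling.

# Rough illustration only: derive a shorter-context MLM config from the 2048 one.
# "configs/mlm_2048.yaml" and "configs/mlm_512.yaml" are placeholder paths.
import yaml

with open("configs/mlm_2048.yaml") as f:
    config = yaml.safe_load(f)

# A shorter context is much cheaper to pretrain than 2048.
config["model_args"]["seq_len"] = 512
# You'd likely also want tokenized_dataset in mlm_data_args to point at data
# tokenized to the shorter length (e.g. the pretokenized 128 dataset mentioned below).

with open("configs/mlm_512.yaml", "w") as f:
    yaml.safe_dump(config, f)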

Also, thanks for pointing that out; I forgot to make the dataset public. You can find the BERT dataset pretokenized at sequence length 128 here.

Below is the config for the 2048 pretraining. Apologies for leaving it out; it does take a bit of compute to run.

train_args:
  num_epochs: 160
  learning_rate: 5.0e-4
  adam_beta1: 0.9
  adam_beta2: 0.98
  weight_decay: 1.0e-5
  eps: 1.0e-6
  max_grad_norm: 0.0
  schedule_type: "linear"
  gradient_accumulation_steps: 8

  warmup_steps: null
  warmup_pct: 0.06
  cooldown_steps: null
  checkpoint: null

  wandb: false
  wandb_project_name: "bert"
  wandb_entity: "gpt4all"

  log_grads_every: 100
  log_lr_every: 10
  save_every: 30000
  eval_every: 30000
  output_dir: "ckpts/mlm-trainer"
  # if using deepspeed, this will be ignored
  pretrained: null
  pooling: "last"
  use_fp8: false

model_args:
  "model_type": "mlm"
  seq_len: 2048
  rotary_emb_fraction: 1.0
  pad_vocab_to_multiple_of: 64 
  use_rms_norm: false
  activation_function: "swiglu"
  tokenizer_name: "bert-base-uncased"
  model_name: "bert-base-uncased"
  qkv_proj_bias: false
  mlp_fc1_bias: false
  mlp_fc2_bias: false
  attn_pdrop: 0.0
  gradient_checkpointing: false

mlm_data_args:
  tokenized_dataset: "nomic-ai/nomic-bert-2048-pretraining-data"
  workers: 4
  batch_size: 512
  seed: 42
  shuffle: true
  mlm_prob: 0.30
  val_mlm_prob: 0.15
  val_pct: 0.01
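
If you want to sanity-check the MLM data side, here's a rough sketch of streaming a few rows of the pretokenized dataset and applying the same 30% dynamic masking with stock Hugging Face pieces. This is just for inspection, not the collator the trainer actually uses, and it assumes the Hub dataset is accessible to you and exposes an input_ids column.

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Stream so you don't have to download the whole corpus just to look at a batch.
ds = load_dataset(
    "nomic-ai/nomic-bert-2048-pretraining-data", split="train", streaming=True
)

# 30% dynamic masking, matching mlm_prob above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.30
)

it = iter(ds)
# Assumption: each pretokenized row carries an "input_ids" column.
examples = [{"input_ids": next(it)["input_ids"]} for _ in range(4)]

batch = collator(examples)
print(batch["input_ids"].shape)                  # (4, 2048) if rows are fixed length
print((batch["labels"] != -100).float().mean())  # fraction of masked positions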

Let me know if you run into any issues

sukjunhwang commented 5 months ago

Hi Zach, I really appreciate your kind and prompt answer! It's working now, thank you :)

zanussbaum commented 5 months ago

Great to hear, feel free to reach out if you run into any other issues!