nomic-ai / contrastors

Train Models Contrastively in PyTorch
Apache License 2.0

Questions about Training Specifics #33

Closed by sukjunhwang 5 months ago

sukjunhwang commented 5 months ago

Hi, thank you so much for the awesome work!

I have some questions about training details to check if I am getting it right.

Thank you so much :)

zanussbaum commented 5 months ago

Hey Sukjun, thanks for checking it out :)

  1. Yes, that's the correct pipeline we follow.
  2. The MLM dataset is pretokenized here. It's different from the data used for the unsupervised contrastive training.
  3. For the MLM pretraining you will need to change the config; I will note below where you need to update it for the 2048 run. If you don't care about that, you can pretrain with a smaller sequence length (see the sketch after this list). For the contrastive steps, yes, those configs should work as is.
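
If you do go the shorter-sequence route, something like the sketch below works for deriving a smaller-context config from the 2048 one further down. The file paths and the plain-YAML load/dump are just placeholders for illustration, not the trainer's own config handling.

# Rough illustration only: derive a shorter-context MLM config from the 2048 one.
# "configs/mlm_2048.yaml" and "configs/mlm_512.yaml" are placeholder paths.
import yaml

with open("configs/mlm_2048.yaml") as f:
    config = yaml.safe_load(f)

# A shorter context is much cheaper to pretrain than 2048.
config["model_args"]["seq_len"] = 512
# You'd likely also want tokenized_dataset in mlm_data_args to point at data
# tokenized to the shorter length (e.g. the pretokenized 128 dataset mentioned below).

with open("configs/mlm_512.yaml", "w") as f:
    yaml.safe_dump(config, f)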

Also, thanks for pointing that out; I forgot to make the dataset public. You can find the BERT dataset pretokenized at sequence length 128 here.

Below is the config for the 2048 pretraining. Apologies for leaving it out; it does take a bit of compute to run.

train_args:
  num_epochs: 160
  learning_rate: 5.0e-4
  adam_beta1: 0.9
  adam_beta2: 0.98
  weight_decay: 1.0e-5
  eps: 1.0e-6
  max_grad_norm: 0.0
  schedule_type: "linear"
  gradient_accumulation_steps: 8

  warmup_steps: null
  warmup_pct: 0.06
  cooldown_steps: null
  checkpoint: null

  wandb: false
  wandb_project_name: "bert"
  wandb_entity: "gpt4all"

  log_grads_every: 100
  log_lr_every: 10
  save_every: 30000
  eval_every: 30000
  output_dir: "ckpts/mlm-trainer"
  # if using deepspeed, this will be ignored
  pretrained: null
  pooling: "last"
  use_fp8: false

model_args:
  "model_type": "mlm"
  seq_len: 2048
  rotary_emb_fraction: 1.0
  pad_vocab_to_multiple_of: 64 
  use_rms_norm: false
  activation_function: "swiglu"
  tokenizer_name: "bert-base-uncased"
  model_name: "bert-base-uncased"
  qkv_proj_bias: false
  mlp_fc1_bias: false
  mlp_fc2_bias: false
  attn_pdrop: 0.0
  gradient_checkpointing: false

mlm_data_args:
  tokenized_dataset: "nomic-ai/nomic-bert-2048-pretraining-data"
  workers: 4
  batch_size: 512
  seed: 42
  shuffle: true
  mlm_prob: 0.30
  val_mlm_prob: 0.15
  val_pct: 0.01
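
If you want to sanity-check the MLM data side, here's a rough sketch of streaming a few rows of the pretokenized dataset and applying the same 30% dynamic masking with stock Hugging Face pieces. This is just for inspection, not the collator the trainer actually uses, and it assumes the Hub dataset is accessible to you and exposes an input_ids column.

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Stream so you don't have to download the whole corpus just to look at a batch.
ds = load_dataset(
    "nomic-ai/nomic-bert-2048-pretraining-data", split="train", streaming=True
)

# 30% dynamic masking, matching mlm_prob above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.30
)

it = iter(ds)
# Assumption: each pretokenized row carries an "input_ids" column.
examples = [{"input_ids": next(it)["input_ids"]} for _ in range(4)]

batch = collator(examples)
print(batch["input_ids"].shape)                  # (4, 2048) if rows are fixed length
print((batch["labels"] != -100).float().mean())  # fraction of masked positions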

Let me know if you run into any issues

sukjunhwang commented 5 months ago

Hi Zach, I really appreciate your kind and prompt answer! It's working now, thank you :)

zanussbaum commented 5 months ago

Great to hear, feel free to reach out if you run into any other issues!