Hey Sukjun, thanks for checking it out :)
Also, thanks for pointing that out; I forgot to make the dataset public. You can find the pretokenized 128 sequence length BERT dataset here.
Below is the config for the 2048 pretraining. Apologies for keeping it out; that run does take a fair bit of compute.
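In case it saves you a step, pulling the pretokenized data down is a single call with the datasets library. A minimal sketch, assuming the link above points to the nomic-ai/bert-128-grouped repo you asked about:

from datasets import load_dataset

# Stream the pretokenized 128-length data from the Hub.
# The repo id is an assumption taken from the question; swap in whatever the link above actually points to.
ds = load_dataset("nomic-ai/bert-128-grouped", split="train", streaming=True)
print(next(iter(ds)).keys())  # expect pretokenized fields such as input_ids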
train_args:
  num_epochs: 160
  learning_rate: 5.0e-4
  adam_beta1: 0.9
  adam_beta2: 0.98
  weight_decay: 1.0e-5
  eps: 1.0e-6
  max_grad_norm: 0.0
  schedule_type: "linear"
  gradient_accumulation_steps: 8
  warmup_steps: null
  warmup_pct: 0.06
  cooldown_steps: null
  checkpoint: null
  wandb: false
  wandb_project_name: "bert"
  wandb_entity: "gpt4all"
  log_grads_every: 100
  log_lr_every: 10
  save_every: 30000
  eval_every: 30000
  output_dir: "ckpts/mlm-trainer"
  # if using deepspeed, this will be ignored
  pretrained: null
  pooling: "last"
  use_fp8: false

model_args:
  model_type: "mlm"
  seq_len: 2048
  rotary_emb_fraction: 1.0
  pad_vocab_to_multiple_of: 64
  use_rms_norm: false
  activation_function: "swiglu"
  tokenizer_name: "bert-base-uncased"
  model_name: "bert-base-uncased"
  qkv_proj_bias: false
  mlp_fc1_bias: false
  mlp_fc2_bias: false
  attn_pdrop: 0.0
  gradient_checkpointing: false

mlm_data_args:
  tokenized_dataset: "nomic-ai/nomic-bert-2048-pretraining-data"
  workers: 4
  batch_size: 512
  seed: 42
  shuffle: true
  mlm_prob: 0.30
  val_mlm_prob: 0.15
  val_pct: 0.01
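For a quick sanity check on the numbers, here is a rough sketch of how the effective batch size and warmup fall out of this config. PyYAML is assumed, the config path is a placeholder for wherever you save it, and this is illustrative arithmetic rather than code from the repo:

import yaml

# Placeholder path: point this at wherever you saved the config above.
with open("mlm-2048.yaml") as f:
    cfg = yaml.safe_load(f)

train = cfg["train_args"]
data = cfg["mlm_data_args"]

# Effective batch per optimizer step = dataloader batch size * gradient accumulation steps
# (times the number of data-parallel workers, if any): 512 * 8 = 4096 sequences of length 2048.
effective_batch = data["batch_size"] * train["gradient_accumulation_steps"]
print(effective_batch)

# warmup_steps is null, so the warmup length comes from warmup_pct:
# roughly the first 6% of optimizer steps, with schedule_type "linear" governing the rest.
print(train["warmup_pct"])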
Let me know if you run into any issues.
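One extra note in case it is useful: the masking rate here is higher than the classic BERT default for training (mlm_prob 0.30, with val_mlm_prob 0.15 for validation). A rough transformers equivalent, just to illustrate those two knobs rather than how the trainer masks internally:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # tokenizer_name from model_args
# mlm_prob: 0.30 for training batches, val_mlm_prob: 0.15 for validation batches.
train_collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.30)
val_collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.15)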
Hi Zach, I really appreciate your kind and prompt answer! It's working now, thank you :)
Great to hear, feel free to reach out if you run into any other issues!
Hi, thank you so much for the awesome work!
I have some questions about training details to check if I am getting it right.
Were the released models trained with configs/train/mlm.yaml, configs/train/contrastive_pretrain.yaml, and configs/train/contrastive_finetune.yaml? Also, I could not find the nomic-ai/bert-128-grouped dataset; is it available anywhere?
Thank you so much :)