venkatasg / Lil-Bevo

UT Austin's submission to BabyLM Challenge
https://huggingface.co/collections/venkatasg/babylm-653591cdb66f4bf68922873a
MIT License
2 stars 2 forks source link
babylm

Lil Bevo — UT Austin's submission to BabyLM Challenge

This repository contains code and instructions to build Lil Bevo — UT Austin's submission towards the BabyLM Challenge.

Python Environment

Install latest version of miniconda from here.

To recreate the exact python environment configuration in conda, run the following commands in order:

conda create -n bevo pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch-nightly -c nvidia jupyter pandas numpy matplotlib scikit-learn tqdm
pip install git+https://github.com/huggingface/transformers wandb ipdb datasets sentencepiece evaluate pytest accelerate mido

Scripts

training_bevo.py takes as argument any encoder style LM on the Huggingface Hub, and trains the model on babyLM data. First, concatenate all the train and dev files into one text file to pass as input to this script (cat babylm_data/babylm_10M/*.train > train.txt). Set the WANDB_PROJECT environment variable to lil-bevo and run.

export WANDB_PROJECT="lil-bevo"
python training_bevo.py --config_name microsoft/deberta-v3-small --tokenizer_name_or_path tokenizers/10m_maestro/ --train_file babylm_data/maestro/all-10M.txt --validation_file babylm_data/babylm_dev/dev.txt --per_device_train_batch_size 770 --per_device_eval_batch_size 128 --do_train --num_train_epochs 5 --do_eval --save_strategy epoch --optim adamw_torch_fused --warmup_ratio=0.0001 --weight_decay 0.1 --log_level error --learning_rate 5e-4 --evaluation_strategy steps --eval_steps 500 --output_dir deberta-small/redux/ --logging_steps 10 --save_total_limit 1 --overwrite_output_dir --torch_compile True --disable_tqdm False --max_seq_length 32 --report_to wandb

Evaluation

To setup evaluation pipeline as the BabyLM repo instructs, but in a separate conda environment:

git clone https://github.com/babylm/evaluation-pipeline
cd evaluation-pipeline
conda create -n babyeval python==3.9 pip git-lfs
conda activate babyeval
pip install --no-build-isolation -e ".[dev]"
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113 sentencepiece

Models

We trained two models, one for the strict-small track and another for the strict track:

We also pretrained some ablation models, details in [our paper](). You can find all of our models in this Huggingface collection

Training Regime for Lil-Bevo

  1. 5 epochs on MAESTRO dataset (85M non-language music tokens) combined with strict small dataset.
  2. 50 epochs of pretraining with sequence length of 128 on strict-small dataset.
  3. 2 epochs of targeted MLM.

Training Regime for Lil-Bevo-X

  1. 5 epochs on MAESTRO dataset (85M non-language music tokens) combined with strict small dataset.
  2. 50 epochs of pretraining with sequence length of 128 on strict dataset.
  3. 150 epochs of pretraining with sequence length of 512 on strict dataset.
  4. 10 epochs of targeted MLM.

Please read [our paper]() to get more details on our training regime and reasoning behind these decisions.

Results

DynaBench

Model Score
Lil-Bevo 0.64
Lil-Bevo-X 0.69
BLiMP Model Anaphor Agr. Agr. Structure Binding Control/Raising D-N Agr. Ellipsis Filler-Gap Irregular Forms Island Effects NPI Licensing Quantifiers S-V Agr.
Lil-Bevo 90.9 72.5 63.3 70.0 91.7 82.0 77.5 85.3 55.8 78.5 68.7 84.8
Lil-Bevo-X 97.2 80.6 63.9 69.5 96.4 87.0 78.4 89.2 71.4 85.6 63.2 86.3
BLiMP Supplement Model Hypernym QA Congruence (easy) QA Congruence (tricky) Subj.-Aux. Inversion Turn Taking
Lil-Bevo 48.1 82.8 57.0 76.5 68.2
Lil-Bevo-X 45.2 75.0 63.6 81.4 78.2
(Super)GLUE Model CoLA SST-2 MRPC (F1) QQP (F1) MNLI MNLI-mm QNLI RTE BoolQ MultiRC WSC
Lil-Bevo 73.7 88.4 82.2 85.5 75.4 76.3 81.6 46.5 65.4 66.0 61.5
Lil-Bevo-X 76.5 88.8 82.6 86.4 77.7 79.0 83.6 49.5 68.0 65.6 61.4
MSGS Model CR (Control) CR_LC CR_RTP LC (Control) MV (Control) MV_LC MV_RTP RP (Control) SC (Control) SC_LC SC_RP
Lil-Bevo 91.9 66.6 67.4 100.0 99.8 75.7 78.0 93.8 91.5 65.7 64.2
Lil-Bevo-X 92.5 66.5 68.5 100.0 100.0 66.7 68.5 99.1 90.0 68.2 64.7
Age-of-acquisition Prediction (Mean absolute deviation in months across LOO cross-validation folds) Model Overall (591 words) Nouns (322) Predicates (167) Function words (102)
Lil-Bevo 2.06 2.0 1.84 2.65
Lil-Bevo-X 2.05 1.99 1.85 2.59