Lil Bevo: UT Austin's submission to the BabyLM Challenge

This repository contains code and instructions to build Lil Bevo, UT Austin's submission to the BabyLM Challenge.

Python Environment

Install the latest version of Miniconda.

To recreate the Python environment in conda, run the following commands in order:

conda create -n bevo pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch-nightly -c nvidia jupyter pandas numpy matplotlib scikit-learn tqdm
conda activate bevo
pip install git+https://github.com/huggingface/transformers wandb ipdb datasets sentencepiece evaluate pytest accelerate mido
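
After running the commands above, a quick import check can confirm that the environment is complete. The sketch below is not part of this repository: the helper name missing_packages and the list of import names are assumptions (for instance, scikit-learn is imported as sklearn and pytorch as torch).

```python
from importlib.util import find_spec

# Import names for the packages installed above. This list is an assumption:
# scikit-learn is imported as "sklearn" and pytorch as "torch".
REQUIRED = [
    "torch", "torchvision", "torchaudio", "pandas", "numpy", "matplotlib",
    "sklearn", "tqdm", "transformers", "wandb", "ipdb", "datasets",
    "sentencepiece", "evaluate", "pytest", "accelerate", "mido",
]


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("All packages found." if not missing else f"Missing: {missing}")
```

Run it inside the activated bevo environment; any name it reports as missing points to an install step that did not complete.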

Scripts

training_bevo.py takes as an argument any encoder-style LM on the Huggingface Hub and trains the model on BabyLM data. First, concatenate the train files into one text file, and the dev files into another, to pass as input to this script (for example, cat babylm_data/babylm_10M/*.train > train.txt). Then set the WANDB_PROJECT environment variable to lil-bevo and run:

export WANDB_PROJECT="lil-bevo"
python training_bevo.py \
    --config_name microsoft/deberta-v3-small \
    --tokenizer_name_or_path tokenizers/10m_maestro/ \
    --train_file babylm_data/maestro/all-10M.txt \
    --validation_file babylm_data/babylm_dev/dev.txt \
    --per_device_train_batch_size 770 \
    --per_device_eval_batch_size 128 \
    --do_train \
    --num_train_epochs 5 \
    --do_eval \
    --save_strategy epoch \
    --optim adamw_torch_fused \
    --warmup_ratio=0.0001 \
    --weight_decay 0.1 \
    --log_level error \
    --learning_rate 5e-4 \
    --evaluation_strategy steps \
    --eval_steps 500 \
    --output_dir deberta-small/redux/ \
    --logging_steps 10 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --torch_compile True \
    --disable_tqdm False \
    --max_seq_length 32 \
    --report_to wandb
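
The concatenation step described above can also be done in Python, which helps on systems without cat. This is a minimal sketch and not a script from this repository; the function name concatenate_files is hypothetical, and it assumes the data files are UTF-8 text.

```python
from pathlib import Path


def concatenate_files(input_dir, pattern, output_path):
    """Concatenate all files in input_dir matching pattern into output_path.

    Files are written in sorted name order. Returns the number of files
    that were concatenated.
    """
    files = sorted(Path(input_dir).glob(pattern))
    with open(output_path, "w", encoding="utf-8") as out:
        for path in files:
            # Assumes each input file is UTF-8 encoded text.
            out.write(path.read_text(encoding="utf-8"))
    return len(files)
```

For example, concatenate_files("babylm_data/babylm_10M", "*.train", "train.txt") mirrors the cat command shown earlier.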

Evaluation

To set up the evaluation pipeline as the BabyLM repo instructs, but in a separate conda environment:

git clone https://github.com/babylm/evaluation-pipeline
cd evaluation-pipeline
conda create -n babyeval python==3.9 pip git-lfs
conda activate babyeval
pip install --no-build-isolation -e ".[dev]"
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113 sentencepiece

Models

We trained two models, one for the strict-small track and another for the strict track:

Lil-Bevo is based on a deberta-v3-small model and has 55M parameters with a vocab size of 16640.
Lil-Bevo-X is based on a deberta-v3-base model and has 112M parameters with a vocab size of 33280.
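
To get a feel for how much of each parameter budget the vocabulary accounts for, note that an input embedding matrix has vocab size × hidden size weights. The sketch below assumes a hidden size of 768 for both backbones, a value taken from the public DeBERTa-v3 model cards rather than from this repository. It uses the rounded totals above and ignores every other component, so treat the result as a rough estimate only.

```python
# Hidden size assumed from the public DeBERTa-v3 small/base model cards;
# it is not stated in this README.
HIDDEN_SIZE = 768

# Vocab sizes and (rounded) parameter totals as listed above.
MODELS = {
    "Lil-Bevo": {"vocab_size": 16640, "total_params": 55_000_000},
    "Lil-Bevo-X": {"vocab_size": 33280, "total_params": 112_000_000},
}


def embedding_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


for name, info in MODELS.items():
    emb = embedding_params(info["vocab_size"])
    share = emb / info["total_params"]
    print(f"{name}: {emb:,} embedding weights (~{share:.0%} of total)")
```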

We also pretrained some ablation models; details are in [our paper](). You can find all of our models in this Huggingface collection.

Training Regime for Lil-Bevo

5 epochs on the MAESTRO dataset (85M non-language music tokens) combined with the strict-small dataset.
50 epochs of pretraining with a sequence length of 128 on the strict-small dataset.
2 epochs of targeted MLM.

Training Regime for Lil-Bevo-X

5 epochs on the MAESTRO dataset (85M non-language music tokens) combined with the strict-small dataset.
50 epochs of pretraining with a sequence length of 128 on the strict dataset.
150 epochs of pretraining with a sequence length of 512 on the strict dataset.
10 epochs of targeted MLM.
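
The two regimes above can be written down as plain data, which makes them easy to compare or to loop over when launching stages one after another. This is only an illustrative sketch: the Stage class and its field names are hypothetical and do not come from training_bevo.py, and sequence lengths are left as None wherever the lists above give no value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    description: str
    epochs: int
    max_seq_length: Optional[int] = None  # None where the list gives no value


LIL_BEVO = [
    Stage("MAESTRO + strict-small", 5),
    Stage("pretraining on strict-small", 50, 128),
    Stage("targeted MLM", 2),
]

LIL_BEVO_X = [
    Stage("MAESTRO + strict-small", 5),
    Stage("pretraining on strict", 50, 128),
    Stage("pretraining on strict", 150, 512),
    Stage("targeted MLM", 10),
]


def total_epochs(stages):
    """Sum the epochs over all stages of a training regime."""
    return sum(stage.epochs for stage in stages)


for name, regime in [("Lil-Bevo", LIL_BEVO), ("Lil-Bevo-X", LIL_BEVO_X)]:
    print(f"{name}: {len(regime)} stages, {total_epochs(regime)} epochs in total")
```

The epoch totals are a simple tally. Stages run on datasets of different sizes, so the totals do not measure compute directly.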

Please read [our paper]() for more details on our training regime and the reasoning behind these decisions.

Results

DynaBench

Model	Score
Lil-Bevo	0.64
Lil-Bevo-X	0.69

BLiMP

Model	Anaphor Agr.	Arg. Structure	Binding	Control/Raising	D-N Agr.	Ellipsis	Filler-Gap	Irregular Forms	Island Effects	NPI Licensing	Quantifiers	S-V Agr.
Lil-Bevo	90.9	72.5	63.3	70.0	91.7	82.0	77.5	85.3	55.8	78.5	68.7	84.8
Lil-Bevo-X	97.2	80.6	63.9	69.5	96.4	87.0	78.4	89.2	71.4	85.6	63.2	86.3

BLiMP Supplement

Model	Hypernym	QA Congruence (easy)	QA Congruence (tricky)	Subj.-Aux. Inversion	Turn Taking
Lil-Bevo	48.1	82.8	57.0	76.5	68.2
Lil-Bevo-X	45.2	75.0	63.6	81.4	78.2

(Super)GLUE

Model	CoLA	SST-2	MRPC (F1)	QQP (F1)	MNLI	MNLI-mm	QNLI	RTE	BoolQ	MultiRC	WSC
Lil-Bevo	73.7	88.4	82.2	85.5	75.4	76.3	81.6	46.5	65.4	66.0	61.5
Lil-Bevo-X	76.5	88.8	82.6	86.4	77.7	79.0	83.6	49.5	68.0	65.6	61.4

MSGS

Model	CR (Control)	CR_LC	CR_RTP	LC (Control)	MV (Control)	MV_LC	MV_RTP	RP (Control)	SC (Control)	SC_LC	SC_RP
Lil-Bevo	91.9	66.6	67.4	100.0	99.8	75.7	78.0	93.8	91.5	65.7	64.2
Lil-Bevo-X	92.5	66.5	68.5	100.0	100.0	66.7	68.5	99.1	90.0	68.2	64.7

Age-of-acquisition Prediction (mean absolute deviation in months across LOO cross-validation folds)

Model	Overall (591 words)	Nouns (322)	Predicates (167)	Function words (102)
Lil-Bevo	2.06	2.0	1.84	2.65
Lil-Bevo-X	2.05	1.99	1.85	2.59
