Evaluate the capabilities of Qwen2.5, which surpasses LLaMA 3.1 across all English benchmarks and demonstrates particularly strong performance on Asian languages such as Vietnamese, as well as Singlish.
Assess the effectiveness of LoRA training in teaching the model to recognize and process sound tokens.
Methodology
Change the base model of Ichigo from Llama 3.1 to Qwen2.5 32B
Due to model size constraints, we employed LoRA adapters across all linear layers for both the continued pretraining and supervised fine-tuning steps, while fully fine-tuning the embedding and LM head layers to accommodate 513 new sound tokens. Following Qwen's methodology, we integrated control tokens into the embedding layer without modifying the tokenizer (as discussed in Qwen2.5 Issue #29), which, according to the Qwen authors, optimizes training performance (see NVIDIA's matrix multiplication guidelines: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html). Accordingly, we padded the embedding layer and LM head dimensions to a multiple of 128, resulting in a final embedding size of 152,192 tokens.
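To make the padding arithmetic concrete, here is a small sketch. The base tokenizer length of 151,665 (151,643 BPE tokens plus 22 control tokens) is an assumption about Qwen2.5's public tokenizer; in practice, Hugging Face Transformers performs this padding via `model.resize_token_embeddings(new_size, pad_to_multiple_of=128)`.

```python
def round_up_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple (NVIDIA recommends multiples
    of 128 for efficient tensor-core matrix multiplication)."""
    return ((n + multiple - 1) // multiple) * multiple

# Assumed base tokenizer length for Qwen2.5 (151,643 BPE + 22 control tokens).
BASE_VOCAB = 151_665
NEW_SOUND_TOKENS = 513

padded = round_up_to_multiple(BASE_VOCAB + NEW_SOUND_TOKENS, 128)
print(padded)  # 152192, matching the final embedding size above
```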