Evaluate the capabilities of Qwen2.5, which surpasses LLaMA 3.1 across all English benchmarks and demonstrates particularly strong performance on Asian languages such as Vietnamese, as well as Singlish.
Assess the effectiveness of LoRA training in teaching the model to recognize and process sound tokens.
Methodology
Change the base model of Ichigo from Llama 3.1 to Qwen2.5 32B
Due to model size constraints, we employed LoRA adapters across all linear layers for both the continued pretraining and supervised fine-tuning steps, while fully fine-tuning the embedding and LM head layers to accommodate 513 new sound tokens. Following Qwen's methodology, we integrated control tokens into the embedding layer without modifying the tokenizer (as discussed in Qwen2.5 Issue #29), which, according to the Qwen authors, optimizes training performance (see NVIDIA's matrix multiplication guidelines: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html). Accordingly, we padded the embedding layer and LM head dimensions to a multiple of 128, resulting in a final embedding size of 152,192 tokens.
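To make the padding arithmetic concrete, here is a small sketch. The base tokenizer length of 151,665 (151,643 BPE tokens plus 22 control tokens) is an assumption about Qwen2.5's public tokenizer; in practice, Hugging Face Transformers performs this padding via `model.resize_token_embeddings(new_size, pad_to_multiple_of=128)`.

```python
def round_up_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple (NVIDIA recommends multiples
    of 128 for efficient tensor-core matrix multiplication)."""
    return ((n + multiple - 1) // multiple) * multiple

# Assumed base tokenizer length for Qwen2.5 (151,643 BPE + 22 control tokens).
BASE_VOCAB = 151_665
NEW_SOUND_TOKENS = 513

padded = round_up_to_multiple(BASE_VOCAB + NEW_SOUND_TOKENS, 128)
print(padded)  # 152192, matching the final embedding size above
```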