A sequential distillation test was failing with a device mismatch error when run on multiple GPUs. The cause: the student model was pinned to "cuda:0" while the teacher model defaulted to "auto" device mapping, so the two models' tensors could end up on different devices during the forward pass. The fix is to initialize both models with device_map="auto", so that transformers handles multi-GPU placement automatically.
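A minimal sketch of the fix, assuming a Hugging Face transformers setup with accelerate installed; the checkpoint names (STUDENT_CKPT, TEACHER_CKPT) are placeholders for illustration, not the models from the actual test:

```python
from transformers import AutoModelForCausalLM

# Placeholder checkpoints -- substitute the student/teacher models
# actually used in the distillation test.
STUDENT_CKPT = "distilgpt2"
TEACHER_CKPT = "gpt2-large"

# Before (buggy): the student was pinned to a single device while the
# teacher was sharded across GPUs, so their tensors could end up on
# different devices mid-forward.
# student = AutoModelForCausalLM.from_pretrained(STUDENT_CKPT).to("cuda:0")

# After (fixed): let transformers/accelerate place both models.
# device_map="auto" shards each model across the available GPUs and
# inserts the necessary cross-device transfers automatically.
student = AutoModelForCausalLM.from_pretrained(STUDENT_CKPT, device_map="auto")
teacher = AutoModelForCausalLM.from_pretrained(TEACHER_CKPT, device_map="auto")
teacher.eval()  # teacher weights stay frozen during distillation
```

With device_map="auto", input tensors typically only need to be moved to the device of each model's first shard (e.g. inputs.to(student.device)); accelerate's dispatch hooks handle transfers between shards from there.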