princeton-nlp / LESS

[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
MIT License

OOM Errors in Step 3.1 #20

Closed ChangyuChen347 closed 2 months ago

ChangyuChen347 commented 2 months ago

Hello, I encountered an Out of Memory (OOM) error at step 3.1. My configuration is a single A100 GPU with 80 GB of memory. Reducing max_length to 128 lets me avoid the OOM error. I would like to know if this is a reasonable approach and whether there are other ways to resolve the issue.

Here is the command I am using:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m less.data_selection.get_info \
    --task $task \
    --info_type grads \
    --model_path $model \
    --output_path $output_path \
    --gradient_projection_dimension $dims \
    --gradient_type sgd \
    --data_dir $data_dir \
    --max_length 128
```

Haruka1307 commented 2 months ago

I met the same error. Check your log: did fast_jl import successfully? If it prints "Use basic projector", then fast_jl did not actually import. When generating the eval grads, GPU 0 is used, and the model and dataloader already occupy its memory, so the OOM happens during the loss/gradient computation. You may move the grads to another available GPU in collect_grad_reps.py; see the sketch below.

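For reference, here is a minimal sketch of what "move the grads to another available GPU" could look like in collect_grad_reps.py. The helper name `store_grad` and the variable `grad_device` are illustrative assumptions, not identifiers from the repository; the idea is simply to keep the accumulated projected gradients (and to check the fast_jl import) away from the GPU that already holds the model and the dataloader batches.

```python
import torch

# Sanity check: if this import fails, the log will show "Use basic projector".
try:
    import fast_jl  # noqa: F401  -- CUDA projection kernels used by the fast projector
    print("fast_jl imported successfully")
except ImportError:
    print("Use basic projector")

# Assumption: a second GPU (cuda:1) is visible; otherwise stay on cuda:0.
grad_device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cuda:0")

def store_grad(projected_grad: torch.Tensor, buffer: list) -> None:
    # Move each projected gradient off the model's GPU before accumulating it,
    # so gradient storage does not compete with activations for cuda:0 memory.
    buffer.append(projected_grad.to(grad_device, non_blocking=True))
```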
ChangyuChen347 commented 2 months ago

> I met the same error. Check your log: did fast_jl import successfully? If it prints "Use basic projector", then fast_jl did not actually import. When generating the eval grads, GPU 0 is used, and the model and dataloader already occupy its memory, so the OOM happens during the loss/gradient computation. You may move the grads to another available GPU in collect_grad_reps.py.

Thanks. After switching to the CUDA projector, there are no more OOM issues.
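For anyone else landing here: whether you get the CUDA projector comes down to whether fast_jl imports. Below is a hedged sketch of the try/fall-back pattern, written against the trak library's `CudaProjector` and `BasicProjector` classes; the constructor arguments are written from memory and may differ across trak versions, and `get_projector` is an illustrative name rather than the repository's own helper.

```python
import torch
from trak.projectors import BasicProjector, CudaProjector, ProjectionType

def get_projector(device: torch.device, grad_dim: int, proj_dim: int = 8192):
    """Prefer the fast_jl-backed CUDA projector; fall back to the basic one."""
    try:
        import fast_jl  # noqa: F401  -- required by CudaProjector's CUDA kernels
        print("Using CudaProjector")
        return CudaProjector(grad_dim, proj_dim, seed=0,
                             proj_type=ProjectionType.rademacher,
                             device=device, max_batch_size=8)
    except ImportError:
        # slower path; the OOM in this thread went away after switching off it
        print("Use basic projector")
        return BasicProjector(grad_dim, proj_dim, seed=0,
                              proj_type=ProjectionType.rademacher,
                              device=device)
```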

DDrShieh commented 1 month ago

@Haruka1307 Have you ever fixed the OOM in step 3.1 described in https://github.com/princeton-nlp/LESS/issues/19#issue-2317088583? Your solution doesn't work in that situation. Any advice for this?