huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

deepspeed zero3 NVMe offload is not working on Paligemma #34429

Open eljandoubi opened 3 weeks ago

eljandoubi commented 3 weeks ago

System Info

transformers==4.46.0
accelerate==1.0.1
sentencepiece==0.2.0
deepspeed==0.15.3


Who can help?

@muellerz @SunMarc @ArthurZucker @amyeroberts @qubvel


Reproduction

accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: auto
  gradient_clipping: 1.0
  offload_optimizer_device: nvme
  offload_param_device: nvme
  zero3_init_flag: true
  zero3_save_16bit_model: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: 0.0.0.0
main_process_port: 0
main_training_function: main
mixed_precision: bf16
num_machines: 3
num_processes: 24
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
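One hedged observation (an assumption on my part; nothing in the thread confirms it): DeepSpeed's NVMe offload also needs an explicit path to the NVMe mount, and the config above sets offload_param_device: nvme and offload_optimizer_device: nvme without one. In an accelerate config this is usually supplied via the *_nvme_path keys under deepspeed_config. A sketch, with /local_nvme as a placeholder mount point:

```yaml
deepspeed_config:
  offload_optimizer_device: nvme
  offload_optimizer_nvme_path: /local_nvme   # placeholder mount point, adjust to your NVMe drive
  offload_param_device: nvme
  offload_param_nvme_path: /local_nvme       # placeholder mount point, adjust to your NVMe drive
```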

================================================

I launch the code using

accelerate launch --config_file config.yml code.py

where code.py is:

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

model_kwgs = {
    "pretrained_model_name_or_path": "local_folder/contains/paligemma-3b-pt-896",
    "trust_remote_code": True,
}

model = AutoModelForVision2Seq.from_pretrained(**model_kwgs)

Expected behavior

The model should load with its parameters offloaded to NVMe.
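If it helps triage, here is a sketch of how ZeRO-3 loading works outside of Trainer, under my assumption about the cause (not a confirmed fix): transformers only shards and offloads weights at load time when a HfDeepSpeedConfig object is alive before from_pretrained runs; otherwise each rank materializes the full model first. The ds_config dict and the /local_nvme path below are illustrative placeholders:

```python
# Minimal DeepSpeed config enabling ZeRO-3 with NVMe parameter offload.
# "/local_nvme" is a placeholder path, not taken from the issue.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    "train_micro_batch_size_per_gpu": 1,
}

try:
    from transformers import AutoModelForVision2Seq
    from transformers.integrations import HfDeepSpeedConfig

    # The HfDeepSpeedConfig must be constructed BEFORE from_pretrained and
    # kept alive while the model loads; it is what turns on zero.Init so
    # weights are partitioned/offloaded during loading rather than after.
    dschf = HfDeepSpeedConfig(ds_config)
    model = AutoModelForVision2Seq.from_pretrained(
        "local_folder/contains/paligemma-3b-pt-896"
    )
except Exception:
    # transformers/deepspeed not installed, or the local checkpoint is absent
    pass
```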

ArthurZucker commented 2 weeks ago

Hey! Is there a reason you are using trust_remote_code = True? That would use the code from the Hub, not the native transformers implementation!

eljandoubi commented 2 weeks ago

@ArthurZucker I set trust_remote_code = True to bypass the warnings, but it had no effect on the error.