Open wonkyoc opened 5 days ago
Okay. I found that device_map
actually only offloads the model weights, not the execution as well. If there is a GPU, then the GPU is the main priority for executing the model.
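A minimal sketch of that distinction, using toy stand-in classes (FakeTensor, Layer, and the string devices here are illustrative, not the real accelerate implementation): the device map controls only where a weight is *stored*, while a pre-forward hook copies both the weight and the input to the execution device before computing.

```python
# Conceptual sketch of big-model inference with a device_map.
# FakeTensor and Layer are simplified stand-ins, NOT accelerate's real code.

class FakeTensor:
    def __init__(self, data, device):
        self.data, self.device = data, device

    def to(self, device):
        # Copying between devices is what shows up as memcpy time in a trace.
        return FakeTensor(self.data, device)

class Layer:
    def __init__(self, weight, storage_device):
        # device_map controls only where the weight is *stored*...
        self.weight = FakeTensor(weight, storage_device)

    def forward(self, x, execution_device):
        # ...but a pre-forward hook moves weight and input to the
        # execution device (the GPU, if one exists) before computing.
        w = self.weight.to(execution_device)
        x = x.to(execution_device)
        return FakeTensor(x.data * w.data, execution_device)

layer = Layer(weight=3, storage_device="cpu")  # weight offloaded to CPU
out = layer.forward(FakeTensor(2, "cpu"), execution_device="cuda:0")
print(out.device, out.data)  # execution still happened on cuda:0
```

So even a CPU-offloaded layer produces its output on the GPU, which matches the behavior described in the issue.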
Correct, that's how our big model inference works.
cc @SunMarc
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Expected behavior
What I want to see is my device map working correctly. I put down_blocks.0 on cuda and down_blocks.1 on cpu, but it does not seem to work the way I intend.
If you look at the screenshot, down_blocks.1 (CrossAttnDownBlock 1) still calls cudaMalloc in Attention, which it shouldn't if it were executing on CPU. I do see a long copy time, which I believe is a cuda -> CPU copy based on the device map, so a hook seems to be trying to use the CPU, but I don't understand why a new hook still uses cuda. The same thing happens in the other layers I assigned to CPU execution; because of this, every pre/post forward copies data.
Is this a bug?
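For reference, a device map like the one described would look roughly like this (the key names follow the UNet block naming mentioned above and are assumptions; the exact keys depend on the model). As discussed, this only pins where each submodule's weights live, not where its forward pass runs:

```python
# Hypothetical device_map splitting a UNet across devices.
# Keys follow the down_blocks naming from the issue; adjust to your model.
device_map = {
    "down_blocks.0": "cuda:0",  # weights stored on GPU
    "down_blocks.1": "cpu",     # weights offloaded to CPU
}
```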
Another side question is how much data is moved. Although I figured out that
set_module_tensor_to_device()
and send_to_device()
are responsible for the data copies, it is not clear to me whether these functions copy only the output of the previous child layer, or the entire layers within a block (e.g., CrossAttnDownBlock).
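To illustrate the question, here is a simplified sketch of what a send_to_device-style helper does (a pure-Python stand-in, not accelerate's actual implementation): it recursively walks a nested output structure and moves only the tensors it finds, leaving non-tensor values alone. Under that assumption, each call copies just one layer's output, not a whole block's parameters.

```python
# Simplified stand-in for send_to_device: recursively move only the
# tensors inside a nested output (tuple/dict/tensor), not module weights.

class T:  # toy tensor carrying just a device attribute
    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        return T(device)

def send_to_device(obj, device):
    if isinstance(obj, T):
        return obj.to(device)
    if isinstance(obj, (list, tuple)):
        return type(obj)(send_to_device(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: send_to_device(v, device) for k, v in obj.items()}
    return obj  # non-tensor leaves are left untouched

# A layer output shaped like (hidden_states, extras-dict):
out = send_to_device((T("cuda"), {"hidden": T("cuda"), "step": 3}), "cpu")
print(out[0].device, out[1]["hidden"].device, out[1]["step"])
```

Moving a module's weights themselves is a separate operation (what set_module_tensor_to_device is used for, one tensor at a time), which is why the two functions show up in different places in a trace.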