hatchetProject / QuEST

QuEST: Efficient Finetuning for Low-bit Diffusion Models

Model Initialization Time Inquiry #5

Closed · mason5957 closed this issue 5 months ago

mason5957 commented 5 months ago

Thank you for your work. However, is it normal for the model initialization process to take up to an hour?


hatchetProject commented 5 months ago

Hi, it depends on the model size and the CPU you run on. Taking up to an hour for initialization is normal, due to the parameter search used for channel-wise initialization.
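
For intuition, the search is roughly of this shape; a minimal sketch with illustrative names, not the exact QuEST code. Each output channel gets its own grid search over candidate scales, which is why the cost grows with model size:

```python
import torch

def search_channelwise_scales(w: torch.Tensor, n_bits: int = 4, n_candidates: int = 80):
    """Per-channel grid search for weight quantization scales (illustrative sketch)."""
    qmax = 2 ** (n_bits - 1) - 1
    best_scales = torch.empty(w.shape[0])
    for c in range(w.shape[0]):                      # one search per output channel
        w_c = w[c].flatten()
        base = w_c.abs().max() / qmax                # scale covering the full range
        best_err, best_s = float("inf"), base
        for k in range(1, n_candidates + 1):         # try progressively smaller scales
            s = base * k / n_candidates
            w_q = torch.clamp((w_c / s).round(), -qmax - 1, qmax) * s
            err = (w_q - w_c).pow(2).mean().item()   # quantization MSE for this scale
            if err < best_err:
                best_err, best_s = err, s
        best_scales[c] = best_s
    return best_scales
```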

mason5957 commented 5 months ago

@hatchetProject Understood, thank you for your response.

However, I encountered an error while performing the ImageNet calibration, specifically during the block reconstruction phase. Could you please advise on how to resolve this issue?

```
04/08/2024 06:18:02 - INFO - qdiff.layer_recon - Total loss: 0.817 (rec:0.817, round:0.000) b=2.00 count=20000
04/08/2024 06:18:02 - INFO - main - transformer_blocks False
04/08/2024 06:18:02 - INFO - main - 0 True
04/08/2024 06:18:02 - INFO - main - Reconstruction for block 0 cond True
04/08/2024 06:18:02 - INFO - qdiff.block_recon - Saving 10 intermediate results to disk to avoid OOM
100%|████████████████████| 25/25 [00:01<00:00, 19.95it/s]
04/08/2024 06:18:04 - INFO - qdiff.utils - in 1 shape: torch.Size([200, 1024, 384]), in 2 shape: torch.Size([200, 1, 512])
04/08/2024 06:18:04 - INFO - qdiff.utils - out shape: torch.Size([200, 1024, 384])
04/08/2024 06:18:07 - INFO - qdiff.block_recon - Saving 10 intermediate results to disk to avoid OOM
...
```

```
Traceback (most recent call last):
  File "sample_diffusion_ldm_imagenet.py", line 596, in <module>
    recon_model(qnn)
  File "sample_diffusion_ldm_imagenet.py", line 592, in recon_model
    recon_model(module)
  File "sample_diffusion_ldm_imagenet.py", line 592, in recon_model
    recon_model(module)
  File "sample_diffusion_ldm_imagenet.py", line 592, in recon_model
    recon_model(module)
  [Previous line repeated 2 more times]
  File "sample_diffusion_ldm_imagenet.py", line 590, in recon_model
    block_reconstruction(qnn, module, **kwargs)
  File "/home/test01/test/QuEST/qdiff/block_recon.py", line 165, in block_reconstruction
    err.backward(retain_graph=True)
  File "/home/test01/miniconda3/envs/EDA-DM/lib/python3.8/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/test01/miniconda3/envs/EDA-DM/lib/python3.8/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/test01/miniconda3/envs/EDA-DM/lib/python3.8/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/home/test01/test/QuEST/ldm/modules/diffusionmodules/util.py", line 132, in backward
    ctx.input_tensors = [x.detach().requires_grad_(True) for x in ctx.input_tensors]
  File "/home/test01/test/QuEST/ldm/modules/diffusionmodules/util.py", line 132, in <listcomp>
    ctx.input_tensors = [x.detach().requires_grad_(True) for x in ctx.input_tensors]
AttributeError: 'numpy.int64' object has no attribute 'detach'
```

Additionally, I noticed that in quant_model.py, setattr is called twice, which seems a bit unusual. Could you please review this?

```python
def quant_block_refactor(self, module, weight_quant_params, act_quant_params, timewise, list_timesteps):
    for name, child_module in module.named_children():
        if type(child_module) in self.specials:
            if self.specials[type(child_module)] in [QuantBasicTransformerBlock]:
                setattr(module, name, self.specials[type(child_module)](
                    child_module, act_quant_params,
                    sm_abit=self.sm_abit, timewise=timewise, list_timesteps=list_timesteps))
                # this second call immediately overwrites the block created just above
                setattr(module, name, self.specials[type(child_module)](child_module, act_quant_params))
        else:
            self.quant_block_refactor(child_module, weight_quant_params, act_quant_params, timewise, list_timesteps)
```
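
If the second call is meant as a fallback for the other special block types, I would have expected something like this instead (just my guess at the intent, not code from the repo):

```python
def quant_block_refactor(self, module, weight_quant_params, act_quant_params, timewise, list_timesteps):
    for name, child_module in module.named_children():
        if type(child_module) in self.specials:
            if self.specials[type(child_module)] in [QuantBasicTransformerBlock]:
                # timewise transformer blocks need the extra timestep arguments
                setattr(module, name, self.specials[type(child_module)](
                    child_module, act_quant_params,
                    sm_abit=self.sm_abit, timewise=timewise, list_timesteps=list_timesteps))
            else:
                # other special blocks presumably take only the activation-quant params
                setattr(module, name, self.specials[type(child_module)](child_module, act_quant_params))
        else:
            self.quant_block_refactor(child_module, weight_quant_params, act_quant_params, timewise, list_timesteps)
```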

Thank you very much!!

hatchetProject commented 5 months ago

Hi, I haven't encountered this issue before. Based on your message, I suspect it originates from the "checkpoint" usage in the QuantResBlock() class; you can check the data types of the inputs to make sure they are correct (torch tensors instead of numpy). In case it is a package version issue: I am using torch 1.13.1 and timm 0.4.12.
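
For example, you could log the types of the positional arguments right before they go into the checkpoint call (an illustrative helper, not part of the repo):

```python
import torch

def report_non_tensor_inputs(args):
    """Print any non-tensor positional inputs before they reach CheckpointFunction."""
    for i, x in enumerate(args):
        if not isinstance(x, torch.Tensor):
            # e.g. a numpy.int64 timestep index sneaking in would show up here
            print(f"arg {i} is {type(x).__name__}: {x!r}")
```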

Thanks for pointing out the `setattr` error. I have updated the qdiff/quant_model.py file; please use the new one :)

mason5957 commented 5 months ago

@hatchetProject Hello, I have a question about whether the line `err += loss_func(activation[k], activation_fp[k][i*head:(i+1)*head].cuda())` in post_layer_recon_imagenet.py needs to be uncommented.

P.S. This line is commented out in post_layer_recon_uncond.py. Thank you very much!!


hatchetProject commented 5 months ago

No, it doesn't need to be uncommented in pd_optimize_timewise(); you can comment out the line above it (line 103) as well. Typically, including it or not does not make much difference. The line is only uncommented for Stable Diffusion, to provide finer-grained alignment.
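
For context, that line just accumulates an extra alignment loss between the quantized activations and slices of the cached full-precision activations. A toy, self-contained illustration of the slicing pattern (all names here are stand-ins, not the script's variables):

```python
import torch

loss_func = torch.nn.MSELoss()
head = 8  # slice width (illustrative)

# stand-ins for one quantized activation and its cached full-precision counterpart
activation_k = torch.randn(head, 16)
activation_fp_k = torch.randn(4 * head, 16)  # holds 4 slices of size `head`

err = torch.tensor(0.0)
for i in range(4):
    # align slice i of the FP cache with the quantized activation
    err += loss_func(activation_k, activation_fp_k[i * head:(i + 1) * head])
print(err)
```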

mason5957 commented 5 months ago

Got it, thanks a lot.

cantbebetter2 commented 4 months ago

@mason5957 Hi, I also encountered the problem with

AttributeError: 'numpy.int64' object has no attribute 'detach'

you mentioned above. How did you fix this bug?

mason5957 commented 4 months ago

@cantbebetter2 Hi, I changed the class CheckpointFunction in ldm/modules/diffusionmodules/util.py at line 190 to:

```python
class CheckpointFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, run_function, length, *args):
        ctx.run_function = run_function
        ctx.input_tensors = list(args[:length])
        ctx.input_params = list(args[length:])
        with torch.no_grad():
            output_tensors = ctx.run_function(*ctx.input_tensors)
        return output_tensors

    @staticmethod
    def backward(ctx, *output_grads):
        # ctx.input_tensors may contain NumPy values as well as PyTorch tensors;
        # convert any NumPy entries to tensors before detach() is called on them
        ctx.input_tensors = [
            torch.tensor(x, dtype=torch.float32).requires_grad_(True)
            if isinstance(x, (np.ndarray, np.generic)) else x
            for x in ctx.input_tensors
        ]
        ctx.input_tensors = [x.detach().requires_grad_(True) for x in ctx.input_tensors]
        with torch.enable_grad():
            # Fixes a bug where the first op in run_function modifies the
            # Tensor storage in place, which is not allowed for detach()'d
            # Tensors.
            shallow_copies = [x.view_as(x) for x in ctx.input_tensors]
            output_tensors = ctx.run_function(*shallow_copies)
        input_grads = torch.autograd.grad(
            output_tensors,
            ctx.input_tensors + ctx.input_params,
            output_grads,
            allow_unused=True,
        )
        del ctx.input_tensors
        del ctx.input_params
        del output_tensors
        return (None, None) + input_grads
```

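The core of it is just converting the stray NumPy values before detach() is called; a minimal standalone illustration of the failure and the conversion:

```python
import numpy as np
import torch

x = np.int64(5)
# x.detach()  # would raise: AttributeError: 'numpy.int64' object has no attribute 'detach'

# converting first gives a tensor that supports detach()/requires_grad_():
t = torch.tensor(x, dtype=torch.float32).requires_grad_(True)
print(t.detach().requires_grad_(True))  # tensor(5., requires_grad=True)
```
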
cantbebetter2 commented 4 months ago

@mason5957 Thanks a lot! I was surprised that you responded so quickly; it definitely solved my problem. By the way, may I ask how much time one PTQ calibration takes? I successfully ran the script for ImageNet without pd_optimize_timeembed and pd_optimize_timewise, and it already takes much longer than other PTQ methods.

mason5957 commented 4 months ago

@cantbebetter2 Not at all. For me, the entire process typically takes about 2 days to complete.

hatchetProject commented 4 months ago

@mason5957 @cantbebetter2

The time cost is mainly in the reconstruction stage (for the weights). Typically we just load pre-computed weight quantization parameters and then measure the time cost and performance. It is also valid to skip the reconstruction stage entirely, sometimes with negligible performance degradation.
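
For example, along these lines (a sketch with a hypothetical checkpoint path; `qnn` stands for the quantized model built in the sampling scripts):

```python
import torch

def save_quant_params(qnn: torch.nn.Module, path: str = "weight_quant_params.pth"):
    """Cache the searched weight-quant parameters after reconstruction runs once."""
    torch.save(qnn.state_dict(), path)

def load_quant_params(qnn: torch.nn.Module, path: str = "weight_quant_params.pth"):
    """Restore the searched parameters so later runs can skip reconstruction."""
    qnn.load_state_dict(torch.load(path))
```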

If you run without pd_optimize_timeembed and pd_optimize_timewise (i.e., w8a8 and w4a8), it is essentially Q-Diffusion without activation reconstruction, which should not take more time than Q-Diffusion does.