neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

[Fix] Resolve Errors When Loading Multiple Quantized Models #2227

Closed rahul-tuli closed 7 months ago

rahul-tuli commented 7 months ago

Description

A bug was discovered on the main branch where loading two or more quantized models sequentially with SparseAutoModelForCausalLM.from_pretrained(...) led to errors. This occurred because the active_session was not reset between loads, so state left over from the first load conflicted with the recipe applied during the next one.
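The following toy sketch (not SparseML's real internals; all names here are placeholders) illustrates the failure mode: a module-level session accumulates every recipe it is given, so the second load ends up applying two QuantizationModifiers to one model and trips the "already been quantized" check shown in the traceback below.

# Toy illustration of a global session that is never reset between loads.
_active_session_recipes = []

def load_quantized_model(name):
    # Each load adds its own quantization recipe to the shared session ...
    _active_session_recipes.append("QuantizationModifier")
    already_quantized = False
    # ... and then applies every recipe currently held by the session.
    for _modifier in _active_session_recipes:
        if already_quantized:
            raise RuntimeError(f"{name}: module has already been quantized")
        already_quantized = True
    return name

load_quantized_model("m1")  # ok: one modifier in the session
load_quantized_model("m2")  # fails: the stale modifier is applied again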

Proposed Fix

The fix wraps each recipe application in session_context_manager(), ensuring the active_session is reset before each model's recipe is applied. This isolates each model load from the previous one and prevents the errors described above.
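As a rough sketch of this pattern (placeholder names, not the actual PR diff): the context manager installs a fresh session for the duration of one model's recipe application and restores the previous one afterwards, so nothing leaks between loads.

from contextlib import contextmanager

class Session:
    def __init__(self):
        self.recipes = []

_active_session = Session()

@contextmanager
def session_context_manager():
    # Swap in a clean session before the recipe is applied, then restore the
    # previous one, so one model's modifiers never leak into the next load.
    global _active_session
    previous = _active_session
    _active_session = Session()
    try:
        yield _active_session
    finally:
        _active_session = previous

def apply_recipe(model, recipe):
    with session_context_manager() as session:
        session.recipes.append(recipe)
        # ... structure/quantization modifiers would be applied to `model` here ...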

Testing

Verified the fix by loading multiple quantized models sequentially, which previously failed. With session_context_manager() wrapping each recipe application, both loads now succeed.

Test script:

# Load the same quantized model twice in a row; before this fix the second
# from_pretrained(...) call raised the RuntimeError shown below.
import sparseml.core.session as session_manager
from sparseml.transformers import SparseAutoModelForCausalLM

model_path = "mgoin/llama2.c-stories15M-quant-pt"

m1 = SparseAutoModelForCausalLM.from_pretrained(model_path)
m2 = SparseAutoModelForCausalLM.from_pretrained(model_path)

Before the fix:

"/nm/drive0/rahul/projects/sparseml/src/sparseml/modifiers/quantization/utils/quantize.py", line 189, in set_quantization_schemes
    raise_if_already_quantized(submodule_name, submodule)
  File "/nm/drive0/rahul/projects/sparseml/src/sparseml/modifiers/quantization/utils/quantize.py", line 386, in raise_if_already_quantized
    raise RuntimeError(
RuntimeError: Unable to quantize module model.embed_tokens, as it has already been quantized. Ensure your input recipe does not contain multiple QuantizationModifiers that act on the same module.

After the fix:

2024-04-05 18:52:04 sparseml.pytorch.model_load.helpers INFO     Reloaded model state after SparseML recipe structure modifications from /home/rahul/.cache/huggingface/hub/models--mgoin--llama2.c-stories15M-quant-pt/snapshots/aa70fc9dc46615b68f935fb5405ae7875b88b716
rahul-tuli commented 7 months ago

I think this is going to break recipe export, and probably a lot of other things. The session is meant to persist through the lifecycle of a model so that, when we save the recipe, all previously applied modifiers persist. It was never really designed to work with multiple models at a time; if this is something that is important to support, we need to be more careful about it. Does using the session context manager in the user script not work here?

My bad, you are correct. An alternative would be to wrap each model creation call within this session context manager, but I don't think that's any better than calling active_session().reset(). @mgoin, do you think adding an optional flag like reset_session to the from_pretrained(...) method would be nicer? If not, I'll close this PR and we'll think of a nicer approach if this turns out to be an important use case.
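For reference, a hedged sketch of the user-side workarounds discussed here, assuming active_session() is exposed from sparseml.core.session as in the test script's import; the reset_session flag is only a proposal and does not exist.

import sparseml.core.session as session_manager
from sparseml.transformers import SparseAutoModelForCausalLM

model_path = "mgoin/llama2.c-stories15M-quant-pt"

m1 = SparseAutoModelForCausalLM.from_pretrained(model_path)

# Option A: reset the global session before the next load.
session_manager.active_session().reset()
m2 = SparseAutoModelForCausalLM.from_pretrained(model_path)

# Option B (proposed above, not implemented): an optional flag that performs
# the same reset inside from_pretrained, e.g.
#   SparseAutoModelForCausalLM.from_pretrained(model_path, reset_session=True)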

rahul-tuli commented 7 months ago

Closing this out for now, awaiting response from product!

mgoin commented 7 months ago

@rahul-tuli @Satrat Loading 2 quantized/modified models is a basic usage of our library imo, and it will be easy for users to trip on this in long-lived notebook, distillation, or model comparison scenarios. At a high level, I think it is reasonable to assume that model = SparseAutoModelForCausalLM.from_pretrained(model_path) has no global side effects, which doesn't seem to currently be the case based on this issue. Maybe attaching the session to the model as a member variable, rather than relying on a global session, would help get past this. I don't know the whole architecture of why the session is used like this currently, but we should continue to discuss it if it can't bend easily to support this use case. FYI @robertgshaw2-neuralmagic @bfineran
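A minimal sketch of that idea, with placeholder names rather than SparseML's actual classes: each loaded model owns its own session object, so two loads cannot interfere through shared global state.

class Session:
    """Per-model state: the recipes/modifiers applied so far."""
    def __init__(self):
        self.recipes = []

class LoadedModel:
    """Stand-in for the object a from_pretrained-style call would return."""
    def __init__(self):
        self.session = Session()  # owned by the model, not a module-level global

def from_pretrained_sketch(model_path, recipe="QuantizationModifier"):
    model = LoadedModel()
    model.session.recipes.append(recipe)
    # ... modifiers would be applied here using model.session only ...
    return model

m1 = from_pretrained_sketch("model-a")
m2 = from_pretrained_sketch("model-b")
assert m1.session is not m2.session  # no shared global side effects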

robertgshaw2-neuralmagic commented 7 months ago

I agree with @mgoin