[BUG] Azure Databricks disk_offload error

vitaliy-sharandin commented 7 months ago

Issues Policy acknowledgement

[X] I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Azure Databricks

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

mlflow==2.12.1

System information

14.3 ML cluster on Azure Databricks
accelerate==0.29.3
peft==0.10.0
torch==2.3.0
torchvision==0.18.0
transformers==4.41.0.dev0

Describe the problem

I get a disk_offload error whenever I try to register a model in Unity Catalog.

Code to reproduce issue

import mlflow

catalog = "model_registry"
schema = "default"
model_name = "psy-ai"
mlflow.set_registry_uri("databricks-uc")
mlflow.register_model(
    model_uri="runs:/bcae496e43a0496da7a6eafe0ab569d8/NousResearch/Meta-Llama-3-8B-Instruct-peft-trained",
    name=f"{catalog}.{schema}.{model_name}"
)

Stack trace

MlflowException: Failed to download the model weights from the HuggingFace hub and cannot register the model in the Unity Catalog. Please ensure that the model was saved with the correct reference to the HuggingFace hub repository and that you have access to fetch model weights from the defined repository.
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/store/_unity_catalog/registry/rest_store.py:627, in UcModelRegistryStore._download_model_weights_if_not_saved(self, local_model_path)
    626 try:
--> 627     mlflow.transformers.persist_pretrained_model(local_model_path)
    628 except Exception as e:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/__init__.py:1080, in persist_pretrained_model(model_uri)
   1079 local_model_path = artifact_repo.download_artifacts(artifact_path, dst_path=tmp_dir.path())
-> 1080 pipeline = load_model(local_model_path, return_type="pipeline")
   1082 # Update MLModel flavor config
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/utils/docstring_utils.py:379, in docstring_version_compatibility_warning.<locals>.annotated_func.<locals>.version_func(*args, **kwargs)
    378     warnings.warn(notice, category=FutureWarning, stacklevel=2)
--> 379 return func(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/__init__.py:1023, in load_model(model_uri, dst_path, return_type, device, **kwargs)
   1021 _add_code_from_conf_to_system_path(local_model_path, flavor_config)
-> 1023 return _load_model(local_model_path, flavor_config, return_type, device, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/__init__.py:1211, in _load_model(path, flavor_config, return_type, device, **kwargs)
   1210 if peft_adapter_dir := flavor_config.get(FlavorKey.PEFT, None):
-> 1211     model_and_components[FlavorKey.MODEL] = get_model_with_peft_adapter(
   1212         base_model=model_and_components[FlavorKey.MODEL],
   1213         peft_adapter_path=os.path.join(path, peft_adapter_dir),
   1214     )
   1216 conf = {**conf, **model_and_components}
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/peft.py:50, in get_model_with_peft_adapter(base_model, peft_adapter_path)
     48 from peft import PeftModel
---> 50 return PeftModel.from_pretrained(base_model, peft_adapter_path)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/peft/peft_model.py:356, in PeftModel.from_pretrained(cls, model, model_id, adapter_name, is_trainable, config, **kwargs)
    355     model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config, adapter_name)
--> 356 model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
    357 return model
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/peft/peft_model.py:760, in PeftModel.load_adapter(self, model_id, adapter_name, is_trainable, **kwargs)
    757     device_map = infer_auto_device_map(
    758         self, max_memory=max_memory, no_split_module_classes=no_split_module_classes
    759     )
--> 760 dispatch_model(
    761     self,
    762     device_map=device_map,
    763     offload_dir=offload_dir,
    764     **dispatch_model_kwargs,
    765 )
    766 hook = AlignDevicesHook(io_same_device=True)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/accelerate/big_modeling.py:490, in dispatch_model(model, device_map, main_device, state_dict, offload_dir, offload_index, offload_buffers, skip_keys, preload_module_classes, force_hooks)
    489     else:
--> 490         raise ValueError(
    491             "You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead."
    492         )
    493 # Convert OrderedDict back to dict for easier usage
ValueError: You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead.

The above exception was the direct cause of the following exception:
MlflowException                           Traceback (most recent call last)
File <command-3799569696053756>, line 5
      3 model_name = "psy-ai"
      4 mlflow.set_registry_uri("databricks-uc")
----> 5 mlflow.register_model(
      6     model_uri="runs:/bcae496e43a0496da7a6eafe0ab569d8/NousResearch/Meta-Llama-3-8B-Instruct-peft-trained",
      7     name=f"{catalog}.{schema}.{model_name}"
      8 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/_model_registry/fluent.py:77, in register_model(model_uri, name, await_registration_for, tags)
     17 def register_model(
     18     model_uri,
     19     name,
   (...)
     22     tags: Optional[Dict[str, Any]] = None,
     23 ) -> ModelVersion:
     24     """Create a new model version in model registry for the model files specified by ``model_uri``.
     25 
     26     Note that this method assumes the model registry backend URI is the same as that of the
   (...)
     75         Version: 1
     76     """
---> 77     return _register_model(
     78         model_uri=model_uri, name=name, await_registration_for=await_registration_for, tags=tags
     79     )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/_model_registry/fluent.py:112, in _register_model(model_uri, name, await_registration_for, tags, local_model_path)
    109     source = RunsArtifactRepository.get_underlying_uri(model_uri)
    110     (run_id, _) = RunsArtifactRepository.parse_runs_uri(model_uri)
--> 112 create_version_response = client._create_model_version(
    113     name=name,
    114     source=source,
    115     run_id=run_id,
    116     tags=tags,
    117     await_creation_for=await_registration_for,
    118     local_model_path=local_model_path,
    119 )
    120 eprint(
    121     f"Created version '{create_version_response.version}' of model "
    122     f"'{create_version_response.name}'."
    123 )
    124 return create_version_response
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/client.py:2861, in MlflowClient._create_model_version(self, name, source, run_id, tags, run_link, description, await_creation_for, local_model_path)
   2853     # NOTE: we can't easily delete the target temp location due to the async nature
   2854     # of the model version creation - printing to let the user know.
   2855     eprint(
   2856         f"=== Source model files were copied to {new_source}"
   2857         + " in the model registry workspace. You may want to delete the files once the"
   2858         + " model version is in 'READY' status. You can also find this location in the"
   2859         + " `source` field of the created model version. ==="
   2860     )
-> 2861 return self._get_registry_client().create_model_version(
   2862     name=name,
   2863     source=new_source,
   2864     run_id=run_id,
   2865     tags=tags,
   2866     run_link=run_link,
   2867     description=description,
   2868     await_creation_for=await_creation_for,
   2869     local_model_path=local_model_path,
   2870 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/_model_registry/client.py:215, in ModelRegistryClient.create_model_version(self, name, source, run_id, tags, run_link, description, await_creation_for, local_model_path)
    213 arg_names = _get_arg_names(self.store.create_model_version)
    214 if "local_model_path" in arg_names:
--> 215     mv = self.store.create_model_version(
    216         name,
    217         source,
    218         run_id,
    219         tags,
    220         run_link,
    221         description,
    222         local_model_path=local_model_path,
    223     )
    224 else:
    225     # Fall back to calling create_model_version without
    226     # local_model_path since old model registry store implementations may not
    227     # support the local_model_path argument.
    228     mv = self.store.create_model_version(name, source, run_id, tags, run_link, description)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/store/_unity_catalog/registry/rest_store.py:714, in UcModelRegistryStore.create_model_version(self, name, source, run_id, tags, run_link, description, local_model_path)
    712 with self._local_model_dir(source, local_model_path) as local_model_dir:
    713     self._validate_model_signature(local_model_dir)
--> 714     self._download_model_weights_if_not_saved(local_model_dir)
    715     feature_deps = get_feature_dependencies(local_model_dir)
    716     other_model_deps = get_model_version_dependencies(local_model_dir)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/store/_unity_catalog/registry/rest_store.py:629, in UcModelRegistryStore._download_model_weights_if_not_saved(self, local_model_path)
    627     mlflow.transformers.persist_pretrained_model(local_model_path)
    628 except Exception as e:
--> 629     raise MlflowException(
    630         "Failed to download the model weights from the HuggingFace hub and cannot register "
    631         "the model in the Unity Catalog. Please ensure that the model was saved with the "
    632         "correct reference to the HuggingFace hub repository and that you have access to "
    633         "fetch model weights from the defined repository.",
    634         error_code=INTERNAL_ERROR,
    635     ) from e
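
Note on reading this trace: the MlflowException message is a wrapper. As the rest_store.py frame shows, any exception raised by persist_pretrained_model is re-raised with `raise ... from e`, so the underlying failure here is the ValueError from accelerate rather than a download problem. Below is a minimal sketch of how such a chain can be unwrapped programmatically. WrapperError, register, and root_cause are hypothetical stand-ins for illustration, not MLflow code.

```python
class WrapperError(Exception):
    """Hypothetical stand-in for a wrapping exception such as MlflowException."""


def root_cause(exc: BaseException) -> BaseException:
    """Follow __cause__ / __context__ links to the innermost exception."""
    seen = set()
    while True:
        nxt = exc.__cause__ or exc.__context__
        # Stop at the end of the chain, or if the chain loops back on itself.
        if nxt is None or id(nxt) in seen:
            return exc
        seen.add(id(exc))
        exc = nxt


def register():
    # Mimics the pattern in the traceback: catch everything, re-raise wrapped.
    try:
        raise ValueError("You are trying to offload the whole model to the disk.")
    except Exception as e:
        raise WrapperError("Failed to download the model weights") from e


try:
    register()
except WrapperError as err:
    cause = root_cause(err)
    print(type(cause).__name__, "-", cause)
```

In a notebook, the same helper could be applied to the exception caught around mlflow.register_model to surface the original error first.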

harupy commented 6 months ago

@vitaliy-sharandin Thanks for reporting this. Could you share your model logging code?

harupy commented 6 months ago

I ran the following code but could not reproduce the error:

%pip install -U git+https://github.com/huggingface/transformers torch accelerate==0.29.3 mlflow

dbutils.library.restartPython()

########

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    token="...",
)

import mlflow

mlflow.set_registry_uri("databricks-uc")

with mlflow.start_run() as run:
    mlflow.transformers.log_model(pipeline, "model")

mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name=f"..."
)

vitaliy-sharandin commented 6 months ago

The main difference between your code and mine is that I first fine-tune adapters with peft and then try to register a run that has only the adapters and a reference to the base model saved, without the model weights. I have also read the MLflow Transformers guide, which says that you don't need to call mlflow.transformers.persist_pretrained_model() when registering a model to Unity Catalog, so my code should work, since that is exactly what I am doing.

Here is my notebook: https://github.com/vitaliy-sharandin/data_science_projects/blob/master/portfolio/nlp/fine-tuned-llm/psy_ai_mlflow_tracking_deployment.ipynb

harupy commented 6 months ago

Thanks for the notebook! Let me run it and see if I can reproduce the issue.

harupy commented 6 months ago

@vitaliy-sharandin Can you try inserting this code before loading the model to see if it can fix the error?

def get_model_with_peft_adapter(base_model, peft_adapter_path):
    from peft import PeftModel

    return PeftModel.from_pretrained(base_model, peft_adapter_path, offload_folder="offload")

mlflow.transformers.get_model_with_peft_adapter = get_model_with_peft_adapter

Not sure if offload_folder is the only way to fix this issue, but I want to give it a try.
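
For context on why the assignment targets mlflow.transformers and not mlflow.transformers.peft: the traceback shows the helper being called from _load_model in mlflow/transformers/__init__.py, so the call resolves the name in that module's namespace. The sketch below illustrates this lookup behaviour with stand-in modules. The names helper, package, get_adapter, and load are hypothetical and are not MLflow APIs.

```python
import types

# Hypothetical stand-in for a helper module (like mlflow/transformers/peft.py).
helper = types.ModuleType("helper")
exec(
    "def get_adapter(base, path):\n"
    "    return ('original', base, path)\n",
    helper.__dict__,
)

# Hypothetical stand-in for the package module that imported the helper by name
# and calls it from its own loader function.
package = types.ModuleType("package")
package.get_adapter = helper.get_adapter  # same effect as `from helper import get_adapter`
exec(
    "def load(base, path):\n"
    "    return get_adapter(base, path)\n",
    package.__dict__,
)


def patched(base, path):
    # Replacement that passes an extra (illustrative) offload setting.
    return ("patched", base, path, "offload")


# Patching the helper module does not change what package.load() resolves,
# because load() looks the name up in package's own globals.
helper.get_adapter = patched
result_after_helper_patch = package.load("base", "adapter")

# Patching the name in the calling module's namespace does take effect.
package.get_adapter = patched
result_after_package_patch = package.load("base", "adapter")

print(result_after_helper_patch[0], result_after_package_patch[0])
```

The same reasoning suggests the override needs to be in place before mlflow.register_model runs, since that is the call that ends up invoking the helper.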

vitaliy-sharandin commented 6 months ago

That doesn't quite make sense to me: I have no adapters to load before fine-tuning the model, so I have no value for the required peft_adapter_path argument.

github-actions[bot] commented 6 months ago

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

harupy commented 6 months ago

@vitaliy-sharandin the traceback says get_model_with_peft_adapter is called.

vitaliy-sharandin commented 6 months ago

@harupy Sorry, I misunderstood your code at first. I did what you proposed and it led to a new error; please check out the notebook.

vitaliy-sharandin commented 6 months ago

@harupy Any updates?
