kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Update `huggingface_hub` Version in the storage initializer to fix ImportError #2180

Closed: helenxie-bit closed this pull request 1 month ago

helenxie-bit commented 2 months ago

What this PR does / why we need it: `split_torch_state_dict_into_shards` is not available in `huggingface_hub` v0.19.3, so importing it fails. This PR updates the `huggingface_hub` version in the storage initializer's requirements.txt to fix the ImportError.
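A rough way to see which pins in a requirements file disagree with what is actually installed is sketched below. The helper names `parse_pins` and `find_mismatches` are hypothetical, the parser only handles plain `name==version` lines, and the sample text in the usage note is illustrative rather than a copy of the real requirements.txt.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Map normalized package name -> pinned version for `name==version` lines."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue  # skip blank lines and unpinned requirements
        name, version = line.split("==", 1)
        pins[name.strip().lower().replace("_", "-")] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (pinned, installed)} for every pin that is not satisfied."""
    return {
        name: (pinned, lookup(name))
        for name, pinned in pins.items()
        if lookup(name) != pinned
    }
```

For example, `find_mismatches(parse_pins("peft==0.3.0\nhuggingface_hub==0.19.3"))` run inside the storage initializer image would list any package whose installed version differs from its pin.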

Which issue(s) this PR fixes: Fixes #2179

coveralls commented 2 months ago

Pull Request Test Coverage Report for Build 10026064625

Totals
Coverage: 35.4%
Base Build: 9999203579
Covered Lines: 4377
Relevant Lines: 12365

💛 - Coveralls
helenxie-bit commented 2 months ago

> It seems that this error is irrelevant to the huggingface_hub version. Which peft version do you use in your local?
>
> I guess that your local peft version is newer than v0.3.0:
>
> https://github.com/kubeflow/training-operator/blob/f55a91d03f23498cdb465ac26c78566228077c51/sdk/python/kubeflow/storage_initializer/requirements.txt#L1

@tenzen-y My local peft version is 0.3.0:

Name: peft
Version: 0.3.0
Summary: Parameter-Efficient Fine-Tuning (PEFT)
Home-page: https://github.com/huggingface/peft
Author: The HuggingFace team
Author-email: sourab@huggingface.co
License: Apache
Location: /opt/homebrew/anaconda3/envs/kubeflow/lib/python3.11/site-packages
Requires: accelerate, numpy, packaging, psutil, pyyaml, torch, transformers
Required-by: 

And here is the detailed information of the error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/storage_initializer/storage.py", line 2, in <module>
    from .hugging_face import HuggingFace, HuggingFaceDataset
  File "/app/storage_initializer/hugging_face.py", line 8, in <module>
    from peft import LoraConfig
  File "/usr/local/lib/python3.11/site-packages/peft/__init__.py", line 22, in <module>
    from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PEFT_TYPE_TO_CONFIG_MAPPING, get_peft_config, get_peft_model
  File "/usr/local/lib/python3.11/site-packages/peft/mapping.py", line 16, in <module>
    from .peft_model import (
  File "/usr/local/lib/python3.11/site-packages/peft/peft_model.py", line 22, in <module>
    from accelerate import dispatch_model, infer_auto_device_map
  File "/usr/local/lib/python3.11/site-packages/accelerate/__init__.py", line 16, in <module>
    from .accelerator import Accelerator
  File "/usr/local/lib/python3.11/site-packages/accelerate/accelerator.py", line 34, in <module>
    from huggingface_hub import split_torch_state_dict_into_shards
ImportError: cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub' (/usr/local/lib/python3.11/site-packages/huggingface_hub/__init__.py)

Do you have any idea where the problem could be?
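Note that the `pip show` output above points at a local conda environment, while the traceback paths are under `/usr/local/lib/python3.11` and `/app`, so the two may describe different environments. A minimal diagnostic sketch, meant to be run in whichever environment raises the error, is shown below. The helper names `installed_version` and `exposes` are hypothetical; only standard-library calls are used.

```python
import importlib
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def exposes(module_name, symbol):
    """Return True/False if the module imports, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, symbol)


if __name__ == "__main__":
    # Print the versions of the packages that appear in the traceback.
    for dist in ("peft", "accelerate", "huggingface_hub"):
        print(dist, installed_version(dist))
    # Check for the exact name that accelerate tries to import.
    print(exposes("huggingface_hub", "split_torch_state_dict_into_shards"))
```

If the last line prints `False`, the installed `huggingface_hub` does not provide the function that the installed `accelerate` expects, which would match the ImportError in the traceback.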

andreyvelich commented 2 months ago

@tenzen-y @helenxie-bit I'm getting the same error on my side with huggingface_hub==0.19.3. I think this update may be related: https://github.com/kubeflow/training-operator/pull/2056.

@tenzen-y @johnugeorge @deepanker13 Should we move this forward to fix the errors in the train API?

Additionally, @helenxie-bit, if you could help us with some simple e2e tests for the train API, that would be amazing!

helenxie-bit commented 2 months ago

Yeah, of course. I can help with the e2e tests.

tenzen-y commented 2 months ago

> @tenzen-y @helenxie-bit I'm getting the same error on my side with huggingface_hub==0.19.3. I think this update may be related: #2056.
>
> @tenzen-y @johnugeorge @deepanker13 Should we move this forward to fix the errors in the train API?
>
> Additionally, @helenxie-bit, if you could help us with some simple e2e tests for the train API, that would be amazing!

SGTM

google-oss-prow[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[sdk/python/OWNERS](https://github.com/kubeflow/training-operator/blob/master/sdk/python/OWNERS)~~ [andreyvelich]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.