'hdbscan' module not found; maybe use installed sklearn.cluster.HDBSCAN?

atomiechen commented 5 months ago

❓ Questions and Help

What is your question?

Starting from a fresh container environment equipped with pytorch and funasr (via pip install funasr), I encountered ModuleNotFoundError: No module named 'hdbscan' when I instantiate an AutoModel with a spk model. It originates from the import hdbscan statement in UmapHdbscan() <- ClusterBackend() <- AutoModel(...).

1. Must I install hdbscan manually? Is there any other package that I also need to install in advance?
   - I am building my own container image, and it is frustrating to find that I have to rebuild it. I see no hint in the output or the docs.
2. There is a sklearn.cluster.HDBSCAN, and sklearn is already installed alongside funasr. Can we just use the sklearn one instead of installing the standalone hdbscan package?
   - The two versions seem to come from the same authors and differ in some minor ways (see https://github.com/scikit-learn/scikit-learn/issues/27829).
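
The second question can be pictured as a guarded import that prefers the standalone package and falls back to scikit-learn. This is only a sketch: resolve_hdbscan_class is a hypothetical helper, not part of FunASR, and because the two implementations differ in minor ways (see the issue linked above), the fallback is not guaranteed to be a drop-in replacement. scikit-learn ships HDBSCAN from version 1.3 onward, so the helper checks for the attribute instead of assuming it.

```python
import importlib
import importlib.util


def resolve_hdbscan_class():
    """Return (source_name, HDBSCAN class), or (None, None) if neither is available.

    Hypothetical helper for illustration; FunASR does not provide it.
    """
    # Prefer the standalone package, which is what the traceback shows being imported.
    if importlib.util.find_spec("hdbscan") is not None:
        module = importlib.import_module("hdbscan")
        return "hdbscan", module.HDBSCAN

    # Fall back to scikit-learn; older releases have no HDBSCAN attribute.
    if importlib.util.find_spec("sklearn") is not None:
        cluster = importlib.import_module("sklearn.cluster")
        cls = getattr(cluster, "HDBSCAN", None)
        if cls is not None:
            return "sklearn.cluster", cls

    return None, None


source, hdbscan_cls = resolve_hdbscan_class()
print("HDBSCAN provider:", source)
```

Whether the clustering results match closely enough for speaker diarization would still need to be checked against the standalone package.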

Code

from funasr import AutoModel
model = AutoModel(
    model="iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch", model_revision="v2.0.4",
    vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch", vad_model_revision="v2.0.4",
    punc_model="iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch", punc_model_revision="v2.0.4",
    spk_model="iic/speech_campplus_sv_zh-cn_16k-common", spk_model_revision="v2.0.2",
)

What have you tried?

In a pytorch docker container, run pip install funasr and then the script above.

What's your environment?

OS (e.g., Linux):
FunASR Version (e.g., 1.0.0): 1.0.19
ModelScope Version (e.g., 1.11.0): None (do not need it)
PyTorch Version (e.g., 2.0.0): 2.2.2
How you installed funasr (pip, source): pip
Python version: 3.10.14
GPU (e.g., V100M32): NVIDIA GeForce RTX 4090
CUDA/cuDNN version (e.g., cuda11.7): cuda11.8
Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1): pytorch/pytorch:2.2.2-cuda11.8-cudnn8-runtime
Any other relevant information:

LauraGPT commented 5 months ago

Delete all *model_revision arguments and try again. All requirements will then be installed automatically.

atomiechen commented 5 months ago

Yes, thank you. But what I want to do is build an image with all packages installed before running any scripts. I believe I should not have to figure this out through trial and error.

LauraGPT commented 5 months ago

If there are any errors after you delete all *model_revision, please let me know.

atomiechen commented 5 months ago

Sadly yes.

I removed all *model_revision:

from funasr import AutoModel

model = AutoModel(
    model="iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch", 
    vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch", 
    punc_model="iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch", 
    spk_model="iic/speech_campplus_sv_zh-cn_16k-common", 
)

And I still got:

ckpt: iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/model.pt
ckpt: iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/model.pt
ckpt: iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/model.pt
ckpt: iic/speech_campplus_sv_zh-cn_16k-common/campplus_cn_common.bin
Traceback (most recent call last):
  File "/shared/test-funasr/tmp_test.py", line 10, in <module>
    spk_model="iic/speech_campplus_sv_zh-cn_16k-common", 
  File "/home/user/.local/lib/python3.10/site-packages/funasr/auto/auto_model.py", line 135, in __init__
    self.cb_model = ClusterBackend().to(kwargs["device"])
  File "/home/user/.local/lib/python3.10/site-packages/funasr/models/campplus/cluster_backend.py", line 149, in __init__
    self.umap_hdbscan_cluster = UmapHdbscan()
  File "/home/user/.local/lib/python3.10/site-packages/funasr/models/campplus/cluster_backend.py", line 118, in __init__
    import hdbscan
ModuleNotFoundError: No module named 'hdbscan'

FunASR Version: 1.0.19

And I cannot even import funasr using the latest commit (702b9b540c3c1524748cd975a10ce33f0fa53912) on the main branch:

>>> import funasr
/.../FunASR/funasr/datasets/large_datasets/utils/tokenize.py:93: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if vad is not -2:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../FunASR/funasr/__init__.py", line 36, in <module>
    import_submodules(__name__)
  File "/.../FunASR/funasr/__init__.py", line 33, in import_submodules
    results.update(import_submodules(name))
  File "/.../FunASR/funasr/__init__.py", line 33, in import_submodules
    results.update(import_submodules(name))
  File "/.../FunASR/funasr/__init__.py", line 33, in import_submodules
    results.update(import_submodules(name))
  File "/.../FunASR/funasr/__init__.py", line 25, in import_submodules
    for loader, name, is_pkg in pkgutil.walk_packages(package.__path__, package.__name__ + '.'):
AttributeError: 'str' object has no attribute '__path__'. Did you mean: '__hash__'?
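
The traceback suggests that package was still a string when package.__path__ was accessed. The following is a minimal sketch, not the project's actual fix, of a submodule importer that accepts either a module or a dotted name, resolves the name up front, and lets a failed import surface directly instead of continuing with a string. It reuses the function name from the traceback only for readability.

```python
import importlib
import pkgutil


def import_submodules(package):
    """Import every submodule of `package` and return {dotted_name: module}."""
    if isinstance(package, str):
        # Resolve the dotted name first. An ImportError propagates here, so the
        # real cause is visible instead of a later AttributeError on a str.
        package = importlib.import_module(package)

    results = {}
    # walk_packages already descends into subpackages, so no manual recursion is needed.
    for _finder, name, _is_pkg in pkgutil.walk_packages(
        package.__path__, package.__name__ + "."
    ):
        results[name] = importlib.import_module(name)
    return results
```

For example, import_submodules("logging") returns a dict that includes logging.handlers and logging.config, while a nonexistent package name raises ModuleNotFoundError at the first import.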

atomiechen commented 5 months ago

Plus: all my models are already inside a folder literally named iic in the current directory, so there are no extra downloads. The environment running the above script does not have modelscope installed.

Still worth mentioning: during the image building phase one should not use a test script like this to 'trigger' the automatic installation of extra dependencies, which is an anti-pattern. It needs explicit commands to prepare the environment, like pip install funasr[spk].
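
Until such an explicit install command exists, an image build can at least fail fast when a needed module is absent. This is a sketch under assumptions: missing_modules is a hypothetical helper, and the list contains only hdbscan, the one module the traceback names; any other entry would be a guess.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec locates a top-level module without importing it.
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Only hdbscan is known from the traceback; extend the list as needed.
REQUIRED = ["hdbscan"]

absent = missing_modules(REQUIRED)
if absent:
    print("Missing modules:", ", ".join(absent))
else:
    print("All required modules are available.")
```

Run as a build step, the script could exit with a nonzero status when the list is not empty, so the problem shows up at build time and not when the model is first instantiated.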

LauraGPT commented 5 months ago

FunASR Version: 1.0.19

You should run pip install -e .

atomiechen commented 5 months ago

I mean I tried both ways:

- pip install funasr to install the latest PyPI version (1.0.19)
- pip install -e . after pulling the latest commit of the main branch, which results in the above error.

LauraGPT commented 5 months ago

First run pip install -e ., then uncomment the line here so that the error is logged: https://github.com/alibaba-damo-academy/FunASR/blob/main/funasr/__init__.py#L21

LauraGPT commented 5 months ago

The bug has been fixed in https://github.com/alibaba-damo-academy/FunASR/pull/1580. Please update funasr:

git pull
pip install -e .

atomiechen commented 5 months ago

I pulled the latest commit, ran pip install -e . and uncommented the print (see screenshot), but still got the same output:

So there is no error reported here.

LauraGPT commented 5 months ago

Requirements are installed in https://github.com/alibaba-damo-academy/FunASR/blob/main/funasr/download/download_from_hub.py#L76

Maybe you could debug it and show the log.

atomiechen commented 5 months ago

The problem is that models from a previous revision (not master) were already downloaded into the iic folder. The code does not check for that and will not re-download the latest master revision, so there is no requirements.txt file in the campplus model folder.

atomiechen commented 5 months ago

I now understand that the requirements.txt comes from the model dir. Maybe a mechanism for automatically re-downloading the specified revision is needed?
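
A local check can make this situation visible before any model is loaded. The sketch below assumes only what this thread states, namely that the extra dependencies are listed in a requirements.txt inside the model directory; read_model_requirements is a hypothetical helper and does not reflect how FunASR itself reads the file.

```python
from pathlib import Path


def read_model_requirements(model_dir):
    """Return requirement lines from <model_dir>/requirements.txt, or None if the file is absent."""
    req_file = Path(model_dir) / "requirements.txt"
    if not req_file.is_file():
        # An older revision of the model may have been downloaded without this file.
        return None
    lines = req_file.read_text(encoding="utf-8").splitlines()
    # Skip blank lines and comments.
    return [ln.strip() for ln in lines if ln.strip() and not ln.strip().startswith("#")]


reqs = read_model_requirements("iic/speech_campplus_sv_zh-cn_16k-common")
if reqs is None:
    print("No requirements.txt found; the local copy may be an older revision.")
else:
    print("Model requirements:", reqs)
```

A return value of None is kept distinct from an empty list so that a missing file can be told apart from a model that declares no extra dependencies.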

❓ And also I wonder if this is possible:

2. There is a sklearn.cluster.HDBSCAN, and I find sklearn is already there with funasr installed. Can we just use that sklearn one instead of installing the standalone version hdbscan?
