line / LINE-DistilBERT-Japanese

DistilBERT model pre-trained on 131 GB of Japanese web text. The teacher model is a BERT-base model that was built in-house at LINE.
Apache License 2.0

Unnecessary packages are required for tokenization #1

Closed shirayu closed 1 year ago

shirayu commented 1 year ago

From the following description in the README, I understood that this model does not require rhoknp.

The texts are first tokenized by MeCab with the Unidic dictionary and then split into subwords by the SentencePiece algorithm.

However, the following error was raised.

    tokenizer = AutoTokenizer.from_pretrained(
  File "/path/to/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 663, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(
  File "/path/to/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 388, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/path/to/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 269, in get_cached_module_file
    modules_needed = check_imports(resolved_module_file)
  File "/path/to/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 139, in check_imports
    raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: sudachipy, fugashi, unidic_lite, unidic, ipadic, rhoknp. Run `pip install sudachipy fugashi unidic_lite unidic ipadic rhoknp`

kajyuuen commented 1 year ago

Thanks for your issue! We modified the tokenizer so that it can be used in a minimal environment.

https://huggingface.co/line-corporation/line-distilbert-base-japanese/discussions/1/files

shirayu commented 1 year ago

Great! Thank you for the quick improvement.

The error message changed as follows.

ImportError: This modeling file requires the following packages that were not found in your environment: fugashi, unidic, unidic_lite. Run `pip install fugashi unidic unidic_lite`

Only one of unidic or unidic_lite should need to be installed as the dictionary, yet both are currently required. However, the current code in dynamic_module_utils.py checks all import statements, so this may be difficult to deal with. https://github.com/huggingface/transformers/blob/0558914dff91963f0488bd28747cdd45e933e7a4/src/transformers/dynamic_module_utils.py#L112-L144
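
To illustrate why both dictionaries end up being reported, here is a minimal sketch of a purely textual import scan. It is a simplified approximation written for this thread, not the actual code in dynamic_module_utils.py; the function name find_static_imports and the SAMPLE source are hypothetical.

```python
import re
from typing import List


def find_static_imports(source: str) -> List[str]:
    """Collect top-level package names from plain import statements in `source`."""
    # Matches lines such as `import fugashi`
    names = re.findall(r"^\s*import\s+(\S+)\s*$", source, flags=re.MULTILINE)
    # Matches lines such as `from unidic import DICDIR`
    names += re.findall(r"^\s*from\s+(\S+)\s+import", source, flags=re.MULTILINE)
    # Keep only the top-level package and skip relative imports
    return sorted({n.split(".")[0] for n in names if not n.startswith(".")})


# Hypothetical tokenizer source: the try/except guard is invisible to a textual scan.
SAMPLE = """
import fugashi
try:
    import unidic
except ImportError:
    import unidic_lite
"""

print(find_static_imports(SAMPLE))  # prints ['fugashi', 'unidic', 'unidic_lite']
```

Under this assumption, a scan that only looks at the text of the file cannot tell that unidic and unidic_lite are alternatives, so it lists both as required.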

(A hack using name="unidic" and import_module(name) may bypass this check.)
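
A minimal sketch of that idea, assuming the import check only looks for literal import statements: the package name is passed as a string to importlib.import_module, so no `import unidic` line appears in the file. The helper name import_first_available and the preference order are assumptions for illustration, not code from this repository.

```python
import importlib
from types import ModuleType
from typing import Optional, Sequence


def import_first_available(names: Sequence[str]) -> Optional[ModuleType]:
    """Return the first module in `names` that can be imported, or None."""
    for name in names:
        try:
            # The package name is only a string here, not an import statement.
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Assumed preference order: the lightweight dictionary first, then the full one.
dictionary = import_first_available(["unidic_lite", "unidic"])
if dictionary is None:
    print("Neither unidic_lite nor unidic is installed.")
else:
    print(f"Using dictionary package: {dictionary.__name__}")
```

With this approach either dictionary package alone would be enough at runtime, at the cost of hiding the dependency from any tool that inspects imports statically.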

kajyuuen commented 1 year ago

Thank you. I will change it so that only unidic_lite is required, since unidic_lite is the default and is basically assumed to be used. Note that this tokenizer still requires at least fugashi and unidic_lite; I will add this to the README.