polm / fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
MIT License

cmmap_->open(filename, mode)] cannot open #71

Closed m-hammad-khan closed 1 year ago

m-hammad-khan commented 1 year ago

Hey, I am getting this runtime error when using unidic. Oddly, it was working fine until yesterday, and now the error appears even though I have not changed anything in the code. I am using an M1 MacBook and installed all the libraries with pip install.

RuntimeError:

Failed initializing MeCab. Please see the README for possible solutions:

https://github.com/polm/fugashi

If you are still having trouble, please file an issue here, and include the ERROR DETAILS below:

https://github.com/polm/fugashi/issues

issueを英語で書く必要はありません。

ERROR DETAILS

- arguments: [b'fugashi', b'-C', b'-r', b'/dev/null', b'-d', b'~/.local/share/virtualenvs/cQRM94l6/lib/python3.10/site-packages/unidic/dicdir'] 
- viterbi.cpp(54) [connector_->open(param)] connector.cpp(24) [cmmap_->open(filename, mode)] cannot open: ~/.local/share/virtualenvs/cQRM94l6/lib/python3.10/site-packages/unidic/dicdir/matrix.bin

polm commented 1 year ago

Sorry you're having trouble with this.

Does the listed matrix.bin file exist? Have you tried re-installing?

mmap-related failures usually come from running out of memory after creating too many Tagger objects, but in this case it looks like the file may not exist. I have never heard of minato before, but since you mention it, maybe you're using it to cache the UniDic files or something?
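One way to answer the first question from Python, before any Tagger is constructed, is to check the dictionary directory directly. This is only a minimal sketch and not part of fugashi: the helper name missing_dictionary_files is hypothetical, and matrix.bin is the only file name taken from the error above. Any other names passed in `required` are the caller's assumption about what the dictionary should contain.

```python
from pathlib import Path
from typing import Iterable, List


def missing_dictionary_files(
    dicdir: str, required: Iterable[str] = ("matrix.bin",)
) -> List[str]:
    """Return the names in `required` that are not regular files in `dicdir`."""
    base = Path(dicdir)
    return [name for name in required if not (base / name).is_file()]


# Typical use (needs the unidic package, so it is left as a comment to keep
# the sketch free of third-party dependencies):
#
#     import unidic
#     print(missing_dictionary_files(unidic.DICDIR))
```

An empty list means every listed file is present, in which case the failure is more likely to come from resource exhaustion than from a missing dictionary.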

m-hammad-khan commented 1 year ago

Yes, the file exists. I am using the python -m unidic download command, and I have tried re-installing it multiple times, but no luck! And yes, sorry, you can ignore minato.

m-hammad-khan commented 1 year ago

Here is the code

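from configparser import SectionProxy
from os import PathLike
from typing import Any, Dict, List, Optional, Union

import fugashi
import ipadic
import jaconv
import minato
import unidic

# Token, parse_feature_for_ipadic and parse_feature_for_unidic are defined
# elsewhere and are not shown in this comment.


class Tokenizer: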
    def __init__(
        self,
        system_dictionary_path: Optional[Union[str, PathLike]] = None,
        user_dictionary_path: Optional[Union[str, PathLike]] = None,
    ) -> None:
        if system_dictionary_path == "ipadic":
            system_dictionary_path = ipadic.DICDIR
        elif system_dictionary_path == "unidic":
            system_dictionary_path = unidic.DICDIR

        self._system_dictionary_path = system_dictionary_path or unidic.DICDIR
        self._user_dictionary_path = user_dictionary_path

        self._tagger: Optional[fugashi.Tagger] = None

    @classmethod
    def from_config(cls, config: SectionProxy) -> "Tokenizer":
        return Tokenizer(
            system_dictionary_path=config.get("system_dictionary_path"),
            user_dictionary_path=config.get("user_dictionary_path"),
        )

    @property
    def tagger(self) -> fugashi.Tagger:
        # setup tagger
        options = ["-r /dev/null", f"-d {minato.cached_path(self._system_dictionary_path)}"]
        if self._user_dictionary_path:
            options.append(f"-u {minato.cached_path(self._user_dictionary_path)}")
        if not self._tagger:
            self._tagger = fugashi.GenericTagger(" ".join(options))
        # setup token parser
        if "ipadic" in str(self._system_dictionary_path):
            self._parse_feature = parse_feature_for_ipadic
        elif "unidic" in str(self._system_dictionary_path):
            self._parse_feature = parse_feature_for_unidic
        else:
            raise ValueError("system_dictionary_path must contain 'ipadic' or 'unidic'")

        return self._tagger

    @staticmethod
    def normalize(text: str) -> str:
        text = jaconv.z2h(text, kana=False, ascii=True, digit=True)
        text = jaconv.h2z(text, kana=True, ascii=False, digit=False)
        text = text.replace("〜", "ー")
        return text

    def tokenize(self, text: str) -> List[Token]:
        return [self._parse_feature(token) for token in self.tagger(text)]

    def __getstate__(self) -> Dict[str, Any]:
        return {
            "system_dictionary_path": self._system_dictionary_path,
            "user_dictionary_path": self._user_dictionary_path,
        }

    def __setstate__(self, state: Dict[str, Any]) -> None:
        self._tagger = None
        self._system_dictionary_path = state["system_dictionary_path"]
        self._user_dictionary_path = state["user_dictionary_path"]

polm commented 1 year ago

What is an example of the actual code that causes the issue? You have provided a class definition but no code using it. Also, you said it was OK to ignore minato, but your example code uses minato to cache the dictionary path...

Does just using this code work?

import fugashi
import unidic
tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))

m-hammad-khan commented 1 year ago

Yes, it works.

Code:

import fugashi
import unidic
tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))
text = "麩菓子は、麩を主材料とした日本の菓子。"
tagger.parse(text)
# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'
for word in tagger(text):
    print(word, word.feature.lemma, word.pos, sep='\t')
    # "feature" is the Unidic feature data as a named tuple

Output:

麩      麩      名詞,普通名詞,一般,*
菓子    菓子    名詞,普通名詞,一般,*
は      は      助詞,係助詞,*,*
、      、      補助記号,読点,*,*
麩      麩      名詞,普通名詞,一般,*
を      を      助詞,格助詞,*,*
主材    主材    名詞,普通名詞,一般,*
料      料      接尾辞,名詞的,一般,*
と      と      助詞,格助詞,*,*
し      為る    動詞,非自立可能,*,*
た      た      助動詞,*,*,*
日本    日本    名詞,固有名詞,地名,国
の      の      助詞,格助詞,*,*
菓子    菓子    名詞,普通名詞,一般,*
。      。      補助記号,句点,*,*

polm commented 1 year ago

OK, in that case it seems like something is wrong with your wrapper class, particularly this line:

options = ["-r /dev/null", f"-d {minato.cached_path(self._system_dictionary_path)}"]

m-hammad-khan commented 1 year ago

OK, let me try using unidic.DICDIR directly, without minato.

m-hammad-khan commented 1 year ago

No luck. The tokenizer actually works on most of the text, but after some time it fails with this error while trying to open the matrix.bin file. I'm not sure if it's a memory issue; I have 2.5 million strings to tokenize.

polm commented 1 year ago

The matrix.bin file is only accessed when the Tagger is first created, so it sounds like you're creating multiple taggers. Are you doing something like #35 where you're creating a Tagger inside a loop or something?

You typically don't need more than one Tagger in a whole process, or at most one per thread.
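The "one per thread" rule can be sketched with threading.local, so that a tagger is built the first time a thread asks for one and reused for every later call. The names get_tagger, make_tagger and tokenize_all are hypothetical helpers for illustration, not fugashi API. The only fugashi features assumed are that fugashi.Tagger() can be called without arguments, that a tagger is callable on a string, and that each returned node has a surface attribute.

```python
import threading
from typing import Any, Callable, Iterable, List

_local = threading.local()  # holds one tagger per thread


def get_tagger(factory: Callable[[], Any]) -> Any:
    """Return this thread's tagger, calling `factory` only on first use.

    Later calls in the same thread ignore `factory` and return the cached object.
    """
    tagger = getattr(_local, "tagger", None)
    if tagger is None:
        tagger = factory()
        _local.tagger = tagger
    return tagger


def make_tagger() -> Any:
    # Imported lazily so the sketch can be loaded without fugashi installed.
    import fugashi

    return fugashi.Tagger()


def tokenize_all(
    texts: Iterable[str], factory: Callable[[], Any] = make_tagger
) -> List[List[str]]:
    """Tokenize many strings while constructing at most one tagger per thread."""
    tagger = get_tagger(factory)  # obtained once, outside the loop
    return [[word.surface for word in tagger(text)] for text in texts]
```

With this shape, a loop over millions of strings never constructs a second tagger in the same thread, no matter how many times tokenize_all is called.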

polm commented 1 year ago

Closing because this seems to be a usage issue and there's not enough information to debug it. If you can provide a reproducible example I will take a closer look.

m-hammad-khan commented 1 year ago

It is solved, thanks. I was creating multiple instances.

polm commented 1 year ago

Glad you figured it out. You need to be careful when creating multiple instances, as you can quickly run out of memory, which can cause mmap errors.
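For a wrapper class like the one posted earlier, one way to stay safe even when many wrapper instances are created is to cache taggers by their option string, so that instances with the same options share a single tagger. This is a rough sketch under assumptions, not the fix that was actually applied in this issue: CachedTokenizer, fugashi_factory and the module-level _TAGGERS dict are hypothetical names, and the sketch is meant for single-threaded use.

```python
from typing import Any, Callable, Dict, Optional

# Process-wide cache: one tagger per distinct MeCab option string.
_TAGGERS: Dict[str, Any] = {}


def fugashi_factory(options: str) -> Any:
    # Imported lazily so the sketch loads even where fugashi is not installed.
    import fugashi

    return fugashi.GenericTagger(options)


class CachedTokenizer:
    """Wrapper whose instances are cheap: equal options share one tagger."""

    def __init__(
        self,
        dicdir: str,
        userdic: Optional[str] = None,
        factory: Callable[[str], Any] = fugashi_factory,
    ) -> None:
        options = ["-r /dev/null", f'-d "{dicdir}"']
        if userdic:
            options.append(f'-u "{userdic}"')
        self.options = " ".join(options)
        self._factory = factory

    @property
    def tagger(self) -> Any:
        # Construct on first use only; later instances reuse the cached object.
        if self.options not in _TAGGERS:
            _TAGGERS[self.options] = self._factory(self.options)
        return _TAGGERS[self.options]
```

Creating a thousand CachedTokenizer objects for the same dictionary then opens matrix.bin once rather than a thousand times.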