Closed m-hammad-khan closed 1 year ago
Sorry you're having trouble with this.
Does the listed matrix.bin file exist? Have you tried re-installing?
mmap-related failures usually have to do with running out of memory due to creating too many Tagger objects, but in this case it looks like the file may not exist. I have never heard of minato before, but since you mention it, maybe you're using it to cache the UniDic files or something?
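One quick way to answer the "does the file exist" question is to check the dictionary directory directly. This is only a minimal sketch: `matrix_bin_status` is a hypothetical helper name, not part of fugashi or unidic, and it assumes you pass in the directory your Tagger is pointed at with `-d`.

```python
from pathlib import Path
from typing import Union


def matrix_bin_status(dicdir: Union[str, Path]) -> str:
    """Report whether matrix.bin is present in a MeCab dictionary directory.

    Hypothetical helper for debugging; `dicdir` is whatever directory
    is passed to the Tagger with -d.
    """
    path = Path(dicdir) / "matrix.bin"
    if not path.is_file():
        return f"missing: {path}"
    # The size is included because a truncated download is also worth ruling out.
    return f"found: {path} ({path.stat().st_size} bytes)"
```

With unidic installed, something like `print(matrix_bin_status(unidic.DICDIR))` should show whether the file is where the Tagger expects it.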
Yes, the file exists. I am using the python -m unidic download command, and I have tried re-installing it multiple times, but no luck! And yes, sorry, you can ignore Minato.
Here is the code
class Tokenizer:
    def __init__(
        self,
        system_dictionary_path: Optional[Union[str, PathLike]] = None,
        user_dictionary_path: Optional[Union[str, PathLike]] = None,
    ) -> None:
        if system_dictionary_path == "ipadic":
            system_dictionary_path = ipadic.DICDIR
        elif system_dictionary_path == "unidic":
            system_dictionary_path = unidic.DICDIR
        self._system_dictionary_path = system_dictionary_path or unidic.DICDIR
        self._user_dictionary_path = user_dictionary_path
        self._tagger: Optional[fugashi.Tagger] = None

    @classmethod
    def from_config(cls, config: SectionProxy) -> "Tokenizer":
        return Tokenizer(
            system_dictionary_path=config.get("system_dictionary_path"),
            user_dictionary_path=config.get("user_dictionary_path"),
        )

    @property
    def tagger(self) -> fugashi.Tagger:
        # setup tagger
        options = ["-r /dev/null", f"-d {minato.cached_path(self._system_dictionary_path)}"]
        if self._user_dictionary_path:
            options.append(f"-u {minato.cached_path(self._user_dictionary_path)}")
        if not self._tagger:
            self._tagger = fugashi.GenericTagger(" ".join(options))
        # setup token parser
        if "ipadic" in str(self._system_dictionary_path):
            self._parse_feature = parse_feature_for_ipadic
        elif "unidic" in str(self._system_dictionary_path):
            self._parse_feature = parse_feature_for_unidic
        else:
            raise ValueError("system_dictionary_path must contain 'ipadic' or 'unidic'")
        return self._tagger

    @staticmethod
    def normalize(text: str) -> str:
        text = jaconv.z2h(text, kana=False, ascii=True, digit=True)
        text = jaconv.h2z(text, kana=True, ascii=False, digit=False)
        text = text.replace("〜", "ー")
        return text

    def tokenize(self, text: str) -> List[Token]:
        return [self._parse_feature(token) for token in self.tagger(text)]

    def __getstate__(self) -> Dict[str, Any]:
        return {
            "system_dictionary_path": self._system_dictionary_path,
            "user_dictionary_path": self._user_dictionary_path,
        }

    def __setstate__(self, state: Dict[str, Any]) -> None:
        self._tagger = None
        self._system_dictionary_path = state["system_dictionary_path"]
        self._user_dictionary_path = state["user_dictionary_path"]
What is an example of the actual code that causes the issue? You have provided a class definition but no code using it. Also, you said it was OK to ignore minato, but your example code uses minato to cache the dictionary path...
Does just using this code work?
import fugashi
import unidic
tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))
Yes it works
Code:
import fugashi
import unidic
tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))
text = "麩菓子は、麩を主材料とした日本の菓子。"
tagger.parse(text)
# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'
for word in tagger(text):
    print(word, word.feature.lemma, word.pos, sep='\t')
    # "feature" is the Unidic feature data as a named tuple
Output:
麩 麩 名詞,普通名詞,一般,*
菓子 菓子 名詞,普通名詞,一般,*
は は 助詞,係助詞,*,*
、 、 補助記号,読点,*,*
麩 麩 名詞,普通名詞,一般,*
を を 助詞,格助詞,*,*
主材 主材 名詞,普通名詞,一般,*
料 料 接尾辞,名詞的,一般,*
と と 助詞,格助詞,*,*
し 為る 動詞,非自立可能,*,*
た た 助動詞,*,*,*
日本 日本 名詞,固有名詞,地名,国
の の 助詞,格助詞,*,*
菓子 菓子 名詞,普通名詞,一般,*
。 。 補助記号,句点,*,*
OK, in that case it seems like something is wrong with your wrapper class, particularly this line:
options = ["-r /dev/null", f"-d {minato.cached_path(self._system_dictionary_path)}"]
Ok let me try using unidic.DICDIR directly without minato
No luck. The tokenizer actually works on most of the text, but after some time it gets stuck on this error while trying to open the matrix.bin file. I'm not sure if it's a memory issue; I have 2.5 million strings to tokenize.
The matrix.bin file is only accessed when the Tagger is first created, so it sounds like you're creating multiple taggers. Are you doing something like #35 where you're creating a Tagger inside a loop or something?
You typically don't need more than one Tagger in a whole process, or at most one per thread.
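To make the "one Tagger per process" idea concrete, here is a rough sketch of building the tagger once and reusing it for every string. `StandInTagger` is a hypothetical placeholder so the example runs without MeCab installed; in real code the factory would presumably return something like `fugashi.Tagger()` instead. `get_tagger` and `tokenize_all` are illustrative names, not fugashi APIs.

```python
import functools
from typing import Iterable, List


class StandInTagger:
    """Hypothetical stand-in for fugashi.Tagger (assumption, not the real class).

    It only counts how many times it is constructed and splits on whitespace.
    """

    created = 0

    def __init__(self) -> None:
        StandInTagger.created += 1

    def __call__(self, text: str) -> List[str]:
        return text.split()


@functools.lru_cache(maxsize=None)
def get_tagger() -> StandInTagger:
    # The cache means the constructor runs once per process; later calls
    # return the same object. Swap in the real tagger constructor here.
    return StandInTagger()


def tokenize_all(texts: Iterable[str]) -> List[List[str]]:
    # Fetch the shared tagger once, outside the loop, instead of
    # constructing a new one for every string.
    tagger = get_tagger()
    return [tagger(text) for text in texts]
```

However many strings go through `tokenize_all`, the constructor only runs once, so the dictionary files are only opened once.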
Closing because this seems to be a usage issue and there's not enough information to debug it. If you can provide a reproducible example I will take a closer look.
It is solved, thanks. I was creating multiple instances.
Glad you figured it out. You need to be careful when creating multiple instances, as you can quickly run out of memory, which can cause mmap errors.
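If you really do need more than one instance, for example with worker threads, one way to keep the count bounded is to store a tagger per thread. The following is only a sketch under that assumption: `PerThreadStandIn` is a hypothetical placeholder for the real tagger class, and `thread_tagger` and `tokenize_many` are illustrative names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


class PerThreadStandIn:
    """Hypothetical stand-in for a tagger class; counts constructions."""

    created = 0
    _lock = threading.Lock()

    def __init__(self) -> None:
        with PerThreadStandIn._lock:
            PerThreadStandIn.created += 1

    def __call__(self, text: str) -> List[str]:
        return text.split()


_local = threading.local()


def thread_tagger() -> PerThreadStandIn:
    # Each thread sees its own attributes on _local, so a tagger is
    # created at most once per thread and reused afterwards.
    tagger = getattr(_local, "tagger", None)
    if tagger is None:
        tagger = PerThreadStandIn()
        _local.tagger = tagger
    return tagger


def tokenize(text: str) -> List[str]:
    return thread_tagger()(text)


def tokenize_many(texts: Sequence[str], workers: int = 4) -> List[List[str]]:
    # The number of taggers is bounded by the number of worker threads,
    # not by the number of input strings.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tokenize, texts))
```

The point of the design is that the instance count follows the worker count, which stays small even with millions of inputs.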
Hey, I am facing this runtime error when using unidic. Weirdly, it was working fine until yesterday, and somehow, without changing anything in the code, I am now getting this error. I am using an M1 MacBook and installed all the libraries using pip install.
RuntimeError:
Failed initializing MeCab. Please see the README for possible solutions:
https://github.com/polm/fugashi
If you are still having trouble, please file an issue here, and include the ERROR DETAILS below:
https://github.com/polm/fugashi/issues
issueを英語で書く必要はありません。 [English: You do not need to write the issue in English.]
ERROR DETAILS