mocobeta / janome

Japanese morphological analysis engine written in pure Python
https://mocobeta.github.io/janome/en/
Apache License 2.0

Too many open files #100

Closed. ghost closed this issue 3 years ago.

ghost commented 3 years ago

I'm using Janome version 0.4.1. I created many Janome tokenizers and got the following error.

Traceback (most recent call last):
  File "janome_test.py", line 6, in <module>
  File "/tmp/test/venv/lib/python3.8/site-packages/janome/tokenizer.py", line 177, in __init__
  File "/tmp/test/venv/lib/python3.8/site-packages/janome/sysdic/__init__.py", line 92, in mmap_entries
OSError: [Errno 24] Too many open files

The code that reproduces this error is the following.

from janome.tokenizer import Tokenizer

ts = []

# Create 100 Tokenizer instances; each one memory-maps the
# dictionary files and so holds open file descriptors.
for i in range(100):
    ts.append(Tokenizer())

for t in ts:
    t.tokenize('すもももももももものうち')

mocobeta commented 3 years ago

I'm not sure what you want here. The tokenizer uses the mmap system call, so it needs file descriptors; you can avoid the "too many open files" error by increasing the system's maximum open file descriptor limit: https://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
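For example, on a Unix-like system the limit can also be raised from within the process itself via the standard resource module (a minimal sketch; resource is not available on Windows):

import resource

# Inspect the current soft/hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'soft={soft}, hard={hard}')

# Raise the soft limit up to the hard limit for this process only.
# (Raising the hard limit itself requires root privileges.)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))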

But why do you need to create so many Tokenizer instances? If the goal is just to tokenize many texts, a single instance can be shared; a sketch of the reproduction above rewritten that way:
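
from janome.tokenizer import Tokenizer

# One Tokenizer means one set of memory-mapped dictionary files,
# so the file descriptor usage stays constant no matter how many
# texts are tokenized.
t = Tokenizer()

for _ in range(100):
    list(t.tokenize('すもももももももものうち'))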

mocobeta commented 3 years ago

Actually, it's not about the specification but about system resource limits, which vary depending on the individual system (like the CPU, memory, or disk space you can use).

About documentation: it's documented that Tokenizer utilizes the memory-mapped file feature; I think that's enough for users who have basic OS knowledge. We could document how to configure the ulimit value if creating many tokenizer objects at once were a common use case, but I don't think we need to do so.
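If adjusting ulimit is not desirable, the memory-mapped file feature can also be disabled per instance (a sketch, assuming the mmap keyword flag of the Tokenizer constructor in v0.4.x; the dictionary is then loaded into process memory instead of being memory-mapped):

from janome.tokenizer import Tokenizer

# mmap=False: no memory-mapped dictionary files are kept open,
# at the cost of loading the dictionary into process memory.
t = Tokenizer(mmap=False)

for token in t.tokenize('すもももももももものうち'):
    print(token)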

mocobeta commented 3 years ago

Please check the documentation before asking questions. (We are not a support desk. ;)

https://mocobeta.github.io/janome/#memory-mapped-file-v0-3-3
https://mocobeta.github.io/janome/en/#memory-mapped-file-support-v0-3-3