Closed markuszoeller closed 4 years ago
That's unfortunate, and I hadn't seen the notice. Thanks for the heads-up.
I'm definitely open to suggestions for alternative libraries to do something similar, even using different backends.
Maybe hunspell [1] is worth a try. It still seems to be maintained and has about 400 GitHub stars. There's also a Python wrapper for it, pyhunspell [2]. Hunspell claims to be used in many applications:
Hunspell is the spell checker of LibreOffice, Mozilla Firefox 3 & Thunderbird, Google Chrome, and it is also used by proprietary software packages, like macOS, InDesign, memoQ, Opera and SDL Trados.
I'm not sure how to verify that or double-check that it's still current, but if this holds true, it might be a good investment.
References: [1] https://github.com/hunspell/hunspell [2] https://github.com/blatinier/pyhunspell
That could be a good option. I don't have a lot of time to work on the change myself, but if you want to work on it I can commit to reviewing the code.
I'm going to spend 3-4h on my Friday afternoon to take a look. Unfortunately, every promise beyond that would be an empty one. I'll update this issue when I have news.
No worries, I appreciate any time you do have to put toward it. If not one of us, perhaps another user would have more time.
My current results are below. pyhunspell looks good. I wasn't able to produce a reviewable PR in my time box. Maybe the content below helps others to make progress:
setup
I wanted to have an untainted operating system, so I used a Docker container for my tests:
$ docker run -it ubuntu:16.04 bash
root@1390a024a2be:# apt update
root@1390a024a2be:# apt install python-pip hunspell libhunspell-dev
root@1390a024a2be:# pip install hunspell
spell check
The basic spell check is a True/False validation. Custom words can be added to the runtime dictionary (it doesn't get written into the *.dic file):
>>> import hunspell
>>>
>>> hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')
>>>
>>> hobj.spell("dockerized")
False
>>> hobj.add("dockerized")
0L
>>> hobj.spell("dockerized")
True
>>>
wordlist
The wordlist must be a .dic file with the number of words on the first line; otherwise this error comes up:
head error: line 1: missing or bad word count in the dic file
So I added the word count manually:
root@1390a024a2be:/spelling# cat wordlist.dic
1
dockerized
root@1390a024a2be:/spelling#
Now you can add it as a personal dictionary:
root@1390a024a2be:/spelling# python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')
>>> hobj.add_dic("/spelling/wordlist.dic")
0L
>>> hobj.spell("dockerized")
True
>>>
This means the current approach of a plain wordlist.txt without the word count is not supported. Maybe that can be changed at runtime, or the add method could be called for every word.
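One way to bridge the two formats would be to generate the .dic file from the plain word list. The following is only a sketch: the function name wordlist_to_dic and the file names are made up for illustration, and it assumes one word per line with no affix flags.

```python
def wordlist_to_dic(txt_path, dic_path):
    """Copy a plain word list to a .dic file, prefixed with the word count.

    Hypothetical helper: assumes one word per line, skips blank lines.
    """
    with open(txt_path, encoding="utf-8") as src:
        words = [line.strip() for line in src if line.strip()]
    with open(dic_path, "w", encoding="utf-8") as dst:
        # The first line of the .dic file carries the number of entries.
        dst.write("%d\n" % len(words))
        for word in words:
            dst.write(word + "\n")
    return len(words)
```

The resulting file could then be passed to add_dic() as in the session above.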
tokenizer
I did not find a tokenizer function in hunspell. Maybe one needs to be implemented in this project if it no longer wants to use enchant's tokenizer.
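To give an idea of the minimum such a tokenizer would have to do, here is a rough regex-based sketch. It is not taken from enchant or pyenchant; the name tokenize and the (word, offset) output shape are assumptions chosen for illustration, and a real replacement would need to handle URLs, email addresses, and similar cases.

```python
import re

# Runs of letters, optionally joined by apostrophes (e.g. "don't").
# Digits and underscores end a word. This pattern is an assumption,
# not the rule set of any existing tokenizer.
_WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenize(text):
    """Yield (word, offset) pairs for each word-like run in text."""
    for match in _WORD_RE.finditer(text):
        yield match.group(0), match.start()
```

Each yielded word could then be handed to the checker's spell() method, with the offset kept for reporting.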
Unfortunately I cannot make any promise to go further from here. IOW, whoever reads this, don't wait for me.
Thanks for the detailed notes! It's really helpful to have all of that information.
The difference in the file format probably means we need the extension to read the file itself and call add() for each word to stay backwards compatible. Perhaps if the file has a .dic extension instead of .txt, we could skip that and use add_dic() directly.
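That idea could look roughly like the sketch below. The function name load_wordlist is hypothetical, and the checker argument is assumed to be any object with add() and add_dic() methods, such as the hunspell.HunSpell instance from the notes above.

```python
import os


def load_wordlist(checker, path):
    """Feed a custom word list to checker, choosing the call by extension.

    Hypothetical helper: checker only needs add(word) and add_dic(path).
    """
    if os.path.splitext(path)[1].lower() == ".dic":
        # Already in dictionary format (word count on the first line).
        checker.add_dic(path)
        return
    # Plain text list: one word per line, blank lines ignored.
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:
                checker.add(word)
```

Existing wordlist.txt files would keep working this way, while a .dic file is passed through unchanged.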
Meanwhile I've created PR #14 to fix the broken links and add a note that PyEnchant is unmaintained.
Regarding tokenization: enchant (the C++ library) does not contain a tokenizer; it is provided by pyenchant. The code from pyenchant can probably be copied into this project and the import statement changed.
pyenchant is now unmaintained. That does not mean it has stopped working or that it will disappear. In https://github.com/rfk/pyenchant/issues/129, the owner says he is looking for a new maintainer. That might be easier than restructuring this sphinx extension.
I just took a peek at the PyEnchant project and it looks very much alive :smile:
Should this issue be closed considering PyEnchant has been active?
Oops, yes, thanks!
In case it's not known yet: it seems that pyenchant, the basis of this project, is not maintained anymore [1][2]. I'm not sure what a path forward might look like. Just wanted to let you know.
References: [1] https://github.com/rfk/pyenchant/commit/4df35b72a685505546998fadfd0aeaa4cc530429 [2] https://rfk.id.au/blog/entry/archiving-open-source-projects/