sphinx-contrib / spelling

A spelling checker for Sphinx-based documentation
https://sphinxcontrib-spelling.readthedocs.io/en/latest/
BSD 2-Clause "Simplified" License

pyenchant is unmaintained: path forward? #13

Closed markuszoeller closed 4 years ago

markuszoeller commented 6 years ago

In case it's not known yet: It seems that pyenchant, the basis of this project, is not maintained anymore [1][2]. I'm not sure what a path forward might look like. Just wanted to let you know.

References: [1] https://github.com/rfk/pyenchant/commit/4df35b72a685505546998fadfd0aeaa4cc530429 [2] https://rfk.id.au/blog/entry/archiving-open-source-projects/

dhellmann commented 6 years ago

That's unfortunate, and I hadn't seen the notice. Thanks for the heads-up.

I'm definitely open to suggestions for alternative libraries to do something similar, even using different backends.

markuszoeller commented 6 years ago

Maybe hunspell [1] is worth a try. It seems to be still maintained and has ~400 GitHub stars. There's also a Python wrapper for it, pyhunspell [2]. Hunspell claims to be used in many applications:

Hunspell is the spell checker of LibreOffice, Mozilla Firefox 3 & Thunderbird, Google Chrome, and it is also used by proprietary software packages, like macOS, InDesign, memoQ, Opera and SDL Trados.

I'm not sure how to verify that or double-check that it's still current, but if this holds true, it might be a good investment.

References: [1] https://github.com/hunspell/hunspell [2] https://github.com/blatinier/pyhunspell

dhellmann commented 6 years ago

That could be a good option. I don't have a lot of time to work on the change myself, but if you want to work on it I can commit to reviewing the code.

markuszoeller commented 6 years ago

I'm going to spend 3-4h on my Friday afternoon to take a look. Unfortunately, every promise beyond that would be an empty one. I'll update this issue when I have news.

dhellmann commented 6 years ago

No worries, I appreciate any time you do have to put toward it. If not one of us, perhaps another user would have more time.

markuszoeller commented 6 years ago

My current results are below. pyhunspell looks good. I wasn't able to produce a reviewable PR in my time box. Maybe the content below helps others to make progress:

setup

I wanted to have an untainted operating system, so I used a Docker container for my tests:

$ docker run -it ubuntu:16.04 bash

root@1390a024a2be:#  apt update
root@1390a024a2be:#  apt install python-pip hunspell libhunspell-dev
root@1390a024a2be:#  pip install hunspell

spell check

The basic spell check is a True/False validation. Custom words can be added to the runtime dictionary (they are not written into the *.dic file):

>>> import hunspell
>>> 
>>> hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')
>>> 
>>> hobj.spell("dockerized")
False
>>> hobj.add("dockerized")
0L
>>> hobj.spell("dockerized")
True
>>> 

wordlist

The wordlist must be a .dic file with the number of words on the first line; otherwise this error comes up:

head error: line 1: missing or bad word count in the dic file

So I added the word count manually:

root@1390a024a2be:/spelling# cat wordlist.dic
1
dockerized

root@1390a024a2be:/spelling# 

Now you can add it as a personal dictionary:

root@1390a024a2be:/spelling# python
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')
>>> hobj.add_dic("/spelling/wordlist.dic")
0L
>>> hobj.spell("dockerized")
True
>>> 

This means the current approach of a plain wordlist.txt without the word count is not supported directly. Maybe the file can be converted at runtime, or the add method could be called for every word.
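
A minimal sketch of the conversion option, assuming Python 3 and a plain UTF-8 wordlist with one word per line. The helper name wordlist_to_dic is hypothetical and not part of hunspell or this project; it only writes the word count followed by the words, which is the layout shown in the wordlist.dic example above:

```python
from pathlib import Path


def wordlist_to_dic(txt_path, dic_path):
    """Write a .dic-style file: word count on line 1, then one word per line.

    Hypothetical helper for illustration; blank lines in the input are skipped.
    Returns the number of words written.
    """
    words = [
        line.strip()
        for line in Path(txt_path).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    Path(dic_path).write_text(
        "\n".join([str(len(words))] + words) + "\n", encoding="utf-8"
    )
    return len(words)
```

The resulting file could then be passed to add_dic() as in the session above.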

tokenizer

I did not find a tokenizer function in hunspell. It may need to be implemented in this project if it no longer wants to use enchant's tokenizer.
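
For a rough idea of what a replacement could look like, here is a regex-based sketch. It is not pyenchant's tokenizer and handles far fewer cases (no URL, email, or CamelCase filtering). The names WORD_RE and tokenize, and the choice to yield (word, offset) pairs, are assumptions made for illustration:

```python
import re

# Assumption: a "word" is a run of letters, optionally joined by
# internal apostrophes (e.g. "isn't"). Digits and underscores split words.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenize(text):
    """Yield (word, offset) pairs for each word-like run in `text`."""
    for match in WORD_RE.finditer(text):
        yield match.group(), match.start()
```

Each yielded word could then be passed to the spell() method shown earlier, with the offset used to report where a misspelling occurs.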

Unfortunately I cannot make any promise to go further from here. IOW, whoever reads this, don't wait for me.

dhellmann commented 6 years ago

Thanks for the detailed notes! It's really helpful to have all of that information.

The difference in the file format probably means that, to stay backwards compatible, the extension needs to read the file itself and call add() for each word. Perhaps if the file has a .dic extension instead of .txt we could skip that and use add_dic() directly.
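
A sketch of that dispatch, assuming Python 3. The function name load_wordlist is hypothetical; the checker argument is expected to be an object exposing the add() and add_dic() methods used in the notes above, such as a hunspell.HunSpell instance:

```python
import os


def load_wordlist(checker, path):
    """Load custom words from `path` into `checker`.

    A .dic file is handed to add_dic() unchanged; any other file is treated
    as a plain one-word-per-line list and fed to add() word by word.
    """
    if os.path.splitext(path)[1].lower() == ".dic":
        checker.add_dic(path)
        return
    with open(path, encoding="utf-8") as wordlist:
        for line in wordlist:
            word = line.strip()
            if word:  # skip blank lines
                checker.add(word)
```

Words added through add() only live in the runtime dictionary, so the plain-text path would have to run on every build.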

intgr commented 6 years ago

Meanwhile I've created PR #14 to fix the broken links and add a note that PyEnchant is unmaintained.

jessetan commented 6 years ago

Regarding tokenization: enchant (the C++ library) does not contain a tokenizer; this is provided by pyenchant. The code from pyenchant can probably be copied to this project and the import statement changed.

heitorPB commented 5 years ago

pyenchant is now unmaintained. That does not mean it is not working anymore or it will disappear. In https://github.com/rfk/pyenchant/issues/129, the owner says he is looking for a new maintainer. That might be easier than restructuring this sphinx extension.

paddy-hack commented 4 years ago

I just took a peek at the PyEnchant project and it looks very much alive :smile:

jdillard commented 4 years ago

Should this issue be closed considering PyEnchant has been active?

dhellmann commented 4 years ago

Oops, yes, thanks!