rrthomas / enchant

enchant spellchecking library
http://rrthomas.github.io/enchant/
GNU Lesser General Public License v2.1
347 stars 60 forks source link

Remove ispell pipe mode (`enchant-2 -a`) #393

Open rrthomas opened 3 months ago

rrthomas commented 3 months ago

Emacs is the only user, and with the jinx package, no longer needs it.

astoff commented 3 months ago

I'm afraid this is a bit of a misunderstanding. Jinx is just one of the Emacs spell-checking packages, with its own advantages and disadvantages. Note also that any spellchecking package that doesn't rely on the -a interface necessarily has the disadvantage of requiring to link Emacs against some library or compile an external module.

rrthomas commented 3 months ago

@astoff Thanks for your feedback. It's not a misunderstanding: I wrote the Enchant support for ispell.el and use it daily.

I agree that there are advantages to not needing to compile against an external library, but I also believe these are largely outweighed in the case of Enchant: the pipe method is slow and limits access to Enchant functionality. It also causes me a maintenance burden.

rrthomas commented 3 months ago

I have filed minad/jinx#199 and started using Jinx myself to explore how far we can go on the Emacs side. Certainly Jinx was easy to set up and use.

rrthomas commented 1 month ago

It is worth considering the interface of nuspell(1) to copy for a simplified enchant(1).

astoff commented 1 month ago

What use do you have in mind? IMHO ispell -a is the only that that anyone might care about, and the nuspell command (kinda like but not quite the same) is pretty useless.

rrthomas commented 1 month ago

Mostly diagnostics, also simple batch-mode checking. As you say, not that useful. But I don't think there's much use for pipe-based checking any more.

astoff commented 1 month ago

Just for the record, I wrote an Emacs package last year that does pipe-based checking and in my book it works very well (in fact, it served as inspiration for Jinx).

Also, apart from figuring out the async communication (which as always was tricky), it is a simpler package than Jinx, and doesn't require compiling a module.

Also, as a design decision, my package doesn't dabble into tokenization questions, which is language dependent, and leaves it entirely to the spellchecker. Because of https://github.com/rrthomas/enchant/issues/316, enchant doesn't work well with my package, so I've been using hunspell instead. (Was that issue solved? I can't reproduce it right now). If I was using enchant, though, I'd protest vehemently :-).

rrthomas commented 1 month ago

Also, as a design decision, my package doesn't dabble into tokenization questions

This is interesting, because I'm not clear that Enchant should be doing this either. While it is reasonable to think that Enchant could do it when fed plain text, I don't think that makes sense for most applications (which aren't editing plain texts), which already have to know where the words are.

astoff commented 1 month ago

I'm not sure either, but the rules for what constitutes a word are language-dependent (see e.g. /usr/share/hunspell/*.aff) and I would argue that belongs to the spell checker, not to the application using it. There are lots of tricky details, such as when surrounding punctuation is part of the word.

Certainly there is a part of the story that belongs to the application. For example, restrict to comments in source code or filter out markup commands.

rrthomas commented 1 month ago

The problem is that the application will typically also be involved in deciding what is the text that has to be checked, according to non-textual markup. It has to work out all this stuff anyway, e.g. for layout purposes.

astoff commented 1 month ago

Sure, and that's true even when there's no visual markup, like I said in the second paragraph of my previous message.

But deciding what constitutes a word (or in general any information encoded in the dictionary aff file) is a different story.

rrthomas commented 1 month ago

By "all this stuff", I meant that applications requiring spell-checking will already have to work out where word boundaries are, for layout purposes. So it's not a different story, it's exactly the same story, I'm claiming.

astoff commented 1 month ago

How is the application supposed to know if a hypen is a word boundary (well-known) or part of a fixed expression (vis-à-vis)? Splitting on whitespace or something like that isn't exactly the same story as splitting words.

minad commented 4 weeks ago

I think what is needed is some API where the application can ask Enchant about language-dependent tokenization rules. There is already enchant_dict_get_extra_word_characters but this is probably too limited. The bulk of the tokenization work is still better done on the side of the application, since it knows about the relevant context sensitive information, the markup format, comment syntax, etc.

Regarding the pipe mode, I think using it for a just in time Emacs spell checker was the wrong decision, @astoff. The performance is poor, otherwise I wouldn't have made Jinx. In contrast, interacting with the Enchant API is completely painless, except for the module compilation process. But then I think Elisp should provide an FFI at some point which would completely alleviate this pain.

astoff commented 4 weeks ago

The performance is poor

Well, I still can't reproduce the poor performance you got, but also I think it would be very sad if there were any fundamental reason for it to be that way. It would mean LSP cannot work smoothly either.

In contrast, interacting with the Enchant API is completely painless, except for the module compilation process.

Sure, I have no doubt about that ;-)

I think what is needed is some API where the application can ask Enchant about language-dependent tokenization rules.

I think that's how most APIs (e.g. Hunspell) work, actually. It seems to me it's not general enough (for example in English you need to expresses that ' is part of a word when inside, but not when around a word). To me it would seem to make more sense if the API allowed the application just split on whitespace. So you can ask if "well-knownx" is correct and the answer would not be yes or no, but rather that there's a typo form characters 5 to 11.

minad commented 4 weeks ago

Well, I still can't reproduce the poor performance you got, but also I think it would be very sad if there were any fundamental reason for it to be that way. It would mean LSP cannot work smoothly either.

You probably didn't sufficiently profile your package, since jit-spell appeared prominently in the profiles in my test, which is already completely unacceptable for an always-on minor mode, even if not noticeable. Given all these complaints about slow Emacs, the recipe for that are expensive mode lines and slow minor modes, which are always on. I run EXWM and everything runs inside Emacs, so I may be more affected. Also I assume that the effect becomes more noticeable on slower systems. Now contrast this with Jinx, which doesn't add much cost at all to the profile. IPC is simply more expensive than a single C function call plus some FFI marshaling. Why are we even arguing this?

Regarding LSP - LSP in Emacs had massive problems due to the IPC and JSON parsing, this is for sure. See the lsp/eglot-booster projects and also the new JSON parser in Emacs 30. In any case, for LSP we get some benefit by using the IPC protocol due to the various servers and simply because the protocol already exists in this form and cannot be replaced easily. If LSP servers could run inside the host process without the expensive communication one could save some latency, but the downsides would be more significant given the complexity of LSP. In the case of spell checking, the API essentially boils down to a single function to check a word plus a little bit of setup. There the trade-offs are different and Enchant already provides a nice abstraction to embed the spell checker inside the host program with very little hassle. So this is just an apples to oranges comparison with different answers.

Just to be clear regarding this discussion - I am neutral. I have need for the pipe mode but I had assumed that the pipe mode would also be useful for other software, which would speak in favor of keeping it. Also there is jit-spell. Then I've learned that @rrthomas only added the pipe mode for ispell.el integration, so it is up to him to consider removing it again if the maintenance effort is not justifiable.

astoff commented 4 weeks ago

This is not a matter of profiling. The issue you observe I can't reproduce, and that's the problem. I did profile quite a bit and say on an Org mode buffer I always I see something like this, which says all is fine:

        1328  71% + redisplay_internal (C function)
         194  10%   Automatic GC
         135   7% + command-execute
         112   6% + org-appear--post-cmd
          40   2% + jit-spell--process-filter
          33   1% + timer-event-handler
           9   0% + winner-save-old-configurations
           2   0% + internal-echo-keystrokes-prefix
           2   0%   internal-timer-start-idle
           2   0%   undo-auto--add-boundary
           1   0%   org-appear--pre-cmd
           1   0%   tooltip-hide
           0   0%   ...

Also I assume that the effect becomes more noticeable on slower systems.

In the absence of further information I would assume a slow system is slower overall. I would be very interested in getting to the bottom of this, because it must be due to some setting or build configuration. But obviously I have no access to your computer to further investigate, neither did I get any similar bug report.

[I also have no complaints regarding LSP — or rather, I do notice some latency but it doesn't noticeably slows down my Emacs, which is in line with my jit-spell experience.]

Concerning the ispell -a interface, I just came here to say that Emacs still uses it. Not with the intent to advocate for anything, but just to let @rrthomas take an informed decision (it turn out it was unnecessary to mention anything, since he was aware of the situation already.)

minad commented 4 weeks ago

@astoff

This is not a matter of profiling. The issue you observe I can't reproduce, and that's the problem. I did profile quite a bit and say on an Org mode buffer I always I see something like this, which says all is fine:

The profile you showed already demonstrates the problem. It is clearly not fine. If you do the same with Jinx it won't appear with 2% in the profile. 2% is a lot for the process filter only to send a few words back and forth, and it will even get up when scrolling, and the backend process cost is not even included yet. Anyway, I am not interested in discussing this further given your ignorant position. There is simply nothing wrong on my system, nothing to diagnose, nothing to get to the bottom of, except that I don't have the most powerful machine - still no latency or performance issues in my current setup without jit-spell. Look, you know that I am experienced enough given my other packages, long time programming and Emacs development experience, so you could also trust me on this, instead of repeatedly claiming that there is no problem, despite the profile on your machine already showing some impact. I gave your package a fair and long try. In the beginning it worked well, but after short time of usage I found that latency went up unacceptably. I tried to make suggestions regarding how the efficiency could be improved via some rate limiting and queuing to reduce the pressure on the inefficient Emacs process filters, which you were not interested in. Why should I trade this given that Jinx can just do without any increase in latency? Jinx started as an experiment and it immediately solved the problem, otherwise I wouldn't have pursued it further.

[I also have no complaints regarding LSP — or rather, I do notice some latency but it doesn't noticeably slows down my Emacs, which is in line with my jit-spell experience.]

When reading around one will see reports about LSP performance and latency issues. It depends on the server you are using, and I assume that you only use good ones which keep the amount of data over IPC to a minimum. Of course you also have the experience to diagnose given that you've even authored your own server.

astoff commented 4 weeks ago

Man, you really went ballistic in that message. I don't quite see what you think you need to defend. It's obvious that a module is way faster than IPC, and Jinx is probably a great package.

We should indeed stop discussing this, but I do want to correct your point that I am

repeatedly claiming that there is no problem

To quote myself, “the issue you observe I can't reproduce, and that's the problem”. Rephrasing more straightforwardly, the problem is that I can't reproduce the issue you observe. Please don't read in that assertion more than is written. I'm just saying that unfortunately I couldn't work with the info you provided me, despite the effort of both parts.

minad commented 1 week ago

@astoff I apologize. I was frustrated. You do not observe the performance issues I am observing, while you generally seem to agree that performance must be better if we are cutting out IPC. Of course this only holds as long as Enchant is quick enough with its synchronous responses. In my usage it is, but things can also easily go wrong as in https://github.com/rrthomas/enchant/issues/386. Enchant could even crash, which fortunately has never happened for me. So I believe we are at least theoretically on the same page, while on the practical level regarding the practical performance implications we are not in agreement. In my experience, Emacs quickly gets sluggish if there is too much idle work. I've always perceived Emacs performance as bearable, but not great, so I use as few inefficient modes in the background as possible. I am certainly not the only one who feels like this. For example I cannot use one of these nice and fancy looking, but inefficient mode lines, which likely double redisplay time. Maybe for you the situation is different and you've never had anything to complain about Emacs performance. But maybe you've also observed the performance jumps between 27, 28 and 30? I don't recall anything significant with 28 -> 29, but the other two jumps seemed to be observable. But anyway, I doubt that any of this off-topic discussion matters for @rrthomas regarding his decision about pipe mode. I think it is worth keeping pipe mode just for compatibility with other programs which use this, some which we may not even be aware of. But I also understand that every piece of code has some implied maintenance burden. Sometimes the burden may be acceptable for existing code which works and is well tested and widely used, but this may not be the case here. In my experience, one should be particularly careful about adding new features which may bring new problems. The weight of new features should be checked carefully.

astoff commented 1 week ago

Hi Daniel, thanks for the kind reply. It's indeed interesting how there's a lot of back and forth in those performance issues you linked. It's not always easy to reproduce them.

Clearly the Emacs IPC isn't great. Perhaps a point where we differ is, I think it could and should be improved and "asyncio" via cooperative multitasking ought to work well (people often say they want multithreading in Elisp, but I'm a bit skeptical... how do you use it in a program with so much global state?). But then again, even if I find it cool to make Emacs talk to subprocesses, I see that spellchecking is kind of a special case, because it's high frequency needs to be very fast.

(In the case of spell-checking, even putting the IPC question apart, note that ispell -a unconditionally computes corrections for each misspelling, which is much more expensive than just checking, and, more often than not, not necessary.)

minad commented 1 week ago

Clearly the Emacs IPC isn't great. Perhaps a point where we differ is, I think it could and should be improved and "asyncio" via cooperative multitasking ought to work well (people often say they want multithreading in Elisp, but I'm a bit skeptical...

I am also skeptical. Imho the right way for multi threading would be to add separate worker jobs and to improve IPC and async story to talk to external processes and workers (futures, async/monadic sequencing macros, etc).

But then again, even if I find it cool to make Emacs talk to subprocesses, I see that spellchecking is kind of a special case, because it's high frequency needs to be very fast.

Yes, this was somehow my point. I believe that the relative IPC overhead is particularly high small amounts of data are sent frequently and separately. In contrast, if a large amount of data is sent in bulk, Emacs can handle it acceptably well. See also read-process-output-max which was recently increased and which should probably be even higher if a lot of data is sent. Also one may want to disable process-adaptive-read-buffering. I temporarily adjust both values in Consult to improve ripgrep output performance.

(In the case of spell-checking, even putting the IPC question apart, note that ispell -a unconditionally computes corrections for each misspelling, which is much more expensive than just checking, and, more often than not, not necessary.)

Oh, I didn't know that. I think this also explains why I've observed a relatively high load due to the spell checker process alone (only a few percent, but still).