tmalsburg / guess-language.el

Emacs minor mode that detects the language you're typing in. Automatically switches spell checker. Supports multiple languages per document.

Very slow in Org buffers #17

Closed joostkremers closed 7 years ago

joostkremers commented 7 years ago

Hi,

I've been experiencing a terrible slow-down in Org buffers recently, especially when inside tables, but also just moving the cursor around partially collapsed headings. A quick profiling showed that guess-language is the culprit, especially the call to how-many in guess-language-region:

- flyspell-post-command-hook                                     2958  89%
 - flyspell-word                                                 2958  89%
  - flyspell-highlight-duplicate-region                          2940  88%
   - run-hook-with-args-until-success                            2940  88%
    - guess-language-function                                    2940  88%
     - guess-language                                            2940  88%
      - guess-language-region                                    2883  87%
         how-many                                                2883  87%
      + backward-paragraph                                         49   1%
      + forward-paragraph                                           8   0%
  + org-mode-flyspell-verify                                       18   0%
+ command-execute                                                 323   9%
+ yas--post-command-handler                                        14   0%
+ redisplay_internal (C function)                                   9   0%
+ timer-event-handler                                               3   0%
+ ...                                                               0   0%

It seems that the longer the Org file, the bigger the slow down. I'm guessing that this may be caused by the fact that backward-paragraph in an Org buffer may travel very far back: in one particular Org file of mine, it moves almost all the way to the beginning of the buffer.

I tried the obvious thing, i.e., use org-backward-paragraph and org-forward-paragraph in guess-language-paragraph if major-mode is org-mode, but that didn't seem to have much of an effect. Perhaps you know of a better way to deal with the issue?

tmalsburg commented 7 years ago

I'm not experiencing any notable slowdowns even in large org files. how-many is just counting matches of a regular expression and that shouldn't be slow even with larger texts. I think there may be another issue that is causing these slowdowns. We could certainly introduce a work-around (e.g. use either the paragraph or the last N words, whatever is shorter) but it would be better to track down the true cause of this issue. The fact that you experienced these slowdowns only recently also suggests that the problem may not be rooted in guess-language because I didn't make any changes over the last two months.
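One way to check this on a particular buffer is to time how-many directly. What follows is only a rough sketch: `my/time-how-many` is a hypothetical helper, and the regexps passed to it are whatever the caller supplies, not the ones guess-language uses internally.

```elisp
(require 'benchmark)

;; Hypothetical helper, not part of guess-language.
(defun my/time-how-many (regexps beginning end)
  "Return the seconds spent running `how-many' for each of REGEXPS.
Matches are counted between BEGINNING and END in the current buffer."
  (car (benchmark-run 1
         (dolist (re regexps)
           (how-many re beginning end)))))

;; Example call: time two simple regexps over the whole buffer.
;; (my/time-how-many '("the" "and") (point-min) (point-max))
```

Comparing the result for a single paragraph with the result for the whole buffer gives a feel for how the cost grows with region size.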

joostkremers commented 7 years ago

> I'm not experiencing any notable slowdowns even in large org files.

Yeah, I was afraid of that. ;-)

BTW, this is with org 9.0.5. Not sure if that makes a difference.

> how-many is just counting matches of a regular expression and that shouldn't be slow even with larger texts. I think there may be another issue that is causing these slowdowns.

My investigations suggest otherwise. I added a call to message to guess-language-region to output the beginning and end of the region, and I generally get fairly outrageous numbers, such as guess-language-region entered: 11100 54425. Plus the profiler output suggests that how-many is indeed slow when run against such a large region.
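The same kind of trace can be obtained without editing the package source by advising the function. This is a minimal sketch, assuming guess-language-region receives the region bounds as its first two arguments (as the message above implies); `my/guess-language-log-region` is a hypothetical name.

```elisp
;; Hypothetical helper: log the bounds passed to `guess-language-region'.
;; Assumes the call looks like (guess-language-region BEGINNING END).
(defun my/guess-language-log-region (beginning end &rest _)
  "Report BEGINNING and END of the region about to be analysed."
  (message "guess-language-region entered: %d %d (%d chars)"
           beginning end (- end beginning)))

;; Run the logger before every call to the advised function.
(advice-add 'guess-language-region :before #'my/guess-language-log-region)

;; To remove the advice again:
;; (advice-remove 'guess-language-region #'my/guess-language-log-region)
```

The output lands in the *Messages* buffer, so the region sizes can be inspected after moving point around a slow buffer.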

Note that this is primarily a problem in tables, though there is also some slowdown when navigating collapsed headers. In normal text, it isn't nearly as bad.

> The fact that you experienced these slowdowns only recently also suggests that the problem may not be rooted in guess-language because I didn't make any changes over the last two months.

Well, I haven't been using any Org files for the past two months, so that may also be a reason. ;-)

The point is that disabling guess-language-mode makes the problem go away. Also, after changing the definition of guess-language to:

...
(let ((beginning (max 0 (- (point) 100)))
      (end       (min (point-max) (+ (point) 100))))
...

the slowdowns disappeared.

So, I believe the facts clearly indicate that guess-language-region should not be called on a region that is too large, and 40000+ characters is certainly too large. It could well be that there is some idiosyncrasy in my Org files that makes guess-language use such large regions, but regardless, I think it would be best if guess-language-region would guard against being called on regions that are too large.
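A guard of this kind could be sketched as follows. The names `my/max-region-size` and `my/clamp-region-around-point`, and the limit of 5000 characters, are assumptions made for illustration; nothing here is taken from guess-language itself.

```elisp
;; Hypothetical cap on the number of characters to analyse.
(defvar my/max-region-size 5000
  "Upper bound on the size of the region handed to language detection.")

(defun my/clamp-region-around-point (beginning end)
  "Return (BEG . END) inside BEGINNING..END, at most `my/max-region-size' long.
The window is centred on point, which is assumed to lie inside the region."
  (let ((half (/ my/max-region-size 2)))
    (cons (max beginning (- (point) half))
          (min end (+ (point) half)))))
```

Regions that are already small pass through unchanged; oversized ones are cut down to a window around point, which bounds the cost at the price of no longer analysing the complete paragraph.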

tmalsburg commented 7 years ago

Sorry, I should have been more precise. I wasn't suggesting that this slowdown has nothing to do with guess-language at all. Clearly the long processing time is incurred in how-many which is called by guess-language. However, that doesn't mean that the cause of the problem is necessarily in guess-language. If we get outrageous numbers for the start and end of the paragraph, that suggests that the problem is probably in the code for detecting paragraphs, i.e. outside guess-language's responsibility. This is also supported by the fact that you still have outrageous numbers even when you're using org-backward-paragraph and org-forward-paragraph. Unless your document consists of one or multiple gigantic paragraphs, that shouldn't happen.

If there is a problem with paragraph detection, that should be fixed first, not least because it will likely create other problems than just slowing down guess-language. You say that guess-language should guard against using regions that are too large, but that's exactly what I wanted to achieve when I decided to do detection on a by-paragraph basis. Perhaps additional guards are necessary, but I wouldn't want to add a workaround for an issue that may be specific to your particular setup. So let's find out why you're getting these ridiculously long paragraphs and we can go from there.

joostkremers commented 7 years ago

First off, I should note that when I tried using org-(forward|backward)-paragraph in guess-language to remedy my problem, I changed the wrong function (guess-language-paragraph rather than guess-language). If I change the right function, the problem is indeed solved.

Basically, what happens is this: if you have something like the following in an Org file:

* Some Heading
Some text.
** Another Heading
| A | Table |
|---|-------|
|   |       |

and point is somewhere in the table, then backward-paragraph moves to the beginning of "Some text". But if instead you have:

* Some Heading
- Some list item
** Another Heading
| A | Table |
|---|-------|
|   |       |

then backward-paragraph will skip over "- Some list item".

Crucially, backward-paragraph also skips over headings and the beginning of tables. So in this particular example, backward-paragraph would move point to the beginning of the buffer.

The Org-specific functions, org-(forward|backward)-paragraph do stop at headings, list items and tables. So in the second example, with point being in the table, org-backward-paragraph moves point to the beginning of the table.

This means that if you have an Org file that consists solely of headings, lists and tables, backward-paragraph will move point to the beginning of the buffer. And it just so happens that this is the case in the Org file I'm experiencing slowdowns in. It doesn't have much "normal" text (or even none at all), so every time guess-language is called, it basically checks the entire buffer.

In other words, not all text formats can use the default (forward|backward)-paragraph functions. Given that fact, I think (forward|backward)-paragraph aren't the best choice to ensure that the region to be checked isn't too large.

tmalsburg commented 7 years ago

So could we fix this by simply using org-backward-paragraph instead of backward-paragraph and likewise for forward-paragraph whenever we're in an org buffer?

joostkremers commented 7 years ago

> So could we fix this by simply using org-backward-paragraph instead of backward-paragraph and likewise for forward-paragraph whenever we're in an org buffer?

For Org buffers that should be enough, yes. There's of course the theoretical consideration that other text modes may also not work with the standard (forward|backward)-paragraph functions, but if such an issue ever comes up, you could deal with it then.
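The mode-dependent choice discussed here could look roughly like this. It is a sketch rather than the code in the repository: `my/paragraph-bounds` is a hypothetical name, and it assumes Org is installed so that org-backward-paragraph and org-forward-paragraph are available.

```elisp
(require 'org)

;; Hypothetical helper, not part of guess-language.
(defun my/paragraph-bounds ()
  "Return (BEG . END) of the paragraph at point.
Use the Org motion commands in Org buffers and the generic ones elsewhere."
  (let ((backward (if (derived-mode-p 'org-mode)
                      #'org-backward-paragraph
                    #'backward-paragraph))
        (forward (if (derived-mode-p 'org-mode)
                     #'org-forward-paragraph
                   #'forward-paragraph)))
    ;; `save-excursion' keeps point where the user left it.
    (cons (save-excursion (funcall backward) (point))
          (save-excursion (funcall forward) (point)))))
```

Other modes with unusual paragraph rules could be handled by adding further branches to the same dispatch if the need ever arises.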

tmalsburg commented 7 years ago

See 2fd50238e1b30603754497195b6411c8996cb769 and let me know if this works for you. I have to say, I'm not sure that this is the correct solution. The granularity is now really fine. For example, each item in a list is a paragraph and most of my list items do not provide enough material for reliable language identification. So we may have to change this to something more sophisticated, e.g., use backward/forward-paragraph unless it gives us an insanely large region and only then fall back to org-backward-paragraph. Or use the default paragraph in org but never more than the current subtree.
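The first alternative (generic paragraph motion, with the Org commands only as a fallback for oversized regions) might be sketched like this. All names and the 5000-character threshold are assumptions for illustration, and this is not what the referenced commit does.

```elisp
(require 'org)

;; Hypothetical threshold above which a paragraph counts as too large.
(defvar my/max-paragraph-size 5000
  "Region size, in characters, that triggers the Org-specific fallback.")

(defun my/bounds-with (backward forward)
  "Return (BEG . END) found by calling BACKWARD and FORWARD from point."
  (cons (save-excursion (funcall backward) (point))
        (save-excursion (funcall forward) (point))))

(defun my/paragraph-bounds-with-fallback ()
  "Return paragraph bounds, preferring the generic paragraph commands.
In Org buffers, fall back to the Org commands when the generic region
exceeds `my/max-paragraph-size'."
  (let ((bounds (my/bounds-with #'backward-paragraph #'forward-paragraph)))
    (if (and (derived-mode-p 'org-mode)
             (> (- (cdr bounds) (car bounds)) my/max-paragraph-size))
        (my/bounds-with #'org-backward-paragraph #'org-forward-paragraph)
      bounds)))
```

This keeps coarse paragraphs where the generic commands behave sensibly and only pays the price of finer granularity in buffers made up mostly of headings, lists and tables.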

joostkremers commented 7 years ago

That should work, I currently have something similar in a local copy of guess-language. But you're right that the granularity is probably too small now. I had another approach at first:

https://github.com/joostkremers/guess-language.el/commit/b2474dbf301249c4337bc8b3fb9cbb6bc383bce7

Something like that could perhaps be combined with (forward|backward)-paragraph to guard against backward-paragraph moving too far.

tmalsburg commented 7 years ago

You're using the last 100 characters, right? By-paragraph language detection is a feature and this would break it. Specifically, when point is at the beginning of a paragraph you would effectively guess the language of the last paragraph not the current one. 100 chars may also be too little for a reliable guess. I think what I will do is to use org-backward/forward-paragraph unless we're in a list in which case I will use org-beginning/end-of-item-list; something like that.

joostkremers commented 7 years ago

Yeah. I was thinking you could do something like

(max (save-excursion (backward-paragraph) (point)) (- (point) 100))
...

Or whatever value instead of 100 would make sense. But that would still mean that larger paragraphs aren't tested entirely. Not sure if that's an issue.

tmalsburg commented 7 years ago

The paragraph language would depend on the position of point and I'm not sure that is a desirable property. The main language of a paragraph is what it is independently of the point.

I changed the code such that org lists are treated as paragraphs: 8c8a1616b6a7bc4c10942ee0a1b2591b98fcd493 That should work ok in most practical cases.

joostkremers commented 7 years ago

Thanks. There don't seem to be any slowdowns. I'll let you know if I run into any trouble, but I think we can close this.

manuel-uberti commented 7 years ago

Sorry for bringing up this issue again, but I am experiencing the slowdown in Org buffers.

I am using:

Disabling guess-language-mode in the Org buffer removes the slowdown. Also, it happens every time, without taking into consideration buffer size.

If you need me to do some tests or provide extra information, please feel free to ask.

tmalsburg commented 7 years ago

If it happens even in small buffers, this is probably a different problem, but let's find out. Could you please use the function below to see what region guess-language is using for detection?

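(defun guess-language-current-region ()
  "Highlight and report the region guess-language would use at point."
  (interactive)
  (let ((beg (save-excursion (guess-language-backward-paragraph) (point)))
        (end (save-excursion (guess-language-forward-paragraph) (point))))
    ;; Reuse the secondary-selection overlay to highlight the region.
    (move-overlay mouse-secondary-overlay beg end (current-buffer))
    ;; `message' formats its arguments itself; no need to wrap it in `format'.
    (message "Region beg: %d Region end: %d Region length: %d" beg end (- end beg))))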

Place point on a paragraph where you experience slow detection and then call this function. It shows the region coordinates in the mini buffer and also highlights the region that would be used for detection. Are these regions excessively large?

joostkremers commented 7 years ago

FWIW, I'm running the latest version of guess-language as well and haven't seen any slowdowns anymore. So it most likely is a different issue.

manuel-uberti commented 7 years ago

@tmalsburg this is an example, if you need more just ask:

Region beg: 785 Region end: 802 Region length: 17

@joostkremers I'm running the latest guess-language as well, but the slowdowns only go away when I disable it in the current buffer.

tmalsburg commented 7 years ago

So guess-language is running on just 17 characters. Since this is shorter than the minimal paragraph length (guess-language-min-paragraph-length), guess-language should actually not do anything at all. Could you run M-x guess-language at the same position and check how long that takes?
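The length check being described amounts to something like the sketch below. `my/min-paragraph-length` is a hypothetical stand-in for guess-language-min-paragraph-length, and the value 40 is an assumption chosen for the example.

```elisp
;; Hypothetical stand-in for `guess-language-min-paragraph-length';
;; the value 40 is an assumption for this example.
(defvar my/min-paragraph-length 40
  "Minimal number of characters needed before attempting a guess.")

(defun my/region-long-enough-p (beg end)
  "Return non-nil if the region from BEG to END is worth analysing."
  (>= (- end beg) my/min-paragraph-length))

;; The 17-character region reported above is rejected:
(my/region-long-enough-p 785 802)   ; => nil
```

With such a check in place, a 17-character region should cost next to nothing, which is why a hang at that position points to a problem elsewhere.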

manuel-uberti commented 7 years ago

It hangs indefinitely.

tmalsburg commented 7 years ago

Ah, I think it's a corner case that I forgot to handle in the latest commit. Do you have a list at the very beginning of the document?

manuel-uberti commented 7 years ago

Yes, and also throughout the whole document. An excerpt:

* Super to open Dash
- sudo apt remove dell-super-key

* Swap TAB with CTRL
** dconf > org > gnome > desktop > input-sources
- set xkb-options to ['ctrl:swapcaps']

tmalsburg commented 7 years ago

Hm, there clearly is a bug that is triggered when you have a plain list at the very beginning of the document. However, later in the document this bug shouldn't cause any problems. So if you experience this issue everywhere in the document, there must be something else going on. It's going to be difficult to track this down if I can't reproduce the problem. Could you please try to come up with a minimal working example that reproduces the problem (emacs -q ...)?

tmalsburg commented 7 years ago

The hang-up in buffer-initial plain lists should be fixed now. 2bc0e1f9c8947b9b5ac8d792bd7f6d2c36d294ab

manuel-uberti commented 7 years ago

Thanks for the commit and the explanation.

I can't reproduce it using emacs -Q, so it's definitely something in my configuration. Consider this closed, then, and thanks again for the kind support.

tmalsburg commented 7 years ago

Well, it's still possible that guess-language interacts with other packages in a very unfortunate way. In this case, we should make changes to prevent this from happening. So if you find out what's going on, please let me know. Thanks! (Closing for now.)