haskell / haskell-mode

Emacs mode for Haskell
http://haskell.github.io/haskell-mode/
GNU General Public License v3.0
1.32k stars 340 forks source link

Degrading scroll performance in terminal mode #1751

Open dminuoso opened 3 years ago

dminuoso commented 3 years ago

The Problem

Since I switched to emacs I noticed how scrolling sporadically became worse. At times, it would take 3-4s to redraw 40 lines of code.

Apparent relevant factors

After doing a lot of analyzing I narrowed it down to the following combination of things:

1) Emacs must run in terminal, in GUI either the problem is not as apparent or its non-existant. 2) gc-cons-threshold must be set to a high value. LSP users likely have something like (setq gc-cons-threshold 100000000) 3) haskell-mode and font-lock-mode must be enabled 4) Do some text editing

Watch drawing lines become slower. The presence of constructor names seems especially bad. Note, that the entire line is drawn much slower, so if you vertically tile your emacs to have some unrelated buffer on the right hand side, then the entire line with many data constructors on the left hand window will be drawn slowly.

5) If you then cause a (garbage-collect), scrolling becomes fast again. test.hs.gz

Reproduce

Locally I used emacs 27.1 installed through the emacs nix overlay:

Click to see/use my nixos module

```nix { config, pkgs, ... }: let emacsOverlay = import (builtins.fetchTarball { # Descriptive name to make the store path easier to identify name = "emacs-overlay-2020-09-11"; # Commit hash for nixos-unstable as of 2018-09-12 url = https://github.com/nix-community/emacs-overlay/archive/a69588a3f7de6d68f20cea21562ab7f6f91a400a.tar.gz; # Hash obtained using `nix-prefetch-url --unpack ` sha256 = "180jp31zfrlhnz490x6hckz6z3bvfxwz1613cg4zmf0lrflrl3gp"; }); myEmacs = pkgs.emacs; emacsWithPackages = (pkgs.emacsPackagesGen myEmacs).emacsWithPackages; emacs = emacsWithPackages (epkgs: (with epkgs.melpaStablePackages; [ # melpa stable ripgrep magit # dockerfile-mode # markdown-mode # yaml-mode # json-mode # go-mode # nix-mode use-package projectile # direnv # swiper # ox-gfm # pdf-tools ]) ++ (with epkgs.melpaPackages; [ # melpa unstable solarized-theme gruvbox-theme company counsel counsel-projectile ivy evil evil-collection evil-magit haskell-mode lsp-mode lsp-ui lsp-haskell lsp-treemacs treemacs flycheck ]) ++ (with epkgs.elpaPackages; [ # elpa packages undo-tree ]) ++ [ # main packages ]); in { nixpkgs.overlays = [emacsOverlay]; services.emacs.enable = true; services.emacs.install = true; services.emacs.defaultEditor = true; services.emacs.package = emacs; } ```

  1. Download the attached test file and edit with emacsclient -nw
  2. Ensure you do not have undo-tree-mode enabled. This minor mode causes GCs during the test case, leading to inconsistent behavior.
  3. Set (setq garbage-collection-messages t) to have emacs tell you if a GC is run Note: This is just for your own sanity so you do know if for some reason there's an unexpected GC
  4. Set (setq gc-cons-threshold most-positive-fixnum) Note: We use the highest value to ensure the GC doesn't interfere with our testcase due to some unforseen circumstances. It's an unreasonable value for the real world.
  5. Repeatedly cut+paste the block below the definition of info until the end of the file. Ideally keep a small terminal (say 80x24), this lets you see the redisplay of the definitions of info Attr ... nicely during cut+paste. Repeating this 4-5 times is enough to make performance absolutely unbearable.

Watch the performance degrade with every cut+paste.

  1. Run a (garbage-collect) and watch performance be fast again.

Notes of interest:

  1. If the block at the end is wrapped inside a comment, performance stays stable during cut+paste
  2. If the block at the end is not font locked (as far as I can tell), performance degrades during cut+paste I'm beginning to think there's font-locking on whatever context is there - and the code behind it introduces markers, perhaps per-character or line.
  3. The redisplay degradation is particularly linked to the font locking on data constructors. If we add more data constructors in those definitions of info, then the performance degradation is much worse.
  4. The redisplay performance extends to the entire frame, not just the buffer. Two ways to demonstrate this: If you resize the frame to full screen, watch how every line including toolbar and mode line is drawn slowly too. You can also create horizontal/vertical split windows, and watch how these get redrawn slowly too when the frame is switched between full screen and windowed.
  5. The problem is reproducible with undo-tree-mode present, (setq gc-cons-threshold 100000000) and normal editing of text. Often writing a 60 line region can cause quite bad performance degradation already, it's just more inconsistent. This I want to point out before someone makes a comment about how repeatedly cut+paste on 10k lines of code is not normal emacs usage. This is just a synthesized test case with a clear communicatable description how to reproduce. 6) The emacs profiler does not help at all. 7) During degraded performance, the CPU usage spikes to 100% while scrolling.

Profiler output

If the performance has degraded, running the profiler while scrolling up and down for a minute gives these CPU and Memory reports

CPU Profiler output

Click me to expand!

``` Function CPU samples % - command-execute 3417 99% - call-interactively 3417 99% - funcall-interactively 3284 96% - execute-extended-command 2300 67% - sit-for 2287 66% - redisplay 2287 66% + redisplay_internal (C function) 1 0% + command-execute 12 0% scroll-down-command 534 15% - scroll-up-command 450 13% + scroll-up 1 0% + byte-code 133 3% + evil-repeat-post-hook 1 0% evil--jump-hook 1 0% + timer-event-handler 1 0% + ... 0 0% ```

Memory profiler output

Click me to expand!

``` Function Bytes % - command-execute 8,755,700 99% - call-interactively 8,755,700 99% - byte-code 4,469,695 50% - read-extended-command 4,469,695 50% - completing-read 4,469,695 50% - ivy-completing-read 4,469,695 50% - ivy-read 4,469,695 50% - read-from-minibuffer 2,628,499 29% + ivy--queue-exhibit 2,068,301 23% + command-execute 49,096 0% + redisplay_internal (C function) 12,416 0% + minibuffer-inactive-mode 7,392 0% + # 1,056 0% + undo-auto--add-boundary 1,056 0% + undo-auto--undoable-change 72 0% + ivy--reset-state 256,404 2% + ivy-call 1,056 0% + ivy--update-prompt 1,024 0% - funcall-interactively 4,286,005 48% - execute-extended-command 4,259,293 48% - command-execute 3,791,263 43% - call-interactively 3,791,263 43% - funcall-interactively 3,791,247 43% profiler-stop 3,790,592 43% profiler-start 655 0% + sit-for 273,374 3% + scroll-down-command 15,960 0% + scroll-up-command 10,752 0% + redisplay_internal (C function) 23,224 0% + jit-lock--antiblink-post-command 8,360 0% + timer-event-handler 6,200 0% evil--jump-handle-buffer-crossing 2,112 0% + global-font-lock-mode-check-buffers 1,056 0% + internal-echo-keystrokes-prefix 1,056 0% ... 0 0% ```

GC Output

I first did cut+paste to degrade performance, then executed two (garbage-collect) in a row

The first, which fixed performance, gave us the following output

First GC (After degraded performance): Click me to expand!

``` ((conses 16 4979055 645610) (symbols 48 93078 29) (strings 32 158480 15678) (string-bytes 1 8876291) (vectors 16 92580) (vector-slots 8 1958997 416166) (floats 8 336 1527) (intervals 56 1006222 9317) (buffers 1000 42)) ```

Second GC, immediately again. This is just to observe whether the first GC has collected huge amounts of memory. Note: It did not. The output is very similar to that of the run before.

First GC (After degraded performance): Click me to expand!

``` ((conses 16 4947118 644997) (symbols 48 93078 29) (strings 32 158477 15588) (string-bytes 1 8875403) (vectors 16 92579) (vector-slots 8 1958985 414138) (floats 8 336 1527) (intervals 56 1006220 9301) (buffers 1000 42)) ```

dminuoso commented 3 years ago

It seems that 75% of the time is spent in encode_coding_object

Attached you can find the perf_events report and flamegraph.

perf_report.gz out.svg.gz out.perf.gz

 Samples: 4K of event 'cycles'
# Event count (approx.): 112485425315
#
# Children      Self       Samples  Command          Shared Object          Symbol                                        
# ........  ........  ............  ...............  .....................  ..............................................
#
    99.67%     0.00%             0  .emacs-27.1-wra  [unknown]              [k] 0x41fd89415541f689
            |
            ---0x41fd89415541f689
               __libc_start_main
               main
:
               recursive_edit_1
               command_loop
               internal_catch

               internal_condition_case
               command_loop_1
               |          
               |--77.34%--read_key_sequence
               |          |          
               |           --77.28%--read_char
               |                     |          
               |                      --77.22%--redisplay_internal
               |                                |          
               |                                |--75.83%--update_frame
               |                                |          update_frame_1
               |                                |          update_frame_line
               |                                |          tty_write_glyphs
               |                                |          |          
               |                                |           --75.81%--encode_terminal_code

               |                                |                      --75.76%--encode_coding_object
               |                                |          
dminuoso commented 3 years ago

gprof agrees:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 96.23    160.67   160.67    32275     4.98     4.98  encode_coding_object

5ms per call into encode_coding_object makes for absolute horrid performance I guess.

dminuoso commented 3 years ago

Underlying issue for the test-case appears to be a large accumulation of unreachable markers in the buffer. encode_coding_object traverses the singly linked of markers at least once, potentially twice.

https://git.savannah.gnu.org/cgit/emacs.git/tree/src/coding.c?h=emacs-27.1&id=86d8d76aa36037184db0b2897c434cdaab1a9ae8#n8281

In the degenerate case this singly linked list can contain millions of elements, causing the slowdown.

What is haskell-font-lock.el doing to make this problem so apparent, even when editing smaller code regions?

arjunkathuria commented 3 years ago

Hi. i was asked by the issue author to help with confirming/reproducing the issue.

i'd like to report back that, after following the explicit steps and conditions here i can reproduce the issue on my machine to degrade scroll performance to the point of being sluggish in terminal mode (emacs -nw) (did not test gui)

it seems to get a lot better / came back to normal after i ran (garbage-collect), which also goes with the expected observation in the issue.

my emacs is compiled from source.

i was also asked to provide the output of C-h v system-configuration-options , which is as follows

system-configuration-options is a variable defined in ‘C source code’.

Its value is
"--with-native-compilation --with-json --with-pgtk --with-modules --with-mailutils --without-pop --without-xwidgets"

String containing the configuration options Emacs was built with.

hope it helps!

Edit:- My exact emacs version would be:-

This is GNU Emacs 28.0.50 (build 3, x86_64-pc-linux-gnu, GTK+ Version 3.24.29, cairo version 1.17.4)
 of 2021-06-04
Copyright (C) 2021 Free Software Foundation, Inc.
dminuoso commented 3 years ago

I suspect this to be the culprit:

https://github.com/haskell/haskell-mode/blob/b35e8547b67f3d2c4d38ec7e7e516e3c0925352f/haskell-lexeme.el#L86-L131

Is it possible that the frequent call of this during font locking generates a lot of match data with markers, left for GC? If that's the case, why don't I notice the scrolling problem in GUI mode?

dminuoso commented 3 years ago

It seems that in GUI mode there's never a call into encode_coding_object.

So to summarize: It seems that we generate a lot of unreachable markers in haskell-lexeme.el with looking-at haskell-lexeme-looking-at-qidsym. These markers cause a slow-down in encoding_coding_object which traverses the list of markers once or twice. That function is then called on redisplay on terminal mode, causing a delay on drawing each line. However, the function is never called in GUI mode.

A garbage collect would instantly bring the performance back to normal, but especially lsp-users are likely to want a high gc-cons-threshold. Also, generating this much markers causes a noticeable increase in GC delay.

Normally this wouldn't bother me as much, but I regularly trip into this the problem while doing very normal code editing - granted the scrolling speed does not get as bad as in the text-case after 4-5 cut+paste, but it's noticably sluggish nevertheless.

Assuming my hypothesis is correct:

As a user workaround I can see

  1. Reducing gc-cons-threshold
  2. Using something like gcmh
  3. Switching to GUI mode
mattiase commented 3 years ago

I can reproduce the slow-down, and the first thing to try would indeed be to cut down on the generation of markers – haskell-lexeme.el generates way more markers than necessary. haskell-lexeme-looking-at-token may be the worst of the lot.

Note that the match data is not stored as markers internally in Emacs, only as positions (integers) and an optional reference to the buffer to which they apply (unless the last match was against a string). This means that (match-data) will reify each position as a marker if the last match was against a buffer, which is quite a waste if done frequently. Using a non-nil INTEGERS argument eliminates that (at the cost of getting the buffer as an extra element at the end, but conses are much cheaper).

It also means that set-match-data can safely be called with a list of integers – there should be no need to create markers just to set the match data.

More fundamentally, I'm not sure if this obsessive use of mutating the match data everywhere to return information really is the best design. I'd just return a list like (TOKEN START END ...) etc, and use pcase to pattern-match on it. It would probably even be faster.

purcell commented 3 years ago

Thanks @mattiase, that makes sense. I'm probably the most active maintainer currently, but still not very active: I'm unlikely to have time to be able to integrate these suggestions myself, but I'd be happy to help shepherd a PR into the repo.

mattiase commented 3 years ago

Two reasons why you shouldn't apply that PR I just wrote: it hasn't been tortured properly (leaving that to you), and by doing so we may never find and fix the performance bug in Emacs that is triggered by having large amounts of markers in a buffer.

mattiase commented 3 years ago

For the record, the performance problem has largely been solved in Emacs 28. Since it won't be released for a while, it probably makes sense to do something along the lines of PR #1760 as well.