"Paragraphs" -> Lines ?

greghendershott commented 2 years ago

Although this is maybe the wrong repo, hopefully that's OK because this is more of a quick question? Depending on answer I can close or move to drracket.

Could we:

Add some methods like beginning-of-line and end-of-line that are position -> position.
Document that drracket:indentation/drracket:range-indentation functions should use those.
Document a text-like will not necessarily implement the "paragraph" methods (or if that's too drastic, document that they will always be available... but might be quite slow).

?

Why I ask:

The "paragraphs" terminology has always been confusing to me. AFAICT in practice this actually means "lines".
position-paragraph is a performance bottleneck. Effectively it's "give me a line number" -- which is inherently slow -- but AFAICT indenters only need it to give to paragraph-{start end}-position.
If indenters have a position, they could give that to a {beginning end}-of-line which could efficiently just look backward/forward for a newline. No need to count "paragraph" numbers from the beginning to N, no need to maintain and smartly update some auxiliary data structure to make that less horribly slow.
Some existing indenters might compare "paragraph numbers", but AFAICT they're really asking "are these two positions within the same line". Instead you could answer that comparing beginning-of-line for two positions.
I'm thinking of this because in my "like like-text%" so far I've punted on attempting an update of "paragraphs" to be as smart/fast as that for the tokens. Although "dread" is too strong, I have a hunch this will be non-trivial -- probably a lot of effort/bugs vs. giving indenters line methods that directly answer what they want to ask?

greghendershott commented 2 years ago

p.s. I hope that didn't sound cranky/complainy! I don't know the history of this; there are probably good reasons for "paragraphs" terminology. And it makes sense than an editor object would supply line numbers. I'm just looking at indenters and lines and asking if that could be supplied more simply/directly.

rfindler commented 2 years ago

Mainly just commenting on the one bullet:

The "paragraphs" terminology has always been confusing to me. AFAICT in practice this actually means "lines".

I agree that the paragraphs/lines terminology in text% is confusing but if you look at it from the perspective of a WYSIWYG editor (where newlines terminate paragraphs and lines, at least in the GUI, are soft breaks inserted by some algorithm) those are reasonable terms.

I would suggest that, if we move away from the paragraph terminology, we pick a word that isn't "line" so as to avoid confusion with text%.

The rest of this sounds really great!

greghendershott commented 2 years ago

That's a great point about "paragraphs", and soft/visual line breaks vs. hard line breaks.

In the specific case of an identer, AFACIT it's only concerned with hard line breaks; "lines" in the sense of e.g. file->lines and #\newline.

Would it help if the method names were something like {beginning end}-of-hard-line? I don't have any better ideas atm, or strong opinions here.

greghendershott commented 2 years ago

To be clear I don't really care if the new methods have "paragraph" in the name.

So if you prefer (say) position->{start end}-of-paragraph I'd be fine.

The main point for me is the method signature position -> position (as opposed to paragraph-number -> position).

mfelleisen commented 2 years ago

It seems to me that “paragraph” means “visually-separated-block” (of hard and soft lines). A long name (not for @rfindler) but perhaps more intuitive than paragraph.

greghendershott commented 2 years ago

What I propose (though not necessarily the names) is really just this:

...

    ;; Faster alternative to the "paragraphs" methods, for use by indenters.

    (define/public (beginning-of-line pos)
      (define len (string-length str))
      (let loop ([pos pos])
        (cond [(<= pos 0) 0]
              [(<= len pos) (loop (sub1 len))]
              [(char=? #\newline (string-ref str (sub1 pos))) pos]
              [else (loop (sub1 pos))])))

    (define/public (end-of-line pos)
      (define len (string-length str))
      (let loop ([pos pos])
        (cond [(< pos 0) (loop 0)]
              [(<= len pos) (sub1 len)] ;implicit at end
              [(char=? #\newline (string-ref str pos)) pos]
              [else (loop (add1 pos))])))
...

(module+ test
  (let ()
    (define o (test-create "0\n234\n6\n8\n\n"))
    (check-equal? (send o beginning-of-line 0) 0)
    (check-equal? (send o beginning-of-line 1) 0)
    (check-equal? (send o beginning-of-line 2) 2)
    (check-equal? (send o beginning-of-line 3) 2)
    (check-equal? (send o beginning-of-line 4) 2)
    (check-equal? (send o beginning-of-line 5) 2)
    (check-equal? (send o beginning-of-line 6) 6)
    (check-equal? (send o beginning-of-line 7) 6)
    (check-equal? (send o beginning-of-line 8) 8)
    (check-equal? (send o beginning-of-line 9) 8)
    (check-equal? (send o beginning-of-line 10) 10)
    (check-equal? (send o beginning-of-line 11) 10)
    (check-equal? (send o beginning-of-line 1000) 10)
    (check-equal? (send o end-of-line -1) 1)
    (check-equal? (send o end-of-line 0) 1)
    (check-equal? (send o end-of-line 1) 1)
    (check-equal? (send o end-of-line 2) 5)
    (check-equal? (send o end-of-line 3) 5)
    (check-equal? (send o end-of-line 4) 5)
    (check-equal? (send o end-of-line 5) 5)
    (check-equal? (send o end-of-line 6) 7)
    (check-equal? (send o end-of-line 7) 7)
    (check-equal? (send o end-of-line 8) 9)
    (check-equal? (send o end-of-line 9) 9)
    (check-equal? (send o end-of-line 10) 10)
    (check-equal? (send o end-of-line 11) 10)
    (check-equal? (send o end-of-line 1000) 10)))

I'm not sure how to "port" this to an implementation for... I'm not even sure which class it is, editor%? If one of you knows how to do that in New York minute I'd be happy to let you. Otherwise glad to take a crack at it myself. Just tell me the method names please. :smile:

mflatt commented 2 years ago

I think the change would be at color:text-mixin to add the methods there (implemented with the existing paragraph methods). They would also be added to color-textoid<%> while removing the three existing paragraph methods. The new synatx-color/racket-indentation function would need to be updated, as you probably already know.

For names, I would go with position-paragraph-start-position and position-paragraph-end-position to stick with the existing text% terminology, but if those are too long, I wouldn't object to different names.

I can make these changes, but I'd prefer to wait until later in the day so we can first deal with any fallout from the changes so far (as discovered by DrDr, for example).

rfindler commented 2 years ago

Any name is good for me, subject to the constraints discussed. I agree that the name isn't the main point and I also agree there appears to be performance benefits to the revised API (not when you are already in a text% object but c'est la vie).

Robby

mflatt commented 2 years ago

I started looking at this, and now I'm not so sure. The existing navigation and indentation code could work in terms of line starts, but it's messy in at least once place (where the shrubbery indenter really does want to iterate over lines), and I worry about breaking things in other places. I think it would be a similar amount of work and tidier overall to help make position-paragraph and similar fast in "like like-text%".

greghendershott commented 2 years ago

Thank you for looking at this. I had planned to look at it myself, expecting it would be straightforward for the racket and at-exp indenters. But if I had gotten as far as the shrubbery indenter it sounds like I would have hit that thicket (pun kind of intended) you mention, and if you find it messy, I'd have been really challenged. That sounds discouraging.

I guess my intuition might be unduly influenced by things like text editors, where, AFAIK, people creating indenters historically don't want to rely on line numbers because it could be too slow.

So, for example, if someone wanted to create an indenter for a shrubbery lang for (say) neovim, and couldn't use syntax-color-lib directly or even indirectly via some LSP thing... they would probably be writing Lua code (I believe) to attempt the messy part you mentioned, probably without relying on line numbers. I think??

Even if that's correct, I don't know if that kind of scenario changes your opinion at all.

Assuming not: I can look at implementing line numbers with an efficient way to handle the case where I stop re-tokenizing because the new tokens have converged to be the old tokens merely shifted in position. I had started to do that, before pausing to ask "hey is this even really necessary", but I can resume that.

mflatt commented 2 years ago

If I understand, the "like like-text%" class is in charge of representing content, so it can intercept inserts and deletes to update a line map with a suitable data structure — right? I didn't find a suitable data structure immediately among our libraries, but I feel like I probably overlooked one. Meanwhile, I ended up playing with one (adapted from an AVL-tree implement), and I can provide it if that's helpful.

Overall, since at least one indenter needs line numbers, and since it's a general concept that seems likely useful elsewhere, I'm still in favor of having "like like-text%" support it so that there's one implementation.

It's surprising to me that a position<->line number lookup is not generally available in editors, so I can understand where you're starting from, but it doesn't seem like that has to drive our decision here.

greghendershott commented 2 years ago

Maybe position-paragraph doesn't need any special data structure to be updated. Maybe it just needs the content string.

;; A 5,000 line file, about 300,000 chars long
(define content
  (for/fold ([s ""])
            ([n (in-range 5000)])
    (string-append s "\nafd;lkajsdf;lkajsfd;lkajsdf;lakjdsfl;kajsdfadsfasdfasdfasfd")))

(define (position-paragraph pos)
  (define len (string-length content))
  (let loop ([ix 0] [num 0])
    (cond [(or (= ix pos)
               (= ix len))
           num]
          [(char=? (string-ref content ix) #\newline)
           (loop (add1 ix) (add1 num))]
          [else
           (loop (add1 ix) num)])))

(time (position-paragraph 4000))
;; cpu time: 0 real time: 0 gc time: 0
;; 67

(time (position-paragraph 30000))
;; cpu time: 0 real time: 0 gc time: 0
;; 500

(time (position-paragraph (sub1 (string-length content))))
;; cpu time: 2 real time: 2 gc time: 0
;; 5000

The second example is a style guide size file and the third is something like class-internal.rkt.

And then paragraph-{start end}-position can be the simple search backward/forward for a newline; effectively the {beginning end}-of-line I sketched out before.

rfindler commented 2 years ago

Just because I was curious, I adapted the code above to use racket:text% and upped the counts a bit and it seems that the text object does pretty well on this benchmark. So even if we discover a need for a better setup later on for this method, we'll probably be able to do it without changing the APIs.

#lang racket
(require framework)
;; A 5,000 line file, about 300,000 chars long
(define content
  (for/fold ([s ""])
            ([n (in-range 5000)])
    (string-append s "\nafd;lkajsdf;lkajsfd;lkajsdf;lakjdsfl;kajsdfadsfasdfasdfasfd")))

(define t (new racket:text%))
(send t insert content)
(send t freeze-colorer)
(send t thaw-colorer)

(define (position-paragraph.txt pos)
  (send t position-paragraph pos))

(define (position-paragraph.str pos)
  (define len (string-length content))
  (let loop ([ix 0] [num 0])
    (cond [(or (= ix pos)
               (= ix len))
           num]
          [(char=? (string-ref content ix) #\newline)
           (loop (add1 ix) (add1 num))]
          [else
           (loop (add1 ix) num)])))

(collect-garbage) (collect-garbage) (collect-garbage)

(define COUNT 1000)

(time (for ([x (in-range COUNT)]) (position-paragraph.str 4000)))
(time (for ([x (in-range COUNT)]) (position-paragraph.txt 4000)))
;; cpu time: 15 real time: 15 gc time: 0
;; cpu time: 0 real time: 0 gc time: 0

(time (for ([x (in-range COUNT)]) (position-paragraph.str 30000)))
(time (for ([x (in-range COUNT)]) (position-paragraph.txt 30000)))
;; cpu time: 96 real time: 96 gc time: 0
;; cpu time: 0 real time: 0 gc time: 0

(time (for ([x (in-range COUNT)]) (position-paragraph.str (sub1 (string-length content)))))
(time (for ([x (in-range COUNT)]) (position-paragraph.txt (sub1 (string-length content)))))
;; cpu time: 951 real time: 952 gc time: 0
;; cpu time: 0 real time: 0 gc time: 0

mflatt commented 2 years ago

FWIW, here's a text data structure that has similar performance characteristics to text%: https://gist.github.com/mflatt/6ab71f8214c5fd98dae98c8531056fa2

greghendershott commented 2 years ago

@rfindler Good to know.

@mflatt Thank you for that data structure. I didn't mean to be non-responsive to your offer to help like that. At the moment I'm cycling around various aspects of this, trying to herd the cats so none gets left too far behind. I also have some fussy things on the Emacs integration side to resolve. And I had a concurrency bug with updates from Emacs arriving at the back end out of generation order (much like TCP packets), that I needed to fix. I'll look at Position Paragraph Cat again, soon. Thanks!

greghendershott commented 2 years ago

To report back:

I incorporated the data structure @mflatt provided.
I started doing some benchmarks, comparing the performance of my implementation to that of racket:text%.
I found that although the paragraph/lines methods compared favorably (thanks to 1), the other methods like classify-position and skip-whitespace, that in my implementation were using an interval-map plus some code from expeditor, took about 10-15X the time. Causing a similar multiple especially for their use by a drracket:indentation (a.k.a. "indent-line" a.k.a. "determine-spaces") function.
I dug into token-tree%, which I'd found difficult to understand a year ago. Using that got my implementation to the 2X range. (Without digging deep, I imagine this is because a splay tree is faster than a skip list for these access patterns with such a higher locality of reference.)
I dug into paren-tree% (which, ditto). Using that, too, got my implementation to < 1.0X.

TL;DR the best available pieces were in syntax-color all along. Where I've ended up is that, at most, I am updating them somewhat differently than does color.rkt, due to the inter-process Emacs <-> Racket situation.

Most recently I've shifted back to the Emacs side to test and dog-food.

Assuming things settle down I'll look again at what, if anything, I really have to contribute back to syntax-color-lib, that e.g. expeditor could use. It's not yet clear to me if it should be library code, or example code, or improved documentation for syntax-color, or all of the above.

rfindler commented 2 years ago

Thank you @greghendershott . Sounds like Scott Owens (the original author of that code, and current researcher in the UK working on a verified Cake ML among other things) knew what he was doing!

If there is a way in which we (Emacs mode and DrRacket) could share code, I would be into that. In addition to just the natural benefits of reduced maintenance burden, I would sleep better knowing that Emacs and DrRacket are coloring the same things the same ways!

greghendershott commented 2 years ago

I think he did an excellent job on the optimization!

IIRC a year ago, I think my eyeballs snagged on the documentation:

FIXME: many methods are not yet documented.

plus since I was mentally primed to think of this in terms of intervals, I detoured.

:heavy_check_mark: re sharing code.

greghendershott commented 2 years ago

To report back again: Things seem to be stabilizing for me around this sort of hash-lang% class. I wrote some overview comments near the start, which hopefully give a decent idea of the design.

There's a little bit of bridge/shim code for how it gets used in Emacs, here, which may help give some idea how small the actual "public surface area" is.

What I'm not sure about, from another glance at expeditor-lib, is how useful any of this would be for that. For example is my design too infected with ideas about concurrency that would just be an annoying PITA for expeditor? Am I solving problem it does not have, and not solving problems it does have?

I'm more confident what I have is in the right ballpark as the "back end" to be used by a separate, possibly even remote "front end" program like Emacs or Vim (or maybe even LSP racket-langserver?). Maybe worth contributing to syntax-color-lib on that basis. But I'm not sure. And if no one else would actually use it from syntax-color-lib, my life would be simpler if it continued to live in the Racket Mode source -- where I could continue to refine without breaking other things.

So I don't have an agenda either way -- and certainly I won't feel delighted/insulted if other people do/don't want to make use of it. I'm also open to discussing how it could change to be more useful for other people. For now I just wanted to touch base, share what I've done, and see if anyone has any thoughts. Not urgent from my POV.

mflatt commented 2 years ago

Hi @greghendershott — I suspect your code would be useful in the long run in syntax-color-lib, especially for racket-langserver. Maybe it makes sense to keep your life simpler by having the code in racket-mode until there's movement in racket-langserver, though.

Meanwhile, I've been updating the shrubbery indenter to work right in DrRacket's interactions area, and I've discovered that I need the get-regions method. Robby and I think it makes sense to quickly get get-regions into color-textoid<%> as part of v8.4 (after which it becomes much more difficult to change). Then, as long as the code in your hash-lang branch is not yet live anywhere, you'd just need to add something like (define/public (get-regions) '((0 end))) to sync with the revised API. If your code is already live, though, we'd need to be more careful about this change already.

So, does that change to color-textoid<%> make sense to you, and is it ok if we make the change now?

rfindler commented 2 years ago

If we're not getting too far afield from the original topic here, one thought I had was whether or not adding some part of racket-mode to the main distribution would make sense? Would that make it easier to track multiple versions, since you could reliably count on the racket code to be code that comes with that latest version?

greghendershott commented 2 years ago

@mflatt Adding a get-regions method to color-textoid<%> would be fine.

(The hash-lang branch isn't even being used by anyone yet AFAIK; anyone adventurous enough to try wouldn't be shocked if it broke briefly.)

mflatt commented 2 years ago

@greghendershott Thanks! I've pushed that change.

greghendershott commented 2 years ago

Somewhat OT: I'm still early in my experience using the hash-lang stuff with REPL a.k.a. interactions buffers (as opposed to edit buffers). I thought about using multiple regions as does the framework color.rkt. But I found the multi lexer state regions aspect to make it even harder (for me) to understand that code? So I've been reluctant to do similar, so far. For now I've taken the approach of telling the lexer that all non-user-input is space characters, i.e. "nothing to color here, move along". While I mull the "hopeless mix".

A possible riff on "the top-level is hopeless" is "top-level output is a hopeless mix" --- of REPL prompts, user program printing-module-begin, and user program output (which itself can be a mix of prints and displays). (And, even if written initially to current-error-port :wink:, error messages can be mixed in.) It is pretty hopeless to give this mix to a lexer.

So I'm kind of ruminating on ways to keep these "streams" delineated so they can be treated differently (including but not limited to color-lexing some vs. not). I don't have any great ideas to report yet. I've sort of set that aside to think about at random intervals, while I mainly work on other things for awhile.

greghendershott commented 2 years ago

@rfindler Currently Racket Mode supports Racket versions as old as 6.9. That could probably be bumped a bit newer.

Moving some code into the main distribution as of (say) "8.5" wouldn't really help until N years from now when I drop support for pre-"8.5". Meanwhile I'd still need to "polyfill" (ship a use-this-if-not-found copy of the code) for pre-8.5. And if I fixed/added something for (say) "9.0", it would still be broken/missing before then, so I'd still need to "polyfill", so... I'm not sure how it would help much, ever? I think it's just a fundamentally different release concept than for Dr Racket which is associated with a release of Racket, and the older releases of Racket+DrRacket are available to get.

mflatt commented 2 years ago

For now I've taken the approach of telling the lexer that all non-user-input is space characters, i.e. "nothing to color here, move along". While I mull the "hopeless mix".

JFYA, I took the same approach in adapting the shrubbery indenter (on the client side, in that case). It uses a wrapper around classify-position and get-token-range that reports whitespace outside of the one region that has the queried position.

rfindler commented 2 years ago

DrRacket is not doing things a lot more complicated, if I'm understanding correctly. Specifically, it makes the regions that aren't after the latest prompt uneditable. So it colors those (once) and then can count on just leaving the colors. It also ensures that any IO that might show up is always put before the prompt. So the region that's actively being colored might move forward, but it will always be at some point and then further down to the end from there. In short, it can control where changes to the buffer happen so it does, in order to make this simpler.

greghendershott commented 2 years ago

I agree it's not a lot more complicated. And definitely I didn't mean a criticism that color.rkt is "objectively" "too" complicated. Speaking just personally, my limited brain cells had trouble tracking all the moving parts, even modulo multiple lexer-state regions. That's why I didn't immediately copy that approach... but it might turn out to be necessary and best, I don't know.

rfindler commented 2 years ago

Oh, no problem! I somehow got the impression that you guys thought it was more complicated that it was. I was trying to say that it isn't really using the full power of that method and trying to explain more what it was exactly doing. Sorry for the confusion. Thanks, @greghendershott .

racket / expeditor

"Paragraphs" -> Lines ? #10