jgm / pandoc

Universal markup converter
https://pandoc.org
Other
33.31k stars 3.31k forks source link

Proposal: remove Space element in AST #7579

Open jgm opened 2 years ago

jgm commented 2 years ago

This is a fairly radical change, so I'm posting this here for comment. The AST has a Space constructor for inlines, so we get:

% pandoc -t native
Hi there
[Para [Str "Hi",Space,Str "there"]]

The suggestion is to remove Space so that we get instead:

[Para [Str "Hi there"]]

I'm still unclear what this might break in the pandoc code base -- and it might also break things in third-party filters. It would likely make the code much faster, though, and remove some unnecessary complexity. It would also increase the readability of the native AST.

jgm commented 2 years ago

The issue for filters could be partially mitigated if we provided a function that adds breaks [Str "hi there"] into [Str "hi",Str " ",Str "there"], and a pattern synonym for Space that matches Str " ". That would allow existing filters that depend on Space to be easily modified.

jgm commented 2 years ago

Some background in #7124

denismaier commented 2 years ago

First thought: this could make some filters much easier, e.g. search-replace operations for strings containing spaces would be possible.

gwern commented 2 years ago

First thought: this could make some filters much easier, e.g. search-replace operations for strings containing spaces would be possible.

Yes, it definitely would. I discussed this for my own Pandoc API search-and-replace utility; the existence of Space nodes (possibly arbitrarily present or absent) makes string search and replace extremely unreliable & difficult because it lifts strings to the AST list level. The only decent solution was to manually erase all Space nodes to put the AST into a canonical order where Str nodes contain the runs of text you expect them to contain. If there were no such nodes, I wouldn't have to do that and string munging becomes much more straightforward. It is still difficult to match any kind of mixture of formatting and text*, but no surprise there.

(It has always been opaque to me what the motivation behind having Space nodes and parsing strings into Space/Str lists was, since I haven't noticed any need for it in my own API use, nor it used elsewhere, and erasing them entirely hasn't led to any particular regression in my final HTML outputs. I noticed that the Str-erasing pass seemed to noticeably slow things down, so I would not be surprised if removing Str in general delivered performance benefits from much less AST churn & size.)

* eg with my smallcaps code. Imagine you want to rewrite "ABC-a" to Link _ ["ABC-a"] ("https://en.wikipedia.org/wiki/ABC-a" ""). With a straight search-and-replace of Str "...ABC-a...", you're fine; but what if a formatting pass has run before and so it's now actually [Str "...", Span "smallcaps-auto" _ [Str "ABC"], Str "-a..."]. Trouble.

jgm commented 2 years ago

I think that I originally added Space to have something easy to split on for line wrapping (though the memory is hazy). In any case, we don't need it for that any more, since we use doclayout for that. I don't think there's any compelling reason to have it now, other than backwards compatibility. I remember that when I wrote the cheapskate parser, I originally had Space, then removed it, and things got much faster as a result, so I'd expect that here too.

There are certain kinds of filters that are made easier if you can assume that strings don't contain spaces, e.g. a filter that converts all instances of "AAA" into a particular link. These filters will still be possible without Space, but they'd involve more steps: in this case, finding "AAA" as a word inside a longer string, splitting the longer string, and inserting the link.

Delanii commented 2 years ago

I have a small lua filter for inserting non-breakable spaces after specific prefixes, before dashes, etc ... This could simplify that filter workings to few regexes, opposed to checking for presence of Space elements. Also, would make it practically the same to filter that I am using for luaLaTeX compilations (if I am not running pandoc), which is IMHO great. It could even make pandoc AST representation more readable (or the opposite?) With that being said, if this is implemented, it would be great to reference corresponding pandoc release number here in case anybody would be searching for that ...

fiapps commented 2 years ago

This will break some of my filters, but the change is a good thing in the long run, because it will make filters that look for a series of words easier to write. If the mitigation @jgm suggested was made available, I would probably use it.

gwern commented 2 years ago

These filters will still be possible without Space, but they'd involve more steps: in this case, finding "AAA" as a word inside a longer string, splitting the longer string, and inserting the link.

If you mean that this filter could be written by literal pattern-matching on replace (Str "AAA") = Str "XYZ", I think that is unlikely to be a useful filter: you would still need to search within Str strings in order to insert the replacement, or else your users will be unpleasantly surprised that "foo AAA bar" replaces but, say, "foo AAA." or "(AAA foo" do not (because they no longer parse into Str "AAA" but the different literals Str "AAA." or Str "(AAA"). The case of exactly space/newline-separated "AAA" strings must be a rare one indeed, and as soon as you go beyond it, you're finding words inside longer strings & re-inventing regexps poorly. All the Space buys you in this case is an illusion that you implemented what you wanted.

jgm commented 2 years ago

Started work on nospace branch.

tarleb commented 2 years ago

I'd like that change. I assume multiple spaces would (usually) be kept verbatim in the new AST, i.e., one two would be parsed as Str "one two" in most formats?

jgm commented 2 years ago

I assume multiple spaces would (usually) be kept verbatim in the new AST

Yes, that's the current plan.

jgm commented 2 years ago

I've made some progress in converting the pandoc code base, but it's messy.

Note that in the writers, we currently assume that only Space is breakable, so space characters inside Str are not breakable. This assumption is used in various places, e.g. in the Markdown and Org writers, in code designed to avoid new block-level syntax being created inadvertently by line wrapping, e.g. when a hyphen or asterisk wraps and creates a list item. I assume it's going to be possible to change this code so that the bad-wrap detection is done in a different way, but I haven't studied it in detail.

Another thing that definitely gets more complicated is pattern matching to detect inline lists that start with space, and to remove this space. Again, this can be done without a Space element, but it's more complicated.

jgm commented 2 years ago

I got things working enough to do some benchmarks with -f markdown-smart -t html (-smart because the abbreviation code is still not converted). This is after modifying the str parser in the markdown reader to parse larger spans of text, including spaces.

master:

   1,549,435,104 bytes allocated in the heap
     102,699,664 bytes copied during GC
      13,945,072 bytes maximum residency (10 sample(s))
          79,632 bytes maximum slop
              42 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       182 colls,     0 par    0.047s   0.048s     0.0003s    0.0017s
  Gen  1        10 colls,     0 par    0.054s   0.058s     0.0058s    0.0122s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.156s  (  0.171s elapsed)
  GC      time    0.102s  (  0.106s elapsed)
  EXIT    time    0.000s  (  0.007s elapsed)
  Total   time    0.258s  (  0.286s elapsed)

  Alloc rate    9,902,821,760 bytes per MUT second

  Productivity  60.6% of total user, 59.8% of total elapsed

nospace branch:

   1,264,393,400 bytes allocated in the heap
      75,768,304 bytes copied during GC
      10,495,536 bytes maximum residency (10 sample(s))
          77,232 bytes maximum slop
              35 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       144 colls,     0 par    0.036s   0.036s     0.0003s    0.0012s
  Gen  1        10 colls,     0 par    0.044s   0.048s     0.0048s    0.0117s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.133s  (  0.147s elapsed)
  GC      time    0.080s  (  0.084s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time    0.213s  (  0.236s elapsed)

  Alloc rate    9,532,232,140 bytes per MUT second

  Productivity  62.2% of total user, 62.3% of total elapsed

The nospace branch uses significantly less memory (35 vs 42), does fewer GCs (144 vs 182), and finishes faster (0.213s vs 0.258s). However, the differences aren't huge. I'm still not completely convinced that the advantages in performance, or the advantages in facilitating filter writing, outweigh the inconvenience of a big AST change like this. As for filter writing, it's always possible to wrap your filter in a utility function that, first, removes spaces, then adds them again.

jgm commented 2 years ago

Proof of concept:

{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition
import qualified Text.Pandoc.Builder as B
import qualified Data.Text as T
import Text.Pandoc.Walk (walk)
import Data.List (intersperse)

removeSpace :: [Inline] -> [Inline]
removeSpace = B.toList . mconcat . map go
 where
  go Space = B.str " "
  go x = B.singleton x

addSpace :: [Inline] -> [Inline]
addSpace = B.toList . mconcat . foldr go mempty
 where
  go (Str t) = (B.text t :)
  go x = (B.singleton x :)

test :: IO ()
test = do
  let doc = [Str "hello", Space, Str "there", Space, Emph [Str "friend"]]
  print $ walk (addSpace . map specialFilter . removeSpace) $ doc

specialFilter :: Inline -> Inline
specialFilter  (Str t) = Str $ T.replace "hello there" "hi ya" t
specialFilter x = x
jgm commented 2 years ago

I assume multiple spaces would (usually) be kept verbatim in the new AST

@tarleb is there any reason they should be?

tarleb commented 2 years ago

@tarleb is there any reason they should be?

The one example that I have is my head is sentences. People had been asking for ways to markup sentences, but it's difficult to distinguish between sentence endings and abbreviations. Keeping any double spaces a period could allow users to write filters which split on sentences.

jgm commented 2 years ago

Yes, that's something to keep in mind. We can certainly represent a doubled space in a Str element. However, the current behavior of Text.Pandoc.Builder.text is to collapse adjacent spaces, so if a reader relies on that, doubled spaces won't be preserved. Of course, the reader itself could be made sensitive to (sentence-ending punctuation + doubled space) and preserve it in this case (using B.str rather than B.text). [EDIT: That's not quite right; the Monoid instance will also collapse spaces on append.]

lewer commented 2 years ago

Hello, any progress on this? I currently replace Space with Str(" ") and then merge consecutive Str with a filter, but it's not really convenient and sometimes slow.

jgm commented 2 years ago

See above. I'm not persuaded that the advantages outweigh the rather large drawbacks of an API change.

gwern commented 2 years ago

The API doesn't necessarily have to change. Perfect is the enemy of better: Space could be left in the AST type just as it is now, but deprecated and not generated by Pandoc, so filter writers could make the assumption of its absence (and querying for the existence of any Space node is presumably cheap, as well as Pandoc warning about use of deprecated Space nodes).

Such a required invariant is no worse than many parts of the status quo like using partial functions such as head/tail, or having fall-through AST pattern-matching, or not necessarily handling stuff wrapped in RawInline/Block elements, or the lack of Attr on many elements which ideally would have it etc.

And if an author does write a filter or relies on a dependency which re-introduces Space nodes, that is then their personal problem to deal with, and the burden of collapsing Space/Str nodes to pass on a clean AST to subsequent filters also their problem.

Thus far from the comments, far more filters are hindered than helped by the inconsistent confusion of Space/Str-with-whitespace-in-them nodes, and I think it's fair that the programming & performance burden be put on the few users who demand to use Space for whatever reason, and not on the many users who don't use it.

jgm commented 2 years ago

By API change, I don't just mean removing the Space constructor -- I mean removing the expectation that Space will be used for spaces as it is now. There may be many filters and third-party tools that rely on this expectation (as well as many parts of the pandoc code, but at least that we control ourselves). So I'd predict (based on many past experiences) that the change would cause a lot of things to break. One can debate about where the "burden" should lie, but I think there's a presumption against making changes that break existing things.