Open jgm opened 2 years ago
The issue for filters could be partially mitigated if we provided a function that adds breaks [Str "hi there"]
into [Str "hi",Str " ",Str "there"]
, and a pattern synonym for Space that matches Str " "
. That would allow existing filters that depend on Space to be easily modified.
Some background in #7124
First thought: this could make some filters much easier, e.g. search-replace operations for strings containing spaces would be possible.
First thought: this could make some filters much easier, e.g. search-replace operations for strings containing spaces would be possible.
Yes, it definitely would. I discussed this for my own Pandoc API search-and-replace utility; the existence of Space
nodes (possibly arbitrarily present or absent) makes string search and replace extremely unreliable & difficult because it lifts strings to the AST list level. The only decent solution was to manually erase all Space
nodes to put the AST into a canonical order where Str
nodes contain the runs of text you expect them to contain. If there were no such nodes, I wouldn't have to do that and string munging becomes much more straightforward. It is still difficult to match any kind of mixture of formatting and text*, but no surprise there.
(It has always been opaque to me what the motivation behind having Space
nodes and parsing strings into Space
/Str
lists was, since I haven't noticed any need for it in my own API use, nor it used elsewhere, and erasing them entirely hasn't led to any particular regression in my final HTML outputs. I noticed that the Str
-erasing pass seemed to noticeably slow things down, so I would not be surprised if removing Str
in general delivered performance benefits from much less AST churn & size.)
* eg with my smallcaps code. Imagine you want to rewrite "ABC-a" to Link _ ["ABC-a"] ("https://en.wikipedia.org/wiki/ABC-a" "")
. With a straight search-and-replace of Str "...ABC-a..."
, you're fine; but what if a formatting pass has run before and so it's now actually [Str "...", Span "smallcaps-auto" _ [Str "ABC"], Str "-a..."]
. Trouble.
I think that I originally added Space to have something easy to split on for line wrapping (though the memory is hazy). In any case, we don't need it for that any more, since we use doclayout for that. I don't think there's any compelling reason to have it now, other than backwards compatibility. I remember that when I wrote the cheapskate parser, I originally had Space, then removed it, and things got much faster as a result, so I'd expect that here too.
There are certain kinds of filters that are made easier if you can assume that strings don't contain spaces, e.g. a filter that converts all instances of "AAA" into a particular link. These filters will still be possible without Space, but they'd involve more steps: in this case, finding "AAA" as a word inside a longer string, splitting the longer string, and inserting the link.
I have a small lua filter for inserting non-breakable spaces after specific prefixes, before dashes, etc ... This could simplify that filter workings to few regexes, opposed to checking for presence of Space elements. Also, would make it practically the same to filter that I am using for luaLaTeX compilations (if I am not running pandoc), which is IMHO great. It could even make pandoc AST representation more readable (or the opposite?) With that being said, if this is implemented, it would be great to reference corresponding pandoc release number here in case anybody would be searching for that ...
This will break some of my filters, but the change is a good thing in the long run, because it will make filters that look for a series of words easier to write. If the mitigation @jgm suggested was made available, I would probably use it.
These filters will still be possible without Space, but they'd involve more steps: in this case, finding "AAA" as a word inside a longer string, splitting the longer string, and inserting the link.
If you mean that this filter could be written by literal pattern-matching on replace (Str "AAA") = Str "XYZ"
, I think that is unlikely to be a useful filter: you would still need to search within Str
strings in order to insert the replacement, or else your users will be unpleasantly surprised that "foo AAA bar" replaces but, say, "foo AAA." or "(AAA foo" do not (because they no longer parse into Str "AAA"
but the different literals Str "AAA."
or Str "(AAA"
). The case of exactly space/newline-separated "AAA" strings must be a rare one indeed, and as soon as you go beyond it, you're finding words inside longer strings & re-inventing regexps poorly. All the Space
buys you in this case is an illusion that you implemented what you wanted.
Started work on nospace branch.
I'd like that change. I assume multiple spaces would (usually) be kept
verbatim in the new AST, i.e., one two
would be parsed as
Str "one two"
in most formats?
I assume multiple spaces would (usually) be kept verbatim in the new AST
Yes, that's the current plan.
I've made some progress in converting the pandoc code base, but it's messy.
Note that in the writers, we currently assume that only Space is breakable, so space characters inside Str are not breakable. This assumption is used in various places, e.g. in the Markdown and Org writers, in code designed to avoid new block-level syntax being created inadvertently by line wrapping, e.g. when a hyphen or asterisk wraps and creates a list item. I assume it's going to be possible to change this code so that the bad-wrap detection is done in a different way, but I haven't studied it in detail.
Another thing that definitely gets more complicated is pattern matching to detect inline lists that start with space, and to remove this space. Again, this can be done without a Space element, but it's more complicated.
I got things working enough to do some benchmarks with -f markdown-smart -t html
(-smart
because the abbreviation code is still not converted). This is after modifying the str
parser in the markdown reader to parse larger spans of text, including spaces.
master:
1,549,435,104 bytes allocated in the heap
102,699,664 bytes copied during GC
13,945,072 bytes maximum residency (10 sample(s))
79,632 bytes maximum slop
42 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 182 colls, 0 par 0.047s 0.048s 0.0003s 0.0017s
Gen 1 10 colls, 0 par 0.054s 0.058s 0.0058s 0.0122s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.156s ( 0.171s elapsed)
GC time 0.102s ( 0.106s elapsed)
EXIT time 0.000s ( 0.007s elapsed)
Total time 0.258s ( 0.286s elapsed)
Alloc rate 9,902,821,760 bytes per MUT second
Productivity 60.6% of total user, 59.8% of total elapsed
nospace branch:
1,264,393,400 bytes allocated in the heap
75,768,304 bytes copied during GC
10,495,536 bytes maximum residency (10 sample(s))
77,232 bytes maximum slop
35 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 144 colls, 0 par 0.036s 0.036s 0.0003s 0.0012s
Gen 1 10 colls, 0 par 0.044s 0.048s 0.0048s 0.0117s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.133s ( 0.147s elapsed)
GC time 0.080s ( 0.084s elapsed)
EXIT time 0.000s ( 0.003s elapsed)
Total time 0.213s ( 0.236s elapsed)
Alloc rate 9,532,232,140 bytes per MUT second
Productivity 62.2% of total user, 62.3% of total elapsed
The nospace
branch uses significantly less memory (35 vs 42), does fewer GCs (144 vs 182), and finishes faster (0.213s vs 0.258s). However, the differences aren't huge. I'm still not completely convinced that the advantages in performance, or the advantages in facilitating filter writing, outweigh the inconvenience of a big AST change like this. As for filter writing, it's always possible to wrap your filter in a utility function that, first, removes spaces, then adds them again.
Proof of concept:
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition
import qualified Text.Pandoc.Builder as B
import qualified Data.Text as T
import Text.Pandoc.Walk (walk)
import Data.List (intersperse)
removeSpace :: [Inline] -> [Inline]
removeSpace = B.toList . mconcat . map go
where
go Space = B.str " "
go x = B.singleton x
addSpace :: [Inline] -> [Inline]
addSpace = B.toList . mconcat . foldr go mempty
where
go (Str t) = (B.text t :)
go x = (B.singleton x :)
test :: IO ()
test = do
let doc = [Str "hello", Space, Str "there", Space, Emph [Str "friend"]]
print $ walk (addSpace . map specialFilter . removeSpace) $ doc
specialFilter :: Inline -> Inline
specialFilter (Str t) = Str $ T.replace "hello there" "hi ya" t
specialFilter x = x
I assume multiple spaces would (usually) be kept verbatim in the new AST
@tarleb is there any reason they should be?
@tarleb is there any reason they should be?
The one example that I have is my head is sentences. People had been asking for ways to markup sentences, but it's difficult to distinguish between sentence endings and abbreviations. Keeping any double spaces a period could allow users to write filters which split on sentences.
Yes, that's something to keep in mind. We can certainly represent a doubled space in a Str element. However, the current behavior of Text.Pandoc.Builder.text is to collapse adjacent spaces, so if a reader relies on that, doubled spaces won't be preserved. Of course, the reader itself could be made sensitive to (sentence-ending punctuation + doubled space) and preserve it in this case (using B.str rather than B.text). [EDIT: That's not quite right; the Monoid instance will also collapse spaces on append.]
Hello, any progress on this? I currently replace Space with Str(" ") and then merge consecutive Str with a filter, but it's not really convenient and sometimes slow.
See above. I'm not persuaded that the advantages outweigh the rather large drawbacks of an API change.
The API doesn't necessarily have to change. Perfect is the enemy of better: Space could be left in the AST type just as it is now, but deprecated and not generated by Pandoc, so filter writers could make the assumption of its absence (and querying for the existence of any Space node is presumably cheap, as well as Pandoc warning about use of deprecated Space nodes).
Such a required invariant is no worse than many parts of the status quo like using partial functions such as head/tail, or having fall-through AST pattern-matching, or not necessarily handling stuff wrapped in RawInline/Block elements, or the lack of Attr on many elements which ideally would have it etc.
And if an author does write a filter or relies on a dependency which re-introduces Space nodes, that is then their personal problem to deal with, and the burden of collapsing Space/Str nodes to pass on a clean AST to subsequent filters also their problem.
Thus far from the comments, far more filters are hindered than helped by the inconsistent confusion of Space/Str-with-whitespace-in-them nodes, and I think it's fair that the programming & performance burden be put on the few users who demand to use Space for whatever reason, and not on the many users who don't use it.
By API change, I don't just mean removing the Space constructor -- I mean removing the expectation that Space will be used for spaces as it is now. There may be many filters and third-party tools that rely on this expectation (as well as many parts of the pandoc code, but at least that we control ourselves). So I'd predict (based on many past experiences) that the change would cause a lot of things to break. One can debate about where the "burden" should lie, but I think there's a presumption against making changes that break existing things.
This is a fairly radical change, so I'm posting this here for comment. The AST has a Space constructor for inlines, so we get:
The suggestion is to remove Space so that we get instead:
I'm still unclear what this might break in the pandoc code base -- and it might also break things in third-party filters. It would likely make the code much faster, though, and remove some unnecessary complexity. It would also increase the readability of the native AST.