Allow multiline Markdown headers

GoogleCodeExporter commented 9 years ago

It would be nice if pandoc markdown could allow multiline headers -- that is, 
multiline in the input, not necessarily in the output (though that could be 
controlled by a backslash or double-space). This can be an issue with chapter 
titles, which sometimes have subtitles, and go well beyond 80 characters.

For level 1 and 2 in setext style, the following syntax seems intuitive:

    Long Header Containing a Clever Title for This Section:
    An Overly Wordy Subtitle Explaining What the Clever Title Meant
    ===============================================================

(with an optional backslash after the colon)

Since pandoc breaks from Markdown.pl on requiring a blank line before a header, 
this wouldn't seem to introduce any new compatibility issues.

I'm not sure if there's an intuitive way in atx headers, though.

Not a huge deal on the writing end, since I can turn off auto-fill-mode for atx 
headers in emacs automatically, but it seems like it would be a win as far as 
readability goes.

Original issue reported on code.google.com by jesse.k....@gmail.com on 17 Feb 2011 at 3:22

GoogleCodeExporter commented 9 years ago

The attached patch seems to do the job. Of course, I've just been playing 
around with it in my own use cases, so there well might be unforeseen problems 
(though it does pass the tests). But, as long as I was playing around with it, 
I figured I'd pass it on.

One possible change to the above syntax would be to mandate that additional 
lines in a header be indented four spaces. This might stop accidental long 
headers. (Though as pandoc works now, the same accidental running together of 
header lines would also produce unexpected output, since it requires the blank 
line before.)

Original comment by jesse.k....@gmail.com on 18 Feb 2011 at 7:16

Attachments:

0001-Mardown-reader-initial-stab-at-multiline-headers.patch

GoogleCodeExporter commented 9 years ago

I'm worried about performance implications. This patch means that pandoc will 
have to parse every paragraph twice -- first as part of the header parser, and 
then (when that fails) as part of the paragraph parser.

This could be fixed, perhaps, by integrating the setext header parser with the 
para parser. 

The para parser could do something like this:

Parse one or more inlines that are not followed by a setext border (i.e., 
newline + row of = or - + blankline).  Then, try to parse a setext border.  If 
that succeeds, return a Header, otherwise return a Para.

This might still have performance implications, though, since you'd have to be 
looking for the setext border as you move through the paragraph.  But it might 
actually improve performance, because we wouldn't have to try the setext header 
parser on the first line of every block.

You can benchmark after building using

runghc Benchmark.hs markdown

(for which you need the criterion package installed).

Original comment by fiddloso...@gmail.com on 18 Feb 2011 at 8:34

GoogleCodeExporter commented 9 years ago

Ack -- never mind. Slows way down with long docs. I imagine it's busy looking 
over all multiline elements trying to find out if there's a header at the 
bottom.

Original comment by jesse.k....@gmail.com on 18 Feb 2011 at 8:36

GoogleCodeExporter commented 9 years ago

Ah, looks like we were getting the same results at the same time. I'll play 
aroudn with the para parser and see if I can turn out something that behaves 
properly over a large doc.

Original comment by jesse.k....@gmail.com on 18 Feb 2011 at 8:37

GoogleCodeExporter commented 9 years ago

I've implemented your suggestion, John, and it works without any performance 
hit. (Actually came out ahead, though within the std. dev.) There is one 
outstanding tricky problem, though, that I'd love suggestions on.

Right now, pandoc tries to parse sections, and then tables, and then 
paragraphs. Therefore, if we put setext parsing in the para parser, h2-level 
headers, with single dashes, are parsed as tables before the para parser can 
get to them. Any attempt I've made at reordering created more problems than it 
fixed.

The issue seems to be that the table parser currently accepts a one-column 
table, but it will never see it, because a one-column table is parsed as 
level-2 header. Would it be worth it to try to change the table parser so it 
requires at least two columns, since that's all it should ever see? If so, I 
could have a go at it and include it as part of this patchset.

Original comment by jesse.k....@gmail.com on 19 Feb 2011 at 9:31

GoogleCodeExporter commented 9 years ago

Or, to be a bit more precise, the current behavior is that you can have a 
one-column table, provided it starts with a dashed line above the text. I.e, 
this is a table:

----
Foo
---
Bar
---

and this is two h2-headers:

Foo
---
Bar
---

but this behavior isn't built into the table parser. I guess the question is 
whether the table parser should mandate that one-column tables have to start 
with a dashed line on top in order to avoid overlap with the setext h2 parser.

Original comment by jesse.k....@gmail.com on 20 Feb 2011 at 2:56

GoogleCodeExporter commented 9 years ago

Okay, I applied the table fix mentioned above in the simplest possible way. 
Since it's only a question on simple tables with headers, I put a `try' and a 
`guard' in the parser. I imagine there's a more optimized way to do it with a 
lookahead, or perhaps in the header parser, but I'm being a bit hands off 
because I don't fully understand the ins and outs of that part of the code. In 
any case, it passes the test, and there doesn't seem to be a performance hit in 
the benchmarks -- perhaps because tables are relatively rare.

Second-stab patchset attached.

Original comment by jesse.k....@gmail.com on 20 Feb 2011 at 9:35

Attachments:

GoogleCodeExporter commented 9 years ago

Thanks for all of this.  I need to think a bit more about the one-column table 
issue.  It seems desirable to allow one-column tables, but I think it's 
probably enough if you can do it with a line above.  So this seems the right 
approach.

Original comment by fiddloso...@gmail.com on 21 Feb 2011 at 1:17

GoogleCodeExporter commented 9 years ago

Just had another look at this and realized I maybe was a bit unclear. The 
limitations in one-column tables are already there, regardless of multiline 
headers, due to an overlap with the setext definition. So these patches don't 
actually change table behavior at all.

Apologies if unnecessary -- I got a bit caught up myself trying to figure out 
what I meant by "the current behavior."

Original comment by jesse.k....@gmail.com on 23 Feb 2011 at 6:16

GoogleCodeExporter commented 9 years ago

Hi John,

Just wondering if you had come to a decision about whether or not to accept 
this feature (not to say this implementation). Or is it awaiting some free time 
to think it through -- perhaps the end of the semester?

No particular rush -- the patches have been working for me, without bugs or any 
noticable performance hit. I've been holding off on further stuff, like tests 
and `strict` behavior, but if you were interested, I might turn my attention to 
it once my semester was over.

--J

Original comment by jesse.k....@gmail.com on 20 Apr 2011 at 8:32

GoogleCodeExporter commented 9 years ago

I haven't had time to think through all the implications yet.  I'll try to get 
to it in the next couple months.

Original comment by fiddloso...@gmail.com on 20 Apr 2011 at 8:48

GoogleCodeExporter commented 9 years ago

Cool -- thanks.

Original comment by jesse.k....@gmail.com on 20 Apr 2011 at 8:52

khajavi / pandoc

Allow multiline Markdown headers #289