Open GoogleCodeExporter opened 9 years ago
The attached patch seems to do the job. Of course, I've just been playing
around with it in my own use cases, so there well might be unforeseen problems
(though it does pass the tests). But, as long as I was playing around with it,
I figured I'd pass it on.
One possible change to the above syntax would be to mandate that additional
lines in a header be indented four spaces. This might stop accidental long
headers. (Though as pandoc works now, the same accidental running together of
header lines would also produce unexpected output, since it requires the blank
line before.)
Original comment by jesse.k....@gmail.com
on 18 Feb 2011 at 7:16
Attachments:
I'm worried about performance implications. This patch means that pandoc will
have to parse every paragraph twice -- first as part of the header parser, and
then (when that fails) as part of the paragraph parser.
This could be fixed, perhaps, by integrating the setext header parser with the
para parser.
The para parser could do something like this:
Parse one or more inlines that are not followed by a setext border (i.e.,
newline + row of = or - + blankline). Then, try to parse a setext border. If
that succeeds, return a Header, otherwise return a Para.
This might still have performance implications, though, since you'd have to be
looking for the setext border as you move through the paragraph. But it might
actually improve performance, because we wouldn't have to try the setext header
parser on the first line of every block.
You can benchmark after building using
runghc Benchmark.hs markdown
(for which you need the criterion package installed).
Original comment by fiddloso...@gmail.com
on 18 Feb 2011 at 8:34
Ack -- never mind. Slows way down with long docs. I imagine it's busy looking
over all multiline elements trying to find out if there's a header at the
bottom.
Original comment by jesse.k....@gmail.com
on 18 Feb 2011 at 8:36
Ah, looks like we were getting the same results at the same time. I'll play
aroudn with the para parser and see if I can turn out something that behaves
properly over a large doc.
Original comment by jesse.k....@gmail.com
on 18 Feb 2011 at 8:37
I've implemented your suggestion, John, and it works without any performance
hit. (Actually came out ahead, though within the std. dev.) There is one
outstanding tricky problem, though, that I'd love suggestions on.
Right now, pandoc tries to parse sections, and then tables, and then
paragraphs. Therefore, if we put setext parsing in the para parser, h2-level
headers, with single dashes, are parsed as tables before the para parser can
get to them. Any attempt I've made at reordering created more problems than it
fixed.
The issue seems to be that the table parser currently accepts a one-column
table, but it will never see it, because a one-column table is parsed as
level-2 header. Would it be worth it to try to change the table parser so it
requires at least two columns, since that's all it should ever see? If so, I
could have a go at it and include it as part of this patchset.
Original comment by jesse.k....@gmail.com
on 19 Feb 2011 at 9:31
Or, to be a bit more precise, the current behavior is that you can have a
one-column table, provided it starts with a dashed line above the text. I.e,
this is a table:
----
Foo
---
Bar
---
and this is two h2-headers:
Foo
---
Bar
---
but this behavior isn't built into the table parser. I guess the question is
whether the table parser should mandate that one-column tables have to start
with a dashed line on top in order to avoid overlap with the setext h2 parser.
Original comment by jesse.k....@gmail.com
on 20 Feb 2011 at 2:56
Okay, I applied the table fix mentioned above in the simplest possible way.
Since it's only a question on simple tables with headers, I put a `try' and a
`guard' in the parser. I imagine there's a more optimized way to do it with a
lookahead, or perhaps in the header parser, but I'm being a bit hands off
because I don't fully understand the ins and outs of that part of the code. In
any case, it passes the test, and there doesn't seem to be a performance hit in
the benchmarks -- perhaps because tables are relatively rare.
Second-stab patchset attached.
Original comment by jesse.k....@gmail.com
on 20 Feb 2011 at 9:35
Attachments:
Thanks for all of this. I need to think a bit more about the one-column table
issue. It seems desirable to allow one-column tables, but I think it's
probably enough if you can do it with a line above. So this seems the right
approach.
Original comment by fiddloso...@gmail.com
on 21 Feb 2011 at 1:17
Just had another look at this and realized I maybe was a bit unclear. The
limitations in one-column tables are already there, regardless of multiline
headers, due to an overlap with the setext definition. So these patches don't
actually change table behavior at all.
Apologies if unnecessary -- I got a bit caught up myself trying to figure out
what I meant by "the current behavior."
Original comment by jesse.k....@gmail.com
on 23 Feb 2011 at 6:16
Hi John,
Just wondering if you had come to a decision about whether or not to accept
this feature (not to say this implementation). Or is it awaiting some free time
to think it through -- perhaps the end of the semester?
No particular rush -- the patches have been working for me, without bugs or any
noticable performance hit. I've been holding off on further stuff, like tests
and `strict` behavior, but if you were interested, I might turn my attention to
it once my semester was over.
--J
Original comment by jesse.k....@gmail.com
on 20 Apr 2011 at 8:32
I haven't had time to think through all the implications yet. I'll try to get
to it in the next couple months.
Original comment by fiddloso...@gmail.com
on 20 Apr 2011 at 8:48
Cool -- thanks.
Original comment by jesse.k....@gmail.com
on 20 Apr 2011 at 8:52
Original issue reported on code.google.com by
jesse.k....@gmail.com
on 17 Feb 2011 at 3:22