Closed dfc closed 3 years ago
The output of --trace is confusing for me. I ran pandoc -f html -t markdown --trace --verbose example on the following file (I added the line numbers in the example file to make it easier to decode the trace output):
1 <html>
2 <head>
3 <title>pressence of title in head does not change behavior</title>
4 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
5 <meta http-equiv="Content-Style-Type" content="text/css">
6 </head>
7 <body>
8 <h1 class="title">This text disappears</h1>
9 Some text
10 </body>
11 </html>
$ pandoc --verbose --trace -f html -t markdown -S example
line 1: []
line 1: []
line 1: []
line 6: []
line 7: []
line 7: []
line 8: [Plain [Str "Some",Space,Str "text"]]
line 6: [Plain [Str "Some",Space,Str "text"]]
line 10: []
line 10: []
line 11: []
Some text
$
It seems odd that line 8 is parsed/printed before line 6 AND that they are the same. Hopefully this helps you out.
I apologize for not including this earlier.
pandoc 1.15.0.4
Compiled with texmath 0.8.2.2, highlighting-kate 0.6.
Syntax highlighting is supported for the following languages:
abc, actionscript, ada, agda, apache, asn1, asp, awk, bash, bibtex, boo, c,
changelog, clojure, cmake, coffee, coldfusion, commonlisp, cpp, cs, css,
curry, d, diff, djangotemplate, dockerfile, dot, doxygen, doxygenlua, dtd,
eiffel, email, erlang, fasm, fortran, fsharp, gcc, glsl, gnuassembler, go,
haskell, haxe, html, idris, ini, isocpp, java, javadoc, javascript, json,
jsp, julia, kotlin, latex, lex, lilypond, literatecurry, literatehaskell,
lua, m4, makefile, mandoc, markdown, mathematica, matlab, maxima, mediawiki,
metafont, mips, modelines, modula2, modula3, monobasic, nasm, noweb,
objectivec, objectivecpp, ocaml, octave, opencl, pascal, perl, php, pike,
postscript, prolog, pure, python, r, relaxng, relaxngcompact, rest, rhtml,
roff, ruby, rust, scala, scheme, sci, sed, sgml, sql, sqlmysql,
sqlpostgresql, tcl, tcsh, texinfo, verilog, vhdl, xml, xorg, xslt, xul,
yacc, yaml, zsh
Default user data directory: /home/dfc/.pandoc
Copyright (C) 2006-2015 John MacFarlane
Web: http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.
Thanks for your help, Prof. MacFarlane.
Yes, this is intentional. When pandoc produces HTML itself, it uses an h1 with class "title" to render the document title. So when parsing HTML it doesn't include this in the document body.
This behavior goes WAY back, but it's probably not the best idea. I think I was doing round trip testing, and without this behavior you'd get a double title on round trip...
+++ Douglas Calvert [Jul 11 15 14:47 ]:
> pandoc drops H1 headings if they contain class=title.
>
> Example html file:
>
> presence of title in head does not change behavior
> This text disappears
> Some text
>
> Importing the above html to markdown yields:
>
> Some text
>
> I was expecting
>
> This text disappears
> Some text
>
> In a strange turn of events H2s with class=title are not ignored.
>
> Importing:
>
> words
> This text disappears
> Some text
> H2s dont disappear
> This text does not disappear
> Some more text
> More text appears here
>
> Yields the following:
>
> Some text
> H2s dont disappear {.title}
> This text does not disappear {.something}
> Some more text
> More text appears here {.something}
@jgm Pardon me if I'm totally off:
isn't it better to convert the title in HTML document head to title in meta-data block in markdown (and not treat h1 with title class in a special way)?
If the current behavior is to be maintained, perhaps documenting it might be helpful.
BTW, I did a quick test using title class in markdown but the HTML output's title is empty in the head section:
cat <<EOF | pandoc -t html -s
#Title {.title}
body
EOF
Result:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
<title></title>
<style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<h1 id="title" class="title">Title</h1>
<p>body</p>
</body>
</html>
+++ nkalvi [Jul 11 15 15:55 ]:
> @jgm Pardon me if I'm totally off: isn't it better to convert the title in HTML document head to title in meta-data block in markdown (and not treat h1 with title class in a special way)? If the current behavior is to be maintained, perhaps documenting it might be helpful.
<title> in <head> IS converted to title in metadata.
The issue is that, often, the <h1 class="title"> element
is just a visible manifestation on the page of the title.
So if we parse it like a regular h1, we get two versions
of the same thing -- one in the metadata title, another
as a header in the body. And this will give bad results
for most of pandoc's output formats -- e.g. if you try
it in LaTeX, you'll get a title printed on the page,
and then a first-level section with the same text.
> BTW, I did a quick test using title class in markdown but the HTML output's title is empty in the head section:
Right, this is as expected -- the special behavior above does not make sense for Markdown, since in Markdown you have a visual representation of the title already in the metadata, and there's no need to repeat it as an element of the body.
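The rule described above can be sketched in a few lines. This is only an illustration in Python using the standard library's html.parser, not pandoc's actual Haskell reader; the names TitleDroppingReader and read_html are hypothetical and exist only for this sketch. It takes the metadata title from <title>, keeps ordinary headings, and sets aside any <h1 class="title">, which is also why an h2 with the same class survives.

```python
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class TitleDroppingReader(HTMLParser):
    """Toy model of the rule discussed above (not pandoc's real reader)."""

    def __init__(self):
        super().__init__()
        self.meta_title = ""   # text of <title>, used as the metadata title
        self.headings = []     # (level, text) pairs kept in the body
        self.dropped = []      # text of <h1 class="title"> elements set aside
        self._target = None    # "meta", "dropped", a heading level, or None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "title":
            self._target = "meta"
        elif tag == "h1" and "title" in classes:
            self._target = "dropped"   # special case: only h1, only class "title"
        elif tag in HEADING_TAGS:
            self._target = int(tag[1])
        else:
            return                     # other markup: leave the current state alone
        self._buf = []

    def handle_data(self, data):
        if self._target is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._target is None or (tag != "title" and tag not in HEADING_TAGS):
            return
        text = "".join(self._buf).strip()
        if self._target == "meta":
            self.meta_title = text
        elif self._target == "dropped":
            self.dropped.append(text)
        else:
            self.headings.append((self._target, text))
        self._target = None


def read_html(source):
    """Parse an HTML string and return the populated toy reader."""
    reader = TitleDroppingReader()
    reader.feed(source)
    reader.close()
    return reader


doc = read_html(
    "<html><head><title>words</title></head><body>"
    '<h1 class="title">This text disappears</h1>Some text'
    '<h2 class="title">H2s dont disappear</h2>Some more text'
    "</body></html>"
)
print(doc.meta_title)  # words
print(doc.headings)    # [(2, 'H2s dont disappear')]
print(doc.dropped)     # ['This text disappears']
```

In this sketch the skipped headings are collected in dropped rather than discarded silently, so a caller could report them to the user if it wanted to.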
@jgm Thanks for kindly explaining this; I did see the title being treated as expected in HTML head/meta data block. Wouldn't it be helpful to indicate it in the documentation still?
Could the handling be improved by checking for the presence of a title first before suppressing it?
> Yes, this is intentional. When pandoc produces HTML itself, it uses an h1 with class "title" to render the document title. So when parsing HTML it doesn't include this in the document body.
> This behavior goes WAY back, but it's probably not the best idea. I think I was doing round trip testing, and without this behavior you'd get a double title on round trip...
Hello Prof. MacFarlane (@jgm),
I promise I am not trying to be obtuse. Your explanation makes sense when I read it, but when I try to apply this understanding everything falls apart.
Given my understanding that pandoc uses/treats/approaches H1s with class=title as the title of an html file; given the following input:
<h1 class="title">This is a L1 Heading Title</h1>
Some text
<h2>Second Level Heading</h2>
Some more text
I expect that pandoc -f html -t markdown -s --atx-headers input will yield:
---
title: This is a L1 Heading Title
...
Some text
## Second Level Heading
Some more text
The expectation is that pandoc will use the thing that pandoc commonly considers a title as the title for the created document.
Side Note:
I understand that the world is messy and at times compromises need to be
made. But it seems a little strange to me that pandoc drops input without
warning the user. Maybe --trace or --verbose could be put to work
in order to provide the user with some sort of notification that
pandoc intentionally threw away some input it was given.
@dfc From what I understood, <h1 class="title">This is a L1 Heading Title</h1> is considered
> is just a visible manifestation on the page of the title.
i.e. just a repeat of the <title>This is a L1 Heading Title</title> in the <head> of the HTML document, and thus ignored.
I still feel it can be handled better (see my earlier comments). BTW, obtuse? - I like it :smile:
Douglas, you're right that the <h1 class="title"> isn't
used to populate the "title" field of metadata on reading
HTML. The <title> element of the head is used for that,
and I don't think we'd want to override that. Perhaps if
<title> is empty it would make sense to do this, I don't
know.
I'm half tempted just to remove the special treatment of
<h1 class="title">. You are not the first to ask about
it. On the other hand, it's hard to know how many people
might be relying on this behavior.
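The fallback floated above (use <h1 class="title"> for the metadata title only when <title> is empty) could look roughly like the following. This is a hedged sketch in Python with the standard library's html.parser, not a proposal for pandoc's Haskell code; TitleCandidates and metadata_title are hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser


class TitleCandidates(HTMLParser):
    """Collect the text of <title> and of the first <h1 class="title">."""

    def __init__(self):
        super().__init__()
        self.head_title = ""
        self.h1_title = ""
        self._field = None   # name of the attribute currently being filled

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "title":
            self._field = "head_title"
        elif tag == "h1" and "title" in classes and not self.h1_title:
            self._field = "h1_title"

    def handle_data(self, data):
        if self._field:
            setattr(self, self._field, getattr(self, self._field) + data)

    def handle_endtag(self, tag):
        if tag in ("title", "h1"):
            self._field = None


def metadata_title(source):
    """Prefer <title>; fall back to <h1 class="title"> only if <title> is empty."""
    parser = TitleCandidates()
    parser.feed(source)
    parser.close()
    return parser.head_title.strip() or parser.h1_title.strip()


fragment = (
    '<h1 class="title">This is a L1 Heading Title</h1>\n'
    "Some text\n"
    "<h2>Second Level Heading</h2>\n"
    "Some more text\n"
)
print(metadata_title(fragment))  # This is a L1 Heading Title
```

With a non-empty <title> in the head, the function returns that text and never consults the h1, so the existing precedence is not overridden.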
FWIW, not me. :-) Remove away.
For the record, I also ran into this issue. Removing this behavior would make my life easier in this particular case.