jgm / pandoc

Universal markup converter
https://pandoc.org

HTML import drops <h1 class="title"> #2293

Closed dfc closed 3 years ago

dfc commented 8 years ago

pandoc drops H1 headings if they contain class=title.

Example html file:


<html>
<head>
<title>presence of title in head does not change behavior</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<h1 class="title">This text disappears</h1>
Some text
</body>
</html>

Importing the above html to markdown yields:

Some text

I was expecting

This text disappears
================

Some text

In a strange turn of events H2s with class=title are not ignored.

Importing:


<html>
<head>
<title>words</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<h1 class="title">This text disappears</h1>

Some text

<h2 class="title">H2s dont disappear</h2>

<h1 class="something">This text does not disappear</h1>

Some more text

<h2 class="something">More text appears here</h2>

</body>
</html>

Yields the following:

Some text

H2s dont disappear {.title}
------------------

This text does not disappear {.something}
============================

Some more text

More text appears here {.something}
----------------------
dfc commented 8 years ago

The output of --trace is confusing to me. I ran pandoc -f html -t markdown --trace --verbose example on the following file (I added the line numbers to the example file to make it easier to decode the trace output):

     1  <html>
     2  <head>
     3  <title>pressence of title in head does not change behavior</title>
     4  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
     5  <meta http-equiv="Content-Style-Type" content="text/css">
     6  </head>
     7  <body>
     8  <h1 class="title">This text disappears</h1>
     9  Some text
    10  </body>
    11  </html>
$ pandoc   --verbose --trace -f html -t markdown  -S  example 
line 1: []
line 1: []
line 1: []
line 6: []
line 7: []
line 7: []
line 8: [Plain [Str "Some",Space,Str "text"]]
line 6: [Plain [Str "Some",Space,Str "text"]]
line 10: []
line 10: []
line 11: []
Some text
$

It seems odd that line 8 is parsed/printed before line 6 AND that they are the same. Hopefully this helps you out.

I apologize for not including this earlier.

pandoc 1.15.0.4
Compiled with texmath 0.8.2.2, highlighting-kate 0.6.
Syntax highlighting is supported for the following languages:
    abc, actionscript, ada, agda, apache, asn1, asp, awk, bash, bibtex, boo, c,
    changelog, clojure, cmake, coffee, coldfusion, commonlisp, cpp, cs, css,
    curry, d, diff, djangotemplate, dockerfile, dot, doxygen, doxygenlua, dtd,
    eiffel, email, erlang, fasm, fortran, fsharp, gcc, glsl, gnuassembler, go,
    haskell, haxe, html, idris, ini, isocpp, java, javadoc, javascript, json,
    jsp, julia, kotlin, latex, lex, lilypond, literatecurry, literatehaskell,
    lua, m4, makefile, mandoc, markdown, mathematica, matlab, maxima, mediawiki,
    metafont, mips, modelines, modula2, modula3, monobasic, nasm, noweb,
    objectivec, objectivecpp, ocaml, octave, opencl, pascal, perl, php, pike,
    postscript, prolog, pure, python, r, relaxng, relaxngcompact, rest, rhtml,
    roff, ruby, rust, scala, scheme, sci, sed, sgml, sql, sqlmysql,
    sqlpostgresql, tcl, tcsh, texinfo, verilog, vhdl, xml, xorg, xslt, xul,
    yacc, yaml, zsh
Default user data directory: /home/dfc/.pandoc
Copyright (C) 2006-2015 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

Thanks for your help, Prof. MacFarlane.

jgm commented 8 years ago

Yes, this is intentional. When pandoc produces HTML itself, it uses an h1 with class "title" to render the document title. So when parsing HTML it doesn't include this in the document body.

This behavior goes WAY back, but it's probably not the best idea. I think I was doing round trip testing, and without this behavior you'd get a double title on round trip...
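
One workaround follows from the behavior described here. Only an h1 whose class is "title" is skipped (the second example in the report shows <h1 class="something"> surviving), so removing that class before conversion should keep the heading in the body. Below is a minimal Python sketch, assuming regex-level rewriting is adequate for the input. `strip_title_class` is a hypothetical helper, not part of pandoc.

```python
import re

# Opening <h1 ...> tag, case-insensitive; group 2 holds the raw attribute text.
_H1_OPEN = re.compile(r"<(h1)\b([^>]*)>", re.IGNORECASE)
# A quoted class attribute, including the whitespace that precedes it.
_CLASS_ATTR = re.compile(r"""\s+class\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)


def strip_title_class(html: str) -> str:
    """Remove the 'title' class from every <h1> so it reads as an ordinary heading.

    Hypothetical pre-processing step; it does not change pandoc itself.
    """

    def fix_class(match: re.Match) -> str:
        quote, value = match.group(1), match.group(2)
        kept = [name for name in value.split() if name != "title"]
        if not kept:
            return ""  # drop the now-empty class attribute entirely
        return f" class={quote}{' '.join(kept)}{quote}"

    def fix_h1(match: re.Match) -> str:
        tag, attrs = match.group(1), match.group(2)
        return f"<{tag}{_CLASS_ATTR.sub(fix_class, attrs)}>"

    return _H1_OPEN.sub(fix_h1, html)
```

The rewritten HTML could then be passed to pandoc -f html -t markdown as usual. Regex rewriting is brittle on unusual markup (for example, a ">" inside an attribute value), so this is a stopgap rather than a robust fix.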

nkalvi commented 8 years ago

@jgm Pardon me if I'm totally off: isn't it better to convert the title in HTML document head to title in meta-data block in markdown (and not treat h1 with title class in a special way)? If the current behavior is to be maintained, perhaps documenting it might be helpful.

BTW, I did a quick test using title class in markdown but the HTML output's title is empty in the head section:

cat <<EOF | pandoc -t html -s
#Title {.title}

body
EOF

Result:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title></title>
  <style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<h1 id="title" class="title">Title</h1>
<p>body</p>
</body>
</html>
jgm commented 8 years ago

+++ nkalvi [Jul 11 15 15:55 ]:

@jgm Pardon me if I'm totally off: isn't it better to convert the title in HTML document head to title in meta-data block in markdown (and not treat h1 with title class in a special way)? If the current behavior is to be maintained, perhaps documenting it might be helpful.

<title> in <head> IS converted to title in metadata. The issue is that, often, the <h1 class="title"> element is just a visible manifestation on the page of the title. So if we parse it like a regular h1, we get two versions of the same thing -- one in the metadata title, another as a header in the body. And this will give bad results for most of pandoc's output formats -- e.g. if you try it in LaTeX, you'll get a title printed on the page, and then a first-level section with the same text.

BTW, I did a quick test using title class in markdown but the HTML output's title is empty in the head section:

Right, this is as expected -- the special behavior above does not make sense for Markdown, since in Markdown you have a visual representation of the title already in the metadata, and there's no need to repeat it as an element of the body.

nkalvi commented 8 years ago

@jgm Thanks for kindly explaining this; I did see the title being treated as expected in HTML head/meta data block. Wouldn't it be helpful to indicate it in the documentation still?

nkalvi commented 8 years ago

Could the handling be improved by checking for the presence of a title first before suppressing it?

dfc commented 8 years ago

Yes, this is intentional. When pandoc produces HTML itself, it uses an h1 with class "title" to render the document title. So when parsing HTML it doesn't include this in the document body.

This behavior goes WAY back, but it's probably not the best idea. I think I was doing round trip testing, and without this behavior you'd get a double title on round trip...

Hello Prof. MacFarlane (@jgm),

I promise I am not trying to be obtuse. Your explanation makes sense when I read it, but when I try to apply this understanding everything falls apart.

Given my understanding that pandoc uses/treats/approaches H1s with class=title as the title of an html file; given the following input:

<h1 class="title">This is a L1 Heading Title</h1>
Some text
<h2>Second Level Heading</h2>
Some more text

I expect that pandoc -f html -t markdown -s --atx-headers input will yield:

---
title: This is a L1 Heading Title
...

Some text

## Second Level Heading

Some more text

The expectation is that pandoc will use the thing that pandoc commonly considers a title as the title for the created document.

Side Note:

I understand that the world is messy and at times compromises need to be made. But it seems a little strange to me that pandoc drops input without warning the user. Maybe --trace or --verbose could be put to work in order to provide the user with some sort of notification that pandoc intentionally threw away some input it was given.
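
The kind of notification asked for here can be approximated outside pandoc with a pre-flight check. The sketch below uses Python's standard html.parser to list every <h1> whose class list contains "title", together with its line number. `TitleHeadingFinder` and `dropped_heading_warnings` are hypothetical names, and the warning wording is invented for illustration. It is not output that pandoc produces.

```python
from html.parser import HTMLParser


class TitleHeadingFinder(HTMLParser):
    """Collect the line number and text of every <h1> whose classes include 'title'."""

    def __init__(self) -> None:
        super().__init__()
        self.found = []      # list of (line_number, heading_text)
        self._line = None    # line of the matching <h1> currently open, if any
        self._text = []      # text fragments seen inside that <h1>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and "title" in classes:
            self._line = self.getpos()[0]
            self._text = []

    def handle_data(self, data):
        if self._line is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._line is not None:
            self.found.append((self._line, "".join(self._text).strip()))
            self._line = None


def dropped_heading_warnings(html: str) -> list[str]:
    """Return one warning per <h1 class="title"> that the reader would skip."""
    finder = TitleHeadingFinder()
    finder.feed(html)
    finder.close()
    return [
        f'line {line}: <h1 class="title"> "{text}" would be omitted from the body'
        for line, text in finder.found
    ]
```

Running this over an input file before conversion would flag the heading from the original report, while leaving <h2 class="title"> and <h1 class="something"> unreported, matching the behavior observed above.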

nkalvi commented 8 years ago

@dfc From what I understood, <h1 class="title">This is a L1 Heading Title</h1> is considered "just a visible manifestation on the page of the title", i.e. just a repeat of the <title>This is a L1 Heading Title</title> in the <head> of the HTML document, and thus ignored.

I still feel it can be handled better (see my earlier comments). BTW, obtuse? - I like it :smile:

jgm commented 8 years ago

Douglas, you're right that the <h1 class="title"> isn't used to populate the "title" field of metadata on reading HTML. The <title> element of the head is used for that, and I don't think we'd want to override that. Perhaps if <title> is empty it would make sense to do this, I don't know.
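
The fallback floated here (use the <h1 class="title"> text only when <title> is empty) can be sketched as follows. This is an illustration of the proposed rule in Python, not pandoc's reader code. `TitleSources` and `choose_metadata_title` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleSources(HTMLParser):
    """Record the <title> text and the text of the first <h1> with class 'title'."""

    def __init__(self) -> None:
        super().__init__()
        self.head_title = ""
        self.h1_title: Optional[str] = None
        self._target = None   # "head", "h1", or None while not capturing
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._target, self._buf = "head", []
        elif tag == "h1" and self.h1_title is None:
            if "title" in (dict(attrs).get("class") or "").split():
                self._target, self._buf = "h1", []

    def handle_data(self, data):
        if self._target is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if tag == "title" and self._target == "head":
            self.head_title, self._target = text, None
        elif tag == "h1" and self._target == "h1":
            self.h1_title, self._target = text, None


def choose_metadata_title(html: str) -> Optional[str]:
    """Prefer a non-empty <title>; otherwise fall back to the h1.title text."""
    sources = TitleSources()
    sources.feed(html)
    sources.close()
    return sources.head_title or sources.h1_title or None
```

With the four-line input from the comment above (no <title> at all), this rule would yield "This is a L1 Heading Title" as the metadata title. When <title> is present and non-empty, it wins and the heading text is ignored.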

I'm half tempted just to remove the special treatment of <h1 class="title">. You are not the first to ask about it. On the other hand, it's hard to know how many people might be relying on this behavior.

Jmuccigr commented 8 years ago

FWIW, not me. :-) Remove away.

ondras commented 8 years ago

For the record, I also ran into this issue. Removing this behavior would make my life easier in this particular case.