perl-pod / pod-simple

Framework for Parsing and Formatting POD
http://search.cpan.org/dist/Pod-Simple/
44 stars 60 forks source link

Invalid HTML produced by Pod::Simple::HTML #155

Closed PhilterPaper closed 1 year ago

PhilterPaper commented 1 year ago

I have been using pod2html to produce HTML from Perl modules. Once reported errors in the POD are fixed, it produces acceptable HTML when viewed in a browser, but a long list of complaints when thrown against an HTML validator such as https://validator.w3.org. So, I've been trying to cut over to Pod::Simple::HTML, in hopes of cleaner HTML, but with worse results (even more complaints from the validator).

The majority of problems appear to be related to doing lists. Apparently, =over through =back creates a definition list in HTML, with everthing =item in the <dt> and either an empty <dd> (or missing altogether for the last =item in a list of them!). I guess that the first question I should ask is if I'm using POD/PSH correctly to generate lists of things, as bulleted/unordered, numbered/ordered, definition, and simple lists. Or can they just not be done?

Secondly, does =over always produce a definition list in order to get indentation? It seems to really like producing them, although sometimes it seems to produce <ul> lists, especially if I start the item text with a star.

=over

=item 0. Fill text

=item 1. Stroke text (outline)

=item 2.  Fill, then stroke text

=back

produces

<dl>
<dt><a name="0._Fill_text"
>0. Fill text</a></dt>

<dd>
<dt><a name="1._Stroke_text_(outline)"
>1. Stroke text (outline)</a></dt>

<dd>
<dt><a name="2._Fill,_then_stroke_text"
>2. Fill, then stroke text</a></dt>
<dl>

which will render in a browser, but a validator complains about it (especially the missing <dd> on the last item, but not any of the missing </dd>s). Is there a better formatting to get the effect I want (preferably with bolded numbers and normal text)? Note that an ordered list may not be appropriate here, as the numbers start with 0. Of course, this POD might be read with other purposes than producing HTML, so it should still be legal POD!

There are some other issues, such as making each line an anchor, which gives problems of duplicate names when two list items (or a list item and a heading) in the document happen to be the same text. Also, the validator wants name= to be id=. Finally, in the "index" area the list items do not end with </li> (although, strangely, the validator doesn't complain).

haarg commented 1 year ago
  • To get a bulleted list, with the standard bullets, without bolding the entire line. Should this always work: =item * This is Item 1?

  • To get a numbered list, with the numbers bolded. I would hope =item 1. This is Item 1 would work, but it doesn't seem to.

  • To get a definition list with the term bolded, and the rest normal text, should =item myterm my definition work? I can't figure out how to do this.

You need to separate the list content in a separate paragraph from the list "marker".

=head1 Ordered

=over

=item 1.

First

=item 2.

Second

=back

=head1 Unordered

=over

=item *

First

=item *

Second

=back

=head1 Definition

=over

=item first

First

=item second

Second

=back
  • To get a simple list with no bolding and no marker -- is that possible?

Pod doesn't have this concept and neither does HTML, so I'm not sure what you are expecting.

which will render in a browser, but a validator complains about it (especially the missing <dd> on the last item, but not any of the missing </dd>s). Is there a better formatting to get the effect I want (preferably with bolded numbers and normal text)? Note that an ordered list may not be appropriate here, as the numbers start with 0. Of course, this POD might be read with other purposes than producing HTML, so it should still be legal POD!

If you want to produce a definition list with numeric topics, you can prefix them with a Z<> formatting code.

=over

=item Z<>0.

Fill text

=item Z<>1.

Stroke text (outline)

=item Z<>2.

Fill, then stroke text

=back

There are some other issues, such as making each line an anchor, which gives problems of duplicate names when two list items (or a list item and a heading) in the document happen to be the same text. Also, the validator wants name= to be id=. Finally, in the "index" area the list items do not end with </li> (although, strangely, the validator doesn't complain).

Omitting ending tags in these contexts is allowed by the HTML spec.

Pod::Simple::HTML is rather old and produces HTML4 output, so a modern validator won't like it much. But the only thing I'm seeing here that I'd say is a bug is producing duplicate name attributes.

You'll get better output from Pod::Simple::XHTML. It will close all of its tags, and use unique ids rather than name. It also won't add ids to definition lists by default, although it has an option (anchor_items) to do so.

PhilterPaper commented 1 year ago

Thanks for the response. Lots of things there for me to try out. I will have to give Pod::Simple::XHTML a try, and clean up all my lists now that I know what is expected to produce certain kinds of lists.

Pod doesn't have [simple list] and neither does HTML, so I'm not sure what you are expecting.

It's something that some markup languages have, where the list is indented, no marker is used, and there is extra vertical space between list items (as done with other lists). It can be produced in HTML by setting list-style-type: none in CSS for an unordered list.

<sl>
<li> Item 1.
<li> Item 2.
</sl>

produces

    Item 1.

    Item 2.

Omitting ending tags in these contexts is allowed by the HTML spec.

Sometimes it complains about missing end tags, and sometimes it doesn't. It may depend on the HTML version it's being validated to? I'm trying to get everything over to HTML 5 at this time, and may have a mixture of things (formerly XHTML). If 1) the W3C validator isn't going to complain, and 2) it's a pain for Pod::Simple to add end tags, I won't worry about it, so long as it validates cleanly.

PhilterPaper commented 1 year ago

OK, I played with it a bit, and produced documentation from Issue155.pod.txt (remove .txt extension) in 4 different ways (remove .txt extensions from all):

  1. perldoc (displays as expected)
  2. pod2html (output Issue155A.html.txt)
  3. like 2., but manually change from XHTML to HTML5 (Issue155B.html.txt)
  4. produced with Pod::Simple::XHTML (Issue155C.html.txt)

Isuue155.pod contains comments on manual changes made to HTML file, display results, validation results (validator.w3.org). It seems to be a common problem that some tags contain unwanted trailing / ("self closing"), and have <p> and <dl> tags directly under a <ul> (after an =over directive without an immediate =item). The Z<> directives seem to work sometimes (render as definition list), not not others (render as ordered list). While they seem to render OK, I really want to get rid of validator errors.

I'm hoping that you'll take a look at these examples, and either spot errors I made, or better ways to use these tools that get around the problems I'm reporting. Or, they could be bugs in PSX. I provide a tool in PDF::Builder that currently uses pod2html to create a full set of HTML files for my .pm files, and am willing to change to PSX if it will provide better formatting. I also wrap some PHP around them and provide online documentation on my website.

Once I get clean HTML that displays nicely and validates cleanly, I can look at polishing it up, such as CSS to bold <dt> and list markers, and possibly put the <dd> on the same line as <dt>. I really don't like the definition being on a separate line from the term, and thought I recalled seeing ways to have the definition being on the same line, except if the term exceeds a certain length, in which case it breaks as at present, but that might have been LaTeX or IBM GML and not HTML.

haarg commented 1 year ago

The self-closing / is allowed (as a noop) on void tags by the HTML5 spec. If the validator is complaining about that, it's being overly pedantic. Pod::Simple::XHTML will include the trailing / on some elements in its header, which should be expected.

I'm not clear what your complaint is about Z<>. The entire point of using it is force the list to be handled as a definition list rather than an ordered list, which is what happens with all of the formatters you tried. If you want an ordered list, don't include it. Pod does not have a concept of ordered lists that don't start with 1 and increment by 1.

Having an =over section with text rather than =items producing a <ul> element does seem like a bug. That it should probably result in a <blockquote>.

haarg commented 1 year ago

156

PhilterPaper commented 1 year ago

Thank you for addressing the incorrect <ul> issue. That's probably the biggest problem here, that I can't work around. Notice that perldoc is happy to indent the paragraphs, while pod2html and PSX try to use <ul> to do it, resulting in invalid HTML. A <blockquote> is probably as good a way to handle this as any, although you will of course have to look ahead to see if an =item is coming next, rather than some sort of block construct. Also don't forget that arbitrary further constructs (e.g., <dl>) need to be handled properly. You might want to consider enclosing the entire =over block in a <div> with a left margin of the desired indentation (just be sure that an =item isn't upcoming next).

If the validator is complaining about that, it's being overly pedantic.

It is complaining, and it is probably overly pedantic. Unless someone can show that an explicit closing tag (or / self-closer) is required for the desired HTML level, I think you should just leave out the /.

I'm not clear what your complaint is about Z<>.

If we're using perldoc as something of a reference standard for POD, it appears to ignore (?) Z<> and produce an <ol> type structure, rather than <dl>-like. Is Z<> official POD? For all the other utilities, it appears to work as intended.

haarg commented 1 year ago

A <blockquote> is probably as good a way to handle this as any, although you will of course have to look ahead to see if an =item is coming next, rather than some sort of block construct. Also don't forget that arbitrary further constructs (e.g., <dl>) need to be handled properly. You might want to consider enclosing the entire =over block in a <div> with a left margin of the desired indentation (just be sure that an =item isn't upcoming next).

The type of =over block is already understood by the parser. Having content before the first =item directives inside an =over violates the Pod spec. The formatter was just making a bad choice for how to output bare =over sections. Including CSS rules in the output would make it harder to use as part of other tools.

It is complaining, and it is probably overly pedantic. Unless someone can show that an explicit closing tag (or / self-closer) is required for the desired HTML level, I think you should just leave out the /.

Pod::Simple::XHTML is the current best formatter for producing HTML. It obviously is going to produce XHTML compatible output. The only part including self closing tags is the header, which can be omitted or replaced if desired. I really don't see any point in changing this. pod2html is part of perl core. It uses Pod::Simple for parsing, but has its own formatter. Any issues with it should be reported to the perl5 issue tracker.

It would be possible to create a new HTML5 formatter. It could have a different header, and possibly produce a <nav> for the index. But that's a new formatter, not a bug fix.

If we're using perldoc as something of a reference standard for POD, it appears to ignore (?) Z<> and produce an <ol> type structure, rather than <dl>-like. Is Z<> official POD? For all the other utilities, it appears to work as intended.

All list types are handled essentially the same by perldoc (which uses the Pod::Man formatter by default). A left aligned "topic" and an indented body. If the topic is short enough, the body is displayed on the same line. The only difference in output between a numeric list and a definition list is that definition list topics tend to be longer, so they are usually displayed on a separate line. Z<> is a formatting code that produces no output. Its use with lists is just to prevent them from being interpreted as an ordered list, as those must match a rather strict format. This is covered in perlpod and perlpodspec.

PhilterPaper commented 1 year ago

The type of =over block is already understood by the parser. Having content before the first =item directives inside an =over violates the Pod spec.

Even though it appears to work OK with perldoc? (Is that a bug on perldoc's part?) So, is there any way to indent a section (paragraphs and lists) to make it clear that it is under a heading, without using =over? What does this bullet from perlpod mean?

The first thing after the "=over" command should be an "=item", unless there aren't going to be any items at all in this "=over" ... "=back" region.

It's rather unfortunate that they used =over instead of =list, if they meant it only to mean a list of some sort, but what's done is done.

Add: I see that perlpodspec does mention that a <blockquote> could be used instead of an ordered, unordered, or definition/description list. It would seem to me that it would be reasonable to allow the mixing of plain paragraphs (blockquotes) with various lists, although it might be difficult once you've started a list to decide whether a bare paragraph is meant to be part of the current <li> or marks the actual end of the list. Would it be allowable to require a second =over and =back to start/end a list within an indented section? Otherwise, could it just do paragraphs after =over as blockquotes, until an =item is seen, and leave it to the author to explicitly end a list (=back) and restart it with =over if they want a paragraph that is not part of a list item (but still indented), and want to resume with further list items after that.

haarg commented 1 year ago

The type of =over block is already understood by the parser. Having content before the first =item directives inside an =over violates the Pod spec.

Even though it appears to work OK with perldoc? (Is that a bug on perldoc's part?)

Most of the tools, including perldoc, are rather permissive. They will try to make the best of content, even if it is not valid. But also, perldoc will complain about errors like this.

So, is there any way to indent a section (paragraphs and lists) to make it clear that it is under a heading, without using =over?

Content after a heading is implicitly "under" that heading, just like it is in HTML. Indenting is a formatting choice. If you want to force the issue, you can use an =over, but that still may not be formatted exactly like you want. If you want to include paragraphs and lists, you need either separate or nested =over blocks. This is again, just like HTML.

What does this bullet from perlpod mean?

The first thing after the "=over" command should be an "=item", unless there aren't going to be any items at all in this "=over" ... "=back" region.

This is essentially the same thing I was saying, phrased differently. An =over section should either have normal pod content in it, or it should have =items and their content. It can't have both. If you want both, you need another =over to contain the =items.

=over is the equivalent of either <dl>, <ul>, <ol>, or <blockquote>. Which one of these it is equal to is determined by its content. But it can only be one of them.

PhilterPaper commented 1 year ago

=over is the equivalent of either <dl>, <ul>, <ol>, or <blockquote>. Which one of these it is equal to is determined by its content. But it can only be one of them.

Fair enough, if the way to mix them would be to have separate =over/=back blocks for each kind of indented content you want. Do I then understand that PSX does not currently correctly implement the blockquote for an indented paragraph, or was that only due to mixing paragraphs and lists?

PhilterPaper commented 1 year ago

Using separate blocks as above seems to cure the problem of p and dl under a ul, but I found something else.

=over

=item 0. 

Fill text

=item 1.

Stroke text

=back

gives a POD error for the second one (item 1.) that it is

Expected text after =item, not a number

It produces a <dl> (not an <ol start=0> as might be hoped). Note that Z<> was not used here, in hopes of an ordered list. Is there anything that I can do, beyond using Z<> to force a dl? Since it had already decided to start a dl, having seen "0." (isn't that a number?), shouldn't it just quietly accept "1." as a dt?

This isn't a show-stopper, and I can work around it (use Z<>), but it is a minor annoyance.

PhilterPaper commented 1 year ago

Oops, spoke too soon. The W3C validator is still finding <p> under <ul> (both pod2html and PSX). While a browser displays it OK, the validator is unhappy. The POD to produce this is:

=head2 Indented paragraphs

=head3 new

    $annotation = PDF::Builder::Annotation->new()

=over

Returns an annotation object (called from $page->annotation()). These two
paragraphs should be indented.

It is normally I<not> necessary to explicitly call this method (see examples).
The list of items should be indented under these paragraphs.

=back

=over

=item Item 1

Some text about item 1.

=item Item 2

Some text about item 2.

=back

=cut

and produces HTML

<h2 id="Indented-paragraphs">Indented paragraphs</h2>

<h3 id="new1">new</h3>

<pre><code>    $annotation = PDF::Builder::Annotation-&gt;new()</code></pre>

<ul>

<p>Returns an annotation object (called from $page-&gt;annotation()). These two paragraphs should be indented.</p>

<p>It is normally <i>not</i> necessary to explicitly call this method (see examples). The list of items should be indented under these paragraphs.</p>

</ul>

<dl>

<dt>Item 1</dt>
<dd>

<p>Some text about item 1.</p>

</dd>
<dt>Item 2</dt>
<dd>

<p>Some text about item 2.</p>

</dd>
</dl>

Should that be

<h2 id="Indented-paragraphs">Indented paragraphs</h2>

<h3 id="new1">new</h3>

<pre><code>    $annotation = PDF::Builder::Annotation-&gt;new()</code></pre>

<blockquote>Returns an annotation object (called from $page-&gt;annotation()). These two paragraphs should be indented.</blockquote>

<blockquote>It is normally <i>not</i> necessary to explicitly call this method (see examples). The list of items should be indented under these paragraphs.</blockquote>

<dl>

<dt>Item 1</dt>
<dd>

<p>Some text about item 1.</p>

</dd>
<dt>Item 2</dt>
<dd>

<p>Some text about item 2.</p>

</dd>
</dl>

(no list, and turn paragraphs into block quotes)? I think you mentioned something about that earlier, but may have only been referring to pod2html.

Also, is there any real reason to keep self-closing /'s on <meta> etc. tags? The validator complains about them -- are there any levels of HTML that the validator requires them on? It's easy enough for my code to change /> to >, but it would be nice not to have to.

haarg commented 1 year ago

Do I then understand that PSX does not currently correctly implement the blockquote for an indented paragraph, or was that only due to mixing paragraphs and lists?

The released version of Pod::Simple::XHTML still incorrectly produces <ul> for an indented paragraph. The fix to use <blockquote> for this was merged, but hasn't been released yet.

Using separate blocks as above seems to cure the problem of p and dl under a ul, but I found something else. ...

Quoting perlpodspec:

An "=over" ... "=back" region containing only m/\A=item\s+\d+.?\s\z/ paragraphs, each one (or each group of them) followed by some number of ordinary/verbatim paragraphs, other nested "=over" ... "=back" regions, "=for..." paragraphs, and/or "=begin"..."=end" codes. Note that the numbers must start at 1 in each section, and must proceed in order and without skipping numbers. ... The "=item [text]" paragraph should not match m/\A=item\s+\d+.?\s\z/ or m/\A=item\s+*\s\z/, nor should it match just m/\A=item\s\z/.

The parser should probably complain about the =item 0, but either way it is not a valid numeric list. And if it has already decided that it is a definition list, having plain numbers as the text is not allowed.

Also, is there any real reason to keep self-closing /'s on <meta> etc. tags? The validator complains about them -- are there any levels of HTML that the validator requires them on? It's easy enough for my code to change /> to >, but it would be nice not to have to.

The module is called Pod::Simple::XHTML. It produces XHTML. XHTML requires the self closing /.

PhilterPaper commented 1 year ago

OK, I'm looking forward to the next update to do the <blockquote> fix. In the meantime, I will use PSX and ignore the validation errors for now. I'll close this issue (reopen it if you need to hold it open for some reason).