michaelrsweet / htmldoc

HTML Conversion Software
https://www.msweet.org/htmldoc
GNU General Public License v2.0
208 stars 47 forks source link

Suggested patch to parse_paragraph() #6

Closed michaelrsweet closed 19 years ago

michaelrsweet commented 20 years ago

Version: 1.8-current Original reporter: Michael Sweet

Michael,

I added a fix to parse_paragraph to handle the case of inlined images w/o any whitespace between them -- parse_paragraph was putting them on the same line, even though their combined width would exceed format_width.

this most frequently happened in a table row w/ a set width, that == to the width of the images to go in the row, blowing out the side of the row.

the following is from htmldoc 1.9: [ps-pdf.cxx v.1.99], my patch is actually to an earlier version w/o margins ...

line 4349: if (temp->element == HD_ELEMENT_IMG) { if ((border = htmlGetVariable(temp, (uchar )"BORDER")) != NULL) borderspace = atof((char )border); else if (temp->link) borderspace = 1; else borderspace = 0;

      borderspace *= PagePrintWidth / _htmlBrowserWidth;

      temp_width += 2 * borderspace;
}

    prev       = temp;
    temp       = temp->next;
    temp_width += prev->width;

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv / if we're at/beyond the known limits of our working width, break out. KM-PATCH. / if ( temp_width >= format_width ) { break; } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    if (prev->element == HD_ELEMENT_BR)
  break;
  }

since I'm working from 1.8xx format_width would probably be replaced w/ margins->width() ... ?

Have a good 4th,

Mark

michaelrsweet commented 20 years ago

Original reporter: Michael Sweet

Hmm, how do current browsers handle this? Breaking without whitespace doesn't strike me as "the right thing to do", and table formatting has historically been "hard to do right" - I don't want to mess with the current code if I can avoid it.

I'll try playing here with various browsers and also look at the HTML spec to see if the W3C has any guidance...

michaelrsweet commented 20 years ago

Original reporter: Michael Sweet

It's how mozilla handled the test case ... and this one was tough to find:

htmldoc wasn't handling www.nytimes.com at all well. I tried to run a saved copy thru amaya and then looked at the nicely formatted result w/ htmldoc, at which point several of the problems went away.

I finally hand-formatted the offending section of the nytimes html (for readability, indenting ...) line by line until I found that htmldoc could handle it - the key difference was that the original html was <img ...><comment ...><img ...> -- inside of a table.

parse_paragraph() was treating those as unaligned images (they were), but since there was no whitespace, the first loop in the handle-inline-images section wasn't exiting until the 2'nd image was evaluated, doubling the width of that table column and goofing up the format of the whole page.

the hand-formatted html was:

<img>
<comment>
<img>

-- just the newlines caused the whitespace flag to be set and so that loop was exited after handling the 1'st image.

I too wondered about the correctness of the fix, but mozilla is my referee. I can't claim this one fix will cause nytimes.com to be rendered correctly but it seemed worth mentioning. w/ the fixes I've made it now renders that page well.

Mark

michaelrsweet commented 20 years ago

Original reporter: Michael Sweet

Mark,

I've applied your change and a similar set of changes in get_cell_size() for 1.8.24, and those changes will be up-ported to the 1.9 code after 1.8.24 is released.

That said, I'm still not able to get a clean rendering of the nytimes.com page. Other pages like cnn.com seem to work perfectly, so I'm not sure if the change has really helped at all...

michaelrsweet commented 20 years ago

Original reporter: Michael Sweet

Michael,

I found I hadn't really fixed it as I'd hoped -- here's a fragment from parse_paragraph w/ the latest (hopefully final) version - the only thing I can see that's wrong w/ my implementation is that we add borderspace into temp_width before deciding to bail out: temp_width += 2 * borderspace;

but it should be a small error ...

the previous fix falsely inflated width by the width of temp which caused it to exceed format_width.

here's the sequence of code w/ the new implementation noted. the main changes are doing the check before: prev = temp; temp = temp->next; temp_width += prev->width;

rather than later, and the use of a bool too_wide.

[ my apologies for the length of this fragment -- I thought it easier to get the context of the change. my additions are noted w/ '+' @ the left - note that deletions aren't marked. ]

+------------vvvvvvvvvvv

==================================================================

attached is a very simple test image w/ its one image. they should render one above the other, not side by side.

Mark

-- Mark Pilon Konica Minolta Printing Solutions U.S.A.

michaelrsweet commented 19 years ago

Original reporter: Michael Sweet

Mark, can you test against 1.8.24rc1 and let me know if the problem still exists.

If it does, can you provide an updated patch against the 1.8.24rc1 sources?

Thanks!

michaelrsweet commented 19 years ago

Original reporter: Michael Sweet

This STR has not been updated by the submitter for two or more weeks and has been closed as required by the HTMLDOC Configuration Management Plan. If the issue still requires resolution, please re-submit a new STR.

michaelrsweet commented 20 years ago

"too_wide.html":


<table width=200>
<tr>
<td>
<img src=us_cnn_img.gif><img src=us_cnn_img.gif>
</td>
</tr>
</table>