Closed michaelrsweet closed 19 years ago
Original reporter: Michael Sweet
Hmm, how do current browsers handle this? Breaking without whitespace doesn't strike me as "the right thing to do", and table formatting has historically been "hard to do right" - I don't want to mess with the current code if I can avoid it.
I'll try playing here with various browsers and also look at the HTML spec to see if the W3C has any guidance...
Original reporter: Michael Sweet
It's how mozilla handled the test case ... and this one was tough to find:
htmldoc wasn't handling www.nytimes.com at all well. I tried to run a saved copy thru amaya and then looked at the nicely formatted result w/ htmldoc, at which point several of the problems went away.
I finally hand-formatted the offending section of the nytimes html (for readability, indenting ...) line by line until I found that htmldoc could handle it - the key difference was that the original html was <img ...><comment ...><img ...> -- inside of a table.
parse_paragraph() was treating those as unaligned images (they were), but since there was no whitespace, the first loop in the handle-inline-images section wasn't exiting until the 2'nd image was evaluated, doubling the width of that table column and goofing up the format of the whole page.
the hand-formatted html was:
<img>
<comment>
<img>
-- just the newlines caused the whitespace flag to be set and so that loop was exited after handling the 1'st image.
I too wondered about the correctness of the fix, but mozilla is my referee. I can't claim this one fix will cause nytimes.com to be rendered correctly but it seemed worth mentioning. w/ the fixes I've made it now renders that page well.
Mark
Original reporter: Michael Sweet
Mark,
I've applied your change and a similar set of changes in get_cell_size() for 1.8.24, and those changes will be up-ported to the 1.9 code after 1.8.24 is released.
That said, I'm still not able to get a clean rendering of the nytimes.com page. Other pages like cnn.com seem to work perfectly, so I'm not sure if the change has really helped at all...
Original reporter: Michael Sweet
Michael,
I found I hadn't really fixed it as I'd hoped -- here's a fragment from parse_paragraph w/ the latest (hopefully final) version - the only thing I can see that's wrong w/ my implementation is that we add borderspace into temp_width before deciding to bail out: temp_width += 2 * borderspace;
but it should be a small error ...
the previous fix falsely inflated width by the width of temp which caused it to exceed format_width.
here's the sequence of code w/ the new implementation noted. the main changes are doing the check before: prev = temp; temp = temp->next; temp_width += prev->width;
rather than later, and the use of a bool too_wide.
[ my apologies for the length of this fragment -- I thought it easier to get the context of the change. my additions are noted w/ '+' @ the left - note that deletions aren't marked. ]
int too_wide = FALSE;
. . .
// Make stupid compiler warnings go away (if you can't put // enough smarts in the compiler, don't add the warning!) offset = 0.0f; temp_width = 0.0f; temp_height = 0.0f; lineptr = NULL; linex = 0.0f; linewidth = 0.0f;
while (flat != NULL) { start = flat; end = flat; width = 0.0;
while (flat != NULL) { // Get fragments... temp_width = 0.0; temp = flat; whitespace = 0;
while (temp != NULL && !whitespace) {
too_wide = FALSE; if (temp->markup == MARKUP_NONE && temp->data[0] == ' ') { if (temp == start) temp_width -= _htmlWidths[temp->typeface][temp->style][' '] * _htmlSizes[temp->size]; else if (temp_width > 0.0f) whitespace = 1; } else whitespace = 0;
if (whitespace)
break;
if (temp->markup == MARKUP_IMG)
{ if ((border = htmlGetVariable(temp, (uchar )"BORDER")) != NULL) borderspace = atof((char )border); else if (temp->link) borderspace = 1; else borderspace = 0;
borderspace *= PagePrintWidth / _htmlBrowserWidth;
temp_width += 2 * borderspace;
}
/* if we're at/beyond the known limits of our working width, break
out. don't just rely on a MARKUP_BR to do this!
*/
if ( ( temp->width + temp_width ) >= format_width ) {
too_wide = TRUE;
break;
}
prev = temp;
temp = temp->next;
temp_width += prev->width;
if (prev->markup == MARKUP_BR)
break;
}
if ((width + temp_width) <= format_width) { width += temp_width; end = temp; flat = temp;
+------------vvvvvvvvvvv
==================================================================
attached is a very simple test image w/ its one image. they should render one above the other, not side by side.
Mark
-- Mark Pilon Konica Minolta Printing Solutions U.S.A.
Original reporter: Michael Sweet
Mark, can you test against 1.8.24rc1 and let me know if the problem still exists.
If it does, can you provide an updated patch against the 1.8.24rc1 sources?
Thanks!
Original reporter: Michael Sweet
This STR has not been updated by the submitter for two or more weeks and has been closed as required by the HTMLDOC Configuration Management Plan. If the issue still requires resolution, please re-submit a new STR.
"too_wide.html":
<table width=200>
<tr>
<td>
<img src=us_cnn_img.gif><img src=us_cnn_img.gif>
</td>
</tr>
</table>
Version: 1.8-current Original reporter: Michael Sweet
Michael,
I added a fix to parse_paragraph to handle the case of inlined images w/o any whitespace between them -- parse_paragraph was putting them on the same line, even though their combined width would exceed format_width.
this most frequently happened in a table row w/ a set width, that == to the width of the images to go in the row, blowing out the side of the row.
the following is from htmldoc 1.9: [ps-pdf.cxx v.1.99], my patch is actually to an earlier version w/o margins ...
line 4349: if (temp->element == HD_ELEMENT_IMG) { if ((border = htmlGetVariable(temp, (uchar )"BORDER")) != NULL) borderspace = atof((char )border); else if (temp->link) borderspace = 1; else borderspace = 0;
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv / if we're at/beyond the known limits of our working width, break out. KM-PATCH. / if ( temp_width >= format_width ) { break; } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
since I'm working from 1.8xx format_width would probably be replaced w/ margins->width() ... ?
Have a good 4th,
Mark