No line wrapping for - CJK

GoogleCodeExporter commented 9 years ago

*** This issue was imported from http://java.net/jira/browse/XHTMLRENDERER-206

It was reported by jrm0815 on 21.12.2007 18:18:51 +0100 and last updated in the 
previous bug tracker on 20.07.2008 19:30:29 +0200

Found in
Operating System: All
Platform: All

The priority for this issue at migration was Major.
The original issue had attachments to it; see comments below.

Original description: 
Hi, we are using Flying Saucer to do XHTML rendering for reporting.  We are
beginning to do a lot of CJK reports.  I discovered that the Breaker class seems
to break only on whitespace.  CJK text commonly contains little or no
whitespace, so the existing algorithm treats most of it as unbreakable.  A more
sophisticated approach would be helpful.  We use IBM's icu4j for some of our CJK
line breaking (in other contexts) with very good results.  In our limited
experience, the IBM line breaker is an improvement over the Java one.
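
To illustrate the difference described above, here is a minimal sketch (not
Flying Saucer code) that compares the break opportunities a whitespace-only
rule finds with those reported by the JDK's java.text.BreakIterator line
instance.  The class and method names are made up for this example, and the
sample string is an arbitrary run of Chinese characters written with Unicode
escapes.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CjkBreakOpportunities {

    /** Offsets where a whitespace-only rule could wrap: the start of each word after whitespace. */
    static List<Integer> whitespaceBreaks(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 1; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i - 1))
                    && !Character.isWhitespace(text.charAt(i))) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    /** Interior offsets that the JDK line BreakIterator reports as break opportunities. */
    static List<Integer> iteratorBreaks(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        it.first();
        for (int pos = it.next(); pos != BreakIterator.DONE; pos = it.next()) {
            if (pos < text.length()) {
                offsets.add(pos); // skip the trivial boundary at the end of the text
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Ten Chinese characters with no whitespace between them.
        String cjk = "\u6211\u4eec\u6b63\u5728\u4f7f\u7528\u62a5\u8868\u5de5\u5177";
        System.out.println("whitespace rule: " + whitespaceBreaks(cjk));
        System.out.println("BreakIterator:   " + iteratorBreaks(cjk, Locale.CHINESE));
    }
}
```

For Latin text the two rules largely agree; for the CJK sample the whitespace
rule finds nothing to break on, which matches the behavior reported here.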

I haven't had time yet to seriously consider how I might modify the breaking
routine.  Perhaps, only in the unbreakable scenario, we could try to force a
break using a line breaker, but I'm not sure what the best approach is for the
project.  For our purposes, we would always want a break that forces the text
to fit into the given width, whereas in general, given a scrollable view, not
breaking may be preferred.

Of course this can always be handled explicitly in the XHTML, but then the
behavior is basically different for CJK versus western languages, which I
would think is not desired.

We love the project and would have been set back a ton without this tool to
handle these parts of our reports.  I will try to see if I can put something
together in terms of a modified Breaker that works.  I'm just not sure that
what we come up with will be consistent with the rest of the project.

Ideas?

Original issue reported on code.google.com by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

Attachment by jrm0815 on 04.01.2008 00:52:22 +0100:  Breaker.zip, size 19917 
bytes
Download: http://java.net/jira/secure/attachment/27341/Breaker.zip

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

Attachment by jrm0815 on 22.01.2008 00:53:41 +0100:  Breaker_updated.zip, size 
20109 bytes
Download: http://java.net/jira/secure/attachment/27342/Breaker_updated.zip

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

peterbrant wrote on 28.12.2007 15:44:11 +0100:
Sounds great to me.  It should be straightforward to make the text breaking 
algorithm more flexible (e.g. make a Breaker implementation a property of the 
SharedContext or something) (assuming that the break algorithm is still 
fundamentally a first-fit algorithm).

I know next to nothing about non-western scripts so I think we'd look to you 
for the actual algorithm, but I can definitely help with any necessary 
scaffolding, documentation, etc.

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

jrm0815 wrote on 03.01.2008 01:19:15 +0100:
It was pretty trivial to modify the Breaker to implement an interface, and then
add the ability to set an implementation of the interface into the shared
context, per Peter's suggestion.  At that point a custom breaker such as ours
can be used by implementing the interface and setting it into the shared
context.  Of course, the inline boxing needs to reference the shared context's
getter instead of using the Breaker statically.  This may be enough, with some
added documentation or an example of how one might implement another solution.
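
As a rough sketch of the arrangement described above: a breaker interface, two
implementations, and a settings object that holds the active one.  All names
here (TextBreaker's method, LayoutSettings, and so on) are hypothetical
stand-ins for illustration; they are not the contents of the attached patch or
the actual SharedContext API.

```java
import java.text.BreakIterator;
import java.util.Objects;

public class PluggableBreakerSketch {

    /** Hypothetical strategy interface; the interface in the real patch may look different. */
    interface TextBreaker {
        /** Next offset after {@code from} where a line may end, or text.length() if there is none. */
        int nextBreak(String text, int from);
    }

    /** Breaks only after a space character, like the behavior reported in this issue. */
    static final class WhitespaceBreaker implements TextBreaker {
        @Override
        public int nextBreak(String text, int from) {
            int space = text.indexOf(' ', from);
            return space < 0 ? text.length() : space + 1;
        }
    }

    /** Delegates to any BreakIterator (the JDK's here; ICU4J's could be adapted similarly). */
    static final class IteratorBreaker implements TextBreaker {
        private final BreakIterator iterator;

        IteratorBreaker(BreakIterator iterator) {
            this.iterator = Objects.requireNonNull(iterator);
        }

        @Override
        public int nextBreak(String text, int from) {
            iterator.setText(text);
            int next = iterator.following(from);
            return next == BreakIterator.DONE ? text.length() : next;
        }
    }

    /** Hypothetical stand-in for the shared context: layout code asks it for the breaker. */
    static final class LayoutSettings {
        private TextBreaker textBreaker = new WhitespaceBreaker();

        TextBreaker getTextBreaker() {
            return textBreaker;
        }

        void setTextBreaker(TextBreaker textBreaker) {
            this.textBreaker = Objects.requireNonNull(textBreaker);
        }
    }

    public static void main(String[] args) {
        LayoutSettings settings = new LayoutSettings();
        String cjk = "\u6211\u4eec\u6b63\u5728\u4f7f\u7528";
        System.out.println("default breaker: " + settings.getTextBreaker().nextBreak(cjk, 0));
        settings.setTextBreaker(new IteratorBreaker(BreakIterator.getLineInstance()));
        System.out.println("custom breaker:  " + settings.getTextBreaker().nextBreak(cjk, 0));
    }
}
```

The point of the getter is that layout code never names a concrete breaker, so
swapping in an ICU4J-backed implementation would not touch the layout code.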

I've not fully digested the breakFirstLetter method.  I've implemented a
solution to our problem using icu4j.  I don't know if this particular package is
something anyone else wants.  I will look into implementing an alternative that
just uses Java's LineBreakMeasurer.  (icu4j is smarter: for example, it doesn't
wrap on '.' within abbreviations such as "e.g." or "i.e.", and it handles some
locale issues better.)

At first blush it appears that the LineBreakMeasurer breaks early; there may be
differences in how the TextLayouts measure the bounds.  But I haven't tested
enough to say this for sure.  Chinese is at least wrapping now.

At present I don't know how to submit my changes, as to date I've just been a
user.  I will look into how to submit/propose these changes.

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

peterbrant wrote on 03.01.2008 01:47:53 +0100:
Very cool.

If you attach your changes as a patch either to this bug or to a mailing list 
post, I'll apply them.  This is something that has bothered me for a while and 
it's great that you're looking at it.

Regarding breakFirstLetter(), it's used with the :first-letter pseudo-element.  
That list of character classes is taken from the spec and means that the first 
letter may actually be several characters (e.g. with "This is quoted", 
:first-letter matches the opening quote plus the T, i.e. "T).  (Actually, to be 
really pedantic, :first-letter could span multiple boxes, but we don't handle 
that.)
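
A minimal sketch of that rule, assuming the CSS 2.1 wording that punctuation in
the Unicode classes Ps, Pe, Pi, Pf and Po around the first letter belongs to
:first-letter.  This is an illustration only, not the actual breakFirstLetter()
code; the class and method names are invented, and it ignores leading
whitespace and multi-box cases.

```java
public class FirstLetterSketch {

    /** True for the Unicode punctuation classes Ps, Pe, Pi, Pf and Po. */
    static boolean isFirstLetterPunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** Advances past any run of first-letter punctuation starting at {@code from}. */
    private static int skipPunctuation(String text, int from) {
        int i = from;
        while (i < text.length() && isFirstLetterPunctuation(text.codePointAt(i))) {
            i += Character.charCount(text.codePointAt(i));
        }
        return i;
    }

    /**
     * End offset (exclusive) of the :first-letter span at the start of the text,
     * or 0 if the text does not start with optional punctuation plus a letter or digit.
     */
    static int firstLetterEnd(String text) {
        int i = skipPunctuation(text, 0);
        if (i >= text.length() || !Character.isLetterOrDigit(text.codePointAt(i))) {
            return 0;
        }
        i += Character.charCount(text.codePointAt(i)); // the letter itself
        return skipPunctuation(text, i);               // punctuation that follows it
    }

    public static void main(String[] args) {
        String quoted = "\"This is quoted\"";
        System.out.println(quoted.substring(0, firstLetterEnd(quoted)));
    }
}
```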

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

jrm0815 wrote on 04.01.2008 00:52:22 +0100:
Created an attachment (id=35)
Files modified to allow external injection of a different breaker 
implementation.  An example implementation of the Breaker that uses a 
LineBreakMeasurer.

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

jrm0815 wrote on 04.01.2008 01:00:01 +0100:
Here's a take on this.  I attached a possible solution.  I included an
implementation of the Breaker (a TextBreaker implementation, given the
refactoring to an interface).  This implementation also externalizes the
BreakIterator so that it would be easy to use with IBM's ICU4J component, the
Java version, or whichever you prefer.

The implementation is, I think, in one of my packages; please move it
wherever you like (or just remove it if you don't like it).  It's a little fugly
in its handling of the potential differences between iText rendering and AWT
rendering.  I don't use iText, so maybe it's too much?  I just wanted to account
for it somehow.  Basically, I am just using the AWT fonts to do the breaking and
making sure that reconciles with the renderer's view of things.

I left the first character breaker as implemented in the Breaker; in fact I just
delegated to it.  This could be a little smoother if the Breaker's logic were
more extensible (separating the pre/no-wrap logic from the breaking logic and
from the unbreakable logic).  But I'm not sure this should be extensible.

Thanks.

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

peterbrant wrote on 04.01.2008 19:54:08 +0100:
Thanks.  I'm traveling currently, but I'll take a look at this on Monday.

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

jrm0815 wrote on 22.01.2008 00:50:48 +0100:
What I did worked great for CJK, but since the break routine works box by box
and not line by line, my solution has a bug.  It does not correctly handle boxes
that should not be broken at the line's end (it breaks them anyway).  This is
resolvable by simply using the LineBreakMeasurer signature that requires the
next word in its entirety.  However, in testing this fix, I came upon other
issues.

When the line width is too small to break a box at all, trying to push off the
break to the next line forces an infinite loop.  To avoid this scenario, when
asked to break the same text a second time, we find the smallest break point
possible.  This is not ideal, as I am using a member variable to see if the
previous text matches the current text.  (That could be trouble, since the text
is not unique: repeated phrases could cause problems.)  It would be better if I
could determine the full available width.  As I see it, in laying out the inline
boxes we are anchoring things from the left.  I can find the left offset, but I
cannot determine the functional right limit.  In all cases I saw in debugging,
the line width was essentially the canvas width minus 2 * the X offset
(available from the BoxFormatContext).  I doubt this is generally valid.
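
One way to sidestep the infinite loop without remembering the previous text is
to guarantee that every pass consumes at least one break unit.  The sketch
below is a generic first-fit wrapper written under that assumption; it is not
the code in the attachment.  The measure function is a placeholder for real
font metrics, and the available width is simply passed in.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class FirstFitWrapSketch {

    /** Greedy first-fit wrapping that always makes progress, even on very narrow lines. */
    static List<String> wrap(String text, double availableWidth, ToDoubleFunction<String> measure) {
        BreakIterator breaks = BreakIterator.getLineInstance();
        breaks.setText(text);
        List<String> lines = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = start;
            // Extend the line one break opportunity at a time while it still fits.
            for (int next = breaks.following(start);
                 next != BreakIterator.DONE
                     && measure.applyAsDouble(text.substring(start, next)) <= availableWidth;
                 next = breaks.following(next)) {
                end = next;
            }
            if (end == start) {
                // Nothing fits: emit the smallest breakable unit anyway instead of retrying.
                int forced = breaks.following(start);
                end = forced == BreakIterator.DONE ? text.length() : forced;
            }
            lines.add(text.substring(start, end));
            start = end;
        }
        return lines;
    }

    public static void main(String[] args) {
        // Width is counted in characters here purely for demonstration.
        System.out.println(wrap("hello world foo", 6, String::length));
        System.out.println(wrap("hello world", 3, String::length));
    }
}
```

Because the forced break depends only on the current position, repeated
phrases in the text cannot confuse it.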

Is there a way I could determine the total available width for a line?

In testing I realized a couple of limitations with determining the wrapping for
a single box.  Basically, if a full lexical unit resides in two separate
boxes, there is no way to keep the unit together when wrapping.  For example,
XHTML containing the snippet "<u>The Title</u>." could break between the "e" and
the ".".  That's a little problematic, as the period is not part of the title
but also should not get orphaned.  I think this issue presents a bigger problem
for wrapping text correctly, and it's not readily apparent to me how this would
fit into the current scheme.

All told, I found it better to go with the whitespace breaking routine when
possible (this also has the upside of fewer impacts on existing code), and I am
only using the LineBreakMeasurer in cases where the whitespace break routine
fails (for CJK this is pretty much all the time).  I'm far from an expert on
CJK, which is why I chose to delegate to ICU4J.  I'm sure there are still some
issues with this approach, but at least it gets me OK results most of the time.
Hopefully, some real CJK experts can provide a better solution.  I'll keep it
in mind as I continue to learn through supporting CJK for my day job.
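
The "whitespace first, line breaker only as a fallback" policy can be sketched
as below.  This is an illustration with invented names, not the updated
attachment: it uses the JDK BreakIterator rather than a LineBreakMeasurer or
ICU4J, and it assumes the caller already knows how many characters fit on the
line.

```java
import java.text.BreakIterator;

public class WhitespaceFirstBreak {

    /**
     * Picks a break offset within the first {@code fit} characters of the text.
     * Prefers the last whitespace; falls back to BreakIterator only when there is none.
     * Returns -1 if the text cannot be broken inside that range.
     */
    static int chooseBreak(String text, int fit) {
        if (fit <= 0 || fit >= text.length()) {
            throw new IllegalArgumentException("fit must be between 1 and text.length() - 1");
        }
        // 1. Existing behavior: break just after the last whitespace that fits.
        for (int i = fit - 1; i >= 0; i--) {
            if (Character.isWhitespace(text.charAt(i))) {
                return i + 1;
            }
        }
        // 2. Fallback: last break opportunity at or before 'fit' (typical for CJK text).
        BreakIterator it = BreakIterator.getLineInstance();
        it.setText(text);
        int boundary = it.preceding(fit + 1);
        return boundary > 0 ? boundary : -1;
    }

    public static void main(String[] args) {
        System.out.println(chooseBreak("hello world", 8));
        System.out.println(chooseBreak("\u6211\u4eec\u6b63\u5728\u4f7f\u7528", 3));
        System.out.println(chooseBreak("helloworld", 4));
    }
}
```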

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

jrm0815 wrote on 22.01.2008 00:53:46 +0100:
Created an attachment (id=38)
Updated to use whitespace breaking when possible

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

peterbrant wrote on 22.01.2008 01:13:59 +0100:
We know the total available width of the line because we have to reset the 
available width to that when we start a new line.  If it would be useful to 
you, we can definitely sling it around as part of the LineBreakContext.

This whole "move to the next line if things don't fit" thing needs to go away 
anyway though (see XHTMLRENDERER-153, "Line breaking and floats [R9 deferred]", 
http://java.net/jira/browse/XHTMLRENDERER-153).

Our current handling of "<u>The Title</u>." is undeniably ugly.  We certainly 
should/could check the following box and wrap if the minimum break doesn't fit 
(taking into account the white-space property).  This would be around line 206 
in InlineBoxing.
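
One possible way to decide whether a line may end between two adjacent boxes is
to run a line BreakIterator over their joined text and test the seam.  The
sketch below is only an illustration of that idea with made-up names; it is not
InlineBoxing code, and it ignores the white-space property mentioned above.

```java
import java.text.BreakIterator;

public class BoxBoundarySketch {

    /** True if a line may end between the text of two adjacent inline boxes. */
    static boolean canBreakBetween(String previousBoxText, String nextBoxText) {
        String joined = previousBoxText + nextBoxText;
        BreakIterator it = BreakIterator.getLineInstance();
        it.setText(joined);
        // The seam between the boxes sits at the length of the first box's text.
        return it.isBoundary(previousBoxText.length());
    }

    public static void main(String[] args) {
        // "<u>The Title</u>." : the period must stay with the title.
        System.out.println(canBreakBetween("The Title", "."));
        // A trailing space before the next box leaves a normal break opportunity.
        System.out.println(canBreakBetween("The Title ", "Next"));
    }
}
```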

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

pdoubleya wrote on 20.07.2008 19:30:29 +0200:
Set target R9

Original comment by pdoubl...@gmail.com on 16 Feb 2011 at 9:55

sogwhite / flying-saucer

No line wrapping for - CJK #87