Open GoogleCodeExporter opened 9 years ago
Attachment by jrm0815 on 04.01.2008 00:52:22 +0100: Breaker.zip, size 19917
bytes
Download: http://java.net/jira/secure/attachment/27341/Breaker.zip
Original comment by pdoubl...@gmail.com
on 16 Feb 2011 at 9:55
Attachment by jrm0815 on 22.01.2008 00:53:41 +0100: Breaker_updated.zip, size
20109 bytes
Download: http://java.net/jira/secure/attachment/27342/Breaker_updated.zip
peterbrant wrote on 28.12.2007 15:44:11 +0100:
Sounds great to me. It should be straightforward to make the text breaking
algorithm more flexible (e.g. by making a Breaker implementation a property of
the SharedContext or something), assuming that the break algorithm is still
fundamentally a first-fit algorithm.
I know next to nothing about non-western scripts so I think we'd look to you
for the actual algorithm, but I can definitely help with any necessary
scaffolding, documentation, etc.
jrm0815 wrote on 03.01.2008 01:19:15 +0100:
It was pretty trivial to modify the Breaker to implement an interface, and then
add the ability to set an implementation of that interface on the shared
context, per Peter's suggestion. At that point a custom breaker such as ours
can be used by implementing the interface and setting it on the shared context.
Of course, the inline boxing needs to go through the shared context's getter
instead of using the Breaker statically. This may be enough, with some added
documentation or an example of how one might implement another solution.
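As a rough illustration of the refactoring described above, here is a minimal sketch. All names in it (TextBreaker, WhitespaceBreaker, SharedContextSketch, findBreak) are hypothetical stand-ins and not the project's actual classes or signatures, and width is approximated by a character count to keep the example self-contained.

```java
import java.util.Objects;

/** Hypothetical interface extracted from the static Breaker (name is an assumption). */
interface TextBreaker {
    /** Returns the offset at which to break text; maxChars stands in for the available width. */
    int findBreak(String text, int maxChars);
}

/** Default strategy: break after the last space that fits (a trailing space may hang). */
final class WhitespaceBreaker implements TextBreaker {
    @Override
    public int findBreak(String text, int maxChars) {
        if (text.length() <= maxChars) {
            return text.length();              // everything fits, no break needed
        }
        int space = text.lastIndexOf(' ', maxChars);
        // No usable space: treat the text as unbreakable and return it whole.
        return space >= 0 ? space + 1 : text.length();
    }
}

/** Stand-in for the shared context: holds the breaker so layout code can ask for it. */
final class SharedContextSketch {
    private TextBreaker textBreaker = new WhitespaceBreaker();

    TextBreaker getTextBreaker() {
        return textBreaker;
    }

    void setTextBreaker(TextBreaker textBreaker) {
        this.textBreaker = Objects.requireNonNull(textBreaker);
    }
}
```

Layout code would then call something like ctx.getTextBreaker().findBreak(...) rather than a static method, which is what lets a CJK-aware implementation be swapped in.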
I've not fully digested the breakFirstLetter method. I've implemented a
solution to our problem using icu4j. I don't know if this particular package is
something anyone else wants. I will look into implementing an alternative that
just uses Java's LineBreakMeasurer. (icu4j is smarter: for example, it doesn't
wrap on the '.' inside abbreviations such as "e.g." or "i.e.", and it handles
some locale issues better.) At first blush it appears that the LineBreakMeasurer
breaks early, possibly because of differences in how the TextLayouts measure
the bounds, but I haven't tested enough to say for sure. Chinese is at least
wrapping now.
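To make the difference from whitespace-only breaking concrete, the sketch below lists the break opportunities that the JDK's java.text.BreakIterator reports for a piece of text. The class and method names (BreakOpportunities, lineBreaks) are made up for illustration; ICU4J ships its own BreakIterator class with a similar getLineInstance factory, but it is not used here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakOpportunities {
    /** Collects every offset at which the JDK line-break iterator allows a break. */
    static List<Integer> lineBreaks(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        // first()/next() walk the boundaries in order; DONE marks the end.
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            offsets.add(b);
        }
        return offsets;
    }
}
```

For space-separated Latin text the interior boundaries fall after the spaces, while a run of Chinese characters with no spaces at all still yields interior boundaries, which is why a breaker built on a BreakIterator can wrap CJK text where a whitespace search finds nothing.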
At present I don't know how to submit my changes, as to date I've just been a
user. I will look into how to submit/propose them.
peterbrant wrote on 03.01.2008 01:47:53 +0100:
Very cool.
If you attach your changes as a patch either to this bug or to a mailing list
post, I'll apply them. This is something that has bothered me for a while and
it's great that you're looking at it.
Regarding breakFirstLetter(), it's used with the :first-letter pseudo-element.
That list of character classes is taken from the spec and means that the first
letter may actually be several characters (e.g. for text that begins with a
quotation mark, such as "This is quoted", :first-letter matches "T). (Actually,
to be really pedantic, :first-letter could span multiple boxes, but we don't
handle that.)
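A simplified sketch of that rule follows. It is not the project's breakFirstLetter(); the class and method names are invented, and it assumes the CSS 2.1 wording that punctuation in the Unicode classes Ps, Pe, Pi, Pf and Po adjacent to the first letter belongs to :first-letter. It ignores leading whitespace and the multiple-box case.

```java
final class FirstLetterSketch {
    /** True for the Unicode punctuation classes Ps, Pe, Pi, Pf and Po. */
    static boolean isPunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** Returns the end offset (exclusive) of the :first-letter span at the start of text. */
    static int firstLetterEnd(String text) {
        int i = 0;
        // Leading punctuation, e.g. an opening quotation mark.
        while (i < text.length() && isPunctuation(text.codePointAt(i))) {
            i += Character.charCount(text.codePointAt(i));
        }
        if (i >= text.length()) {
            return i;                          // the text was punctuation only
        }
        // The first character after the punctuation is taken as the letter.
        i += Character.charCount(text.codePointAt(i));
        // Punctuation directly following the letter is included as well.
        while (i < text.length() && isPunctuation(text.codePointAt(i))) {
            i += Character.charCount(text.codePointAt(i));
        }
        return i;
    }
}
```

With the quoted example from the comment, the span covers two characters (the quotation mark and the T); for unquoted text it covers only the first character.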
jrm0815 wrote on 04.01.2008 00:52:22 +0100:
Created an attachment (id=35)
Files modified to allow external injection of a different breaker
implementation, plus an example implementation of the Breaker that uses a
LineBreakMeasurer.
jrm0815 wrote on 04.01.2008 01:00:01 +0100:
Here's a take on this. I attached a possible solution. I included an
implementation of the Breaker (a TextBreaker implementation, given the
refactoring to an interface). This implementation also externalizes the
BreakIterator so that it would be easy to use with IBM's ICU4J component, the
Java version, or whichever.
The implementation is, I think, in one of my packages; please move it
wherever you like (or just remove it if you don't like it). It's a little fugly
in its handling of the potential differences between iText rendering and AWT
rendering. I don't use iText, so maybe it's too much? I just wanted to account
for it somehow. Basically, I am just using the AWT fonts to do the breaking and
making sure the result reconciles with the renderer's view of things.
I left the first character breaker as implemented in the Breaker; in fact I just
delegated to it. This could be a little smoother if the Breaker's logic were
more extensible (separating out the pre/no-wrap logic from the breaking logic
from the unbreakable logic). But I'm not sure this should be extensible.
Thanks.
peterbrant wrote on 04.01.2008 19:54:08 +0100:
Thanks. I'm traveling currently, but I'll take a look at this on Monday.
jrm0815 wrote on 22.01.2008 00:50:48 +0100:
What I did worked great for CJK, but since the break routine works box by box
and not line by line, my solution has a bug. It does not correctly handle boxes
that should not be broken at the line's end (it breaks them anyway). This is
resolvable by simply using the LineBreakMeasurer signature that requires the
next word to fit in its entirety. However, in testing this fix, I came upon
other issues.
When the line width is too small to break a box at all, trying to push the
break off to the next line forces an infinite loop. To avoid this scenario, when
asked to break the same text a second time, we find the smallest break point
possible. This is not ideal, as I am using a member variable to see if the
previous text matches the current text. (That could be trouble, since repeated
phrases in the text could be mistaken for a retry.) It would be better if I
could determine the full available width. As I see it, in laying out the inline
boxes we are anchoring things from the left. I can find the left offset, but I
cannot determine the functional right limit. In all the cases I saw in
debugging, the line width was essentially the canvas width minus 2 times the X
offset (available from the BoxFormatContext). I doubt this is generally valid.
Is there a way I could determine the total available width for a line?
In testing I realized a couple of limitations with determining the wrapping for
a single box. Basically, if a full lexical unit resides in two separate boxes,
there is no way to keep the unit together when wrapping. For example, if the
XHTML contains the snippet "<u>The Title</u>.", the line could break between
the "e" and the ".". That's a little problematic, as the period is not part of
the title but also should not get orphaned. I think this issue presents a bigger
problem for wrapping text correctly, and it's not readily apparent to me how
this would fit into the current scheme.
All told, I found it better to go with the whitespace breaking routine when
possible (this also has the upside of fewer impacts on existing stuff), and I am
only using the LineBreakMeasurer in cases where the whitespace break routine
fails (for CJK this is pretty much all the time). I'm far from an expert on
CJK, which is why I chose to delegate to ICU4J. I'm sure there are still some
issues with this approach, but at least it gets me OK results most of the time.
Hopefully, some real CJK experts can provide a better solution. I'll keep it
in mind as I continue to learn through supporting CJK for my day job.
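The "whitespace first, break iterator as a fallback" strategy described above could look roughly like the sketch below. It is not the attached implementation: the names are hypothetical, it uses the JDK's java.text.BreakIterator rather than ICU4J or a LineBreakMeasurer, and it measures width as a character count instead of using font metrics.

```java
import java.text.BreakIterator;

final class FallbackBreakerSketch {
    /**
     * Returns the offset at which to break text so that the first part fits in
     * maxChars (a stand-in for width), or -1 if no break opportunity fits.
     */
    static int findBreak(String text, int maxChars) {
        if (maxChars <= 0) {
            return -1;
        }
        if (text.length() <= maxChars) {
            return text.length();              // no break needed
        }
        // Preferred path: the last space that fits (a trailing space may hang).
        int space = text.lastIndexOf(' ', maxChars);
        if (space >= 0) {
            return space + 1;
        }
        // Fallback path: ask the line-break iterator, which knows about CJK.
        BreakIterator it = BreakIterator.getLineInstance();
        it.setText(text);
        int b = it.isBoundary(maxChars) ? maxChars : it.preceding(maxChars);
        return (b == BreakIterator.DONE || b <= 0) ? -1 : b;
    }
}
```

Keeping the whitespace path first means existing Latin-script documents take the same route as before, and only text with no usable space reaches the iterator.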
jrm0815 wrote on 22.01.2008 00:53:46 +0100:
Created an attachment (id=38)
Updated to use whitespace breaking when possible
peterbrant wrote on 22.01.2008 01:13:59 +0100:
We know the total available width of the line because we have to reset the
available width to that when we start a new line. If it would be useful to
you, we can definitely sling it around as part of the LineBreakContext.
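A sketch of why knowing the total width helps with the infinite loop mentioned earlier in the thread: if the layout loop can tell that it is already at the start of a fresh line, it can force an oversized unit onto that line instead of deferring it again. Everything here is hypothetical (the real LineBreakContext is not shown in this thread), units are pre-split unbreakable strings, and width is a character count.

```java
import java.util.ArrayList;
import java.util.List;

final class ProgressGuardSketch {
    /** First-fit layout of unbreakable units into lines of totalWidth characters. */
    static List<String> layout(List<String> units, int totalWidth) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int remaining = totalWidth;            // reset to totalWidth on every new line
        int i = 0;
        while (i < units.size()) {
            String unit = units.get(i);
            boolean atLineStart = remaining == totalWidth;
            if (unit.length() <= remaining || atLineStart) {
                // Either it fits, or the line is empty and deferring cannot help:
                // place it anyway (it overflows) so the loop always makes progress.
                current.append(unit);
                remaining -= unit.length();
                i++;
            } else {
                // Does not fit on a partly filled line: start a new line and retry.
                lines.add(current.toString());
                current.setLength(0);
                remaining = totalWidth;
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

Comparing the remaining width with the total replaces the member variable that remembered the previous text, so repeated phrases no longer matter.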
This whole "move to the next line if things don't fit" thing needs to go away
anyway though (see XHTMLRENDERER-153, "Line breaking and floats [R9 deferred]",
http://java.net/jira/browse/XHTMLRENDERER-153).
Our current handling of "<u>The Title</u>." is undeniably ugly. We certainly
should/could check the following box and wrap if the minimum break doesn't fit
(taking into account the white-space property). This would be around line 206
in InlineBoxing.
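One way to picture the "check the following box" idea is to ask whether a break is allowed at the seam between two adjacent boxes' text. The sketch below does this with the JDK's java.text.BreakIterator on the concatenated text. The class and method names are invented, it says nothing about what InlineBoxing actually does, and it ignores the white-space property.

```java
import java.text.BreakIterator;

final class BoxBoundarySketch {
    /** True if a line break is permitted between the text of two adjacent inline boxes. */
    static boolean canBreakBetween(String current, String following) {
        if (current.isEmpty() || following.isEmpty()) {
            return true;                       // nothing to keep together
        }
        BreakIterator it = BreakIterator.getLineInstance();
        // Look at both boxes as one run so the seam is judged in context.
        it.setText(current + following);
        return it.isBoundary(current.length());
    }
}
```

For the example from the thread, the seam between "The Title" and "." is not a break opportunity, so the period would stay with the title; a box ending in a space followed by an ordinary word would allow the break.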
pdoubleya wrote on 20.07.2008 19:30:29 +0200:
Set target R9
Original issue reported on code.google.com by
pdoubl...@gmail.com
on 16 Feb 2011 at 9:55