snaekobbi / pipeline-mod-braille

ARCHIVED. Please don't make any new issues or pull requests in this repo.

Support non-standard hyphenation #55

Closed bertfrees closed 8 years ago

bertfrees commented 8 years ago

e.g. busstopp => buss-stopp

Because adding soft hyphens at all possible break points is not appropriate in this situation, there needs to be a way to defer hyphenation and translation to the formatting phase, i.e. when the text is actually broken into lines. We'll call this "inline translation" as opposed to "pre-translation". There also needs to be a way to detect, during a pre-hyphenation attempt, whether a word is hyphenated in a non-standard way, so that the hyphenation/translation can be deferred.

Non-standard hyphenation is supported in Libhyphen, but not yet in the Java bindings (jhyphen). One way of achieving this would be to have a function String[] hyphenate(String text, int lineLength) that returns two strings representing the first line and the remaining text. The same function should be added to the org.daisy.pipeline.braille.common.Hyphenator interface.
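To make the proposed signature a bit more concrete, here is a minimal sketch of what String[] hyphenate(String text, int lineLength) could look like on a mock hyphenator. The class name MockNonStandardHyphenator and its one-entry rule table are hypothetical and only cover the busstopp example. It does not call Libhyphen or jhyphen, and it assumes words are separated by plain whitespace.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

final class MockNonStandardHyphenator {

    // Hypothetical rule table: word -> { text before the break (hyphen included), text after it }.
    private static final Map<String, String[]> RULES =
        Collections.singletonMap("busstopp", new String[] {"buss-", "stopp"});

    /** Returns { first line, remaining text }; the first line is at most lineLength characters. */
    static String[] hyphenate(String text, int lineLength) {
        String[] words = text.trim().split("\\s+");
        StringBuilder line = new StringBuilder();
        int i = 0;
        while (i < words.length) {
            int separator = line.length() > 0 ? 1 : 0;
            int room = lineLength - line.length() - separator;
            if (words[i].length() <= room) {
                // The whole word fits on the line.
                if (separator == 1) line.append(' ');
                line.append(words[i]);
                i++;
                continue;
            }
            // The word does not fit: break it only if a rule exists and its first part fits.
            String[] parts = RULES.get(words[i]);
            if (parts != null && parts[0].length() <= room) {
                if (separator == 1) line.append(' ');
                line.append(parts[0]);
                words[i] = parts[1]; // the rest of the word starts the remaining text
            }
            break;
        }
        String rest = String.join(" ", Arrays.copyOfRange(words, i, words.length));
        return new String[] {line.toString(), rest};
    }
}
```

With a line length of 9, "ved busstopp" splits into "ved buss-" and "stopp". With a line length of 12 the word is left untouched. The extra "s" only appears when the break is actually taken, which is what a soft hyphen cannot express.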

org.daisy.pipeline.braille.common.TextTransform should probably also get a similar function.

Liblouis translators could perform non-standard hyphenation by somehow adapting/optimizing the specified max line lengths before translation based on the resulting line lengths after hyphenation and translation.

Depends on:

bertfrees commented 8 years ago

@usama49 hi!

I realize now that Libhyphen will not work in your development environment (Windows) so it's a bit hard for you to implement actual non-standard hyphenation rules. But let's start with preparing the system so that it can even handle non-standard hyphenators because it requires quite a different approach from how it works now. The actual hyphenation rules can be implemented later, for now we'll just use a simple mock hyphenator for development, which we will then later replace with something Libhyphen based or something else.

So basically the idea is to extend the Hyphenator interface so that it can be used more generally.

Currently the Hyphenator interface looks like this: you pass it a list of Strings which represent input text segments, and it returns a list of Strings which is almost identical apart from the hyphenation information. The segmentation must be preserved, as well as all regular characters. The only thing the hyphenator will do is insert invisible format characters, notably soft hyphens and zero width spaces.
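As an illustration of that contract, here is a small sketch of a hyphenator that only inserts soft hyphens and keeps the segmentation intact. It is not the actual Pipeline code: the class name PreHyphenationSketch, the method shape and the one-word rule table are assumptions made for the example, and the word matching is a naive substring search.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class PreHyphenationSketch {

    static final char SHY = '\u00AD'; // soft hyphen

    // Hypothetical rule table: word -> offsets inside the word where a break is allowed.
    private static final Map<String, int[]> RULES =
        Collections.singletonMap("hyphenation", new int[] {2, 6}); // hy|phen|ation

    static List<String> hyphenate(List<String> segments) {
        // Words may span segment boundaries, so look for them in the joined text.
        String joined = String.join("", segments);
        boolean[] breakBefore = new boolean[joined.length() + 1];
        for (Map.Entry<String, int[]> rule : RULES.entrySet()) {
            int at = joined.indexOf(rule.getKey());
            while (at >= 0) {
                for (int offset : rule.getValue()) breakBefore[at + offset] = true;
                at = joined.indexOf(rule.getKey(), at + rule.getKey().length());
            }
        }
        // Rebuild every segment separately so the segmentation is preserved.
        List<String> result = new ArrayList<>();
        int position = 0;
        for (String segment : segments) {
            StringBuilder out = new StringBuilder();
            for (int k = 0; k < segment.length(); k++, position++) {
                if (breakBefore[position]) out.append(SHY);
                out.append(segment.charAt(k));
            }
            result.add(out.toString());
        }
        return result;
    }
}
```

Passing the segments "hyphen" and "ation rules" gives back two segments again, with a soft hyphen after "hy" and one at the start of the second segment. Removing the soft hyphens yields the input unchanged, which is the invariant this interface relies on.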

It is clear that this interface can only be used for a subset of all hyphenators, namely the hyphenators that don't do transformations when breaking words and for which each hyphenation opportunity is independent of the others. In other words: it does not support "non-standard" hyphenation.

I've given this some thought, and my proposal is to generalize our LineBreakingFromStyledText interface from BrailleTranslators to Hyphenators. The interface would have one function hyphenate which would return a LineIterator object:

interface Hyphenator {
    LineIterator hyphenate(Iterable<CSSStyledText> input);
}

input is an Iterable and not a simple String because we want to preserve text segments. The LineIterator object would be a kind of buffer that gives you the lines one by one:

interface LineIterator {
    Iterable<CSSStyledText> nextLine(int limit, boolean force);
    void mark();
    void reset();
}

limit is the maximum number of characters that nextLine may return. The purpose of mark and reset is that you can discard a line and recompute it with a different limit. This is needed because you don't know the actual limit in advance. You will know the available space on a row in terms of braille cells, but that number doesn't simply correspond with the argument you need to pass to the LineIterator because the resulting lines still need to undergo a transformation to braille. The braille translator will use some optimization algorithm to find the best line break by repeatedly calling nextLine and retranslating the result to braille.
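Here is a rough sketch of how a translator might drive such a LineIterator. Several things are assumptions made to keep it self-contained: plain Strings stand in for CSSStyledText, hasNext is an extra helper that is not part of the proposed interface, force is taken to mean "cut inside a word if nothing else fits", and translate is a toy that puts a marker cell before every capital letter. The search simply shrinks the limit one character at a time, which is a naive stand-in for a real optimization algorithm.

```java
import java.util.ArrayList;
import java.util.List;

final class InlineTranslationSketch {

    /** Simplified stand-in for the proposed LineIterator (String instead of CSSStyledText). */
    static final class MockLineIterator {
        private String remaining, marked;
        MockLineIterator(String text) { remaining = marked = text.trim().replaceAll("\\s+", " "); }
        boolean hasNext() { return !remaining.isEmpty(); } // helper, not in the proposal
        void mark() { marked = remaining; }
        void reset() { remaining = marked; }

        String nextLine(int limit, boolean force) {
            StringBuilder line = new StringBuilder();
            while (!remaining.isEmpty()) {
                int space = remaining.indexOf(' ');
                String word = space < 0 ? remaining : remaining.substring(0, space);
                String after = space < 0 ? "" : remaining.substring(space + 1);
                int room = limit - line.length() - (line.length() > 0 ? 1 : 0);
                String head = null, tail = null;
                if (word.length() <= room) {
                    head = word; tail = "";
                } else if (word.equals("busstopp") && room >= 5) {
                    head = "buss-"; tail = "stopp"; // the only (mock) non-standard rule
                } else if (force && line.length() == 0 && room > 0) {
                    head = word.substring(0, room); tail = word.substring(room);
                }
                if (head == null) break; // nothing more fits on this line
                if (line.length() > 0) line.append(' ');
                line.append(head);
                remaining = tail.isEmpty() ? after : (after.isEmpty() ? tail : tail + " " + after);
                if (!tail.isEmpty()) break; // the word was broken, so the line ends here
            }
            return line.toString();
        }
    }

    /** Toy translation: a marker cell (U+2820) before each capital, so lengths change. */
    static String translate(String text) {
        StringBuilder braille = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isUpperCase(c)) braille.append('\u2820');
            braille.append(Character.toLowerCase(c));
        }
        return braille.toString();
    }

    /** Tries limits from width downwards until the translated line fits in width cells. */
    static String tryFit(MockLineIterator lines, int width, boolean force) {
        for (int limit = width; limit > 0; limit--) {
            lines.mark();
            String candidate = translate(lines.nextLine(limit, force));
            if (!candidate.isEmpty() && candidate.length() <= width) return candidate;
            lines.reset(); // discard the line and recompute it with a smaller limit
        }
        return null;
    }

    static List<String> layout(String text, int width) {
        MockLineIterator lines = new MockLineIterator(text);
        List<String> rows = new ArrayList<>();
        while (lines.hasNext()) {
            String row = tryFit(lines, width, false);
            if (row == null) row = tryFit(lines, width, true);
            if (row == null) row = translate(lines.nextLine(1, true)); // give up and overflow
            rows.add(row);
        }
        return rows;
    }
}
```

For "Ved busstopp i Oslo" and 10 cells, the first row becomes the translation of "Ved buss-", which is exactly 10 cells once the marker cell is counted. With 9 cells that candidate is one cell too long, so it is discarded via reset and the row ends after "Ved" instead.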

An important consequence of this is that the DefaultLineBreaker class, which is intended to be reused by BrailleTranslator implementations, needs to be extended. The way it currently works is that first all possible break opportunities are inserted in the input text as soft hyphens and zero width spaces. In a next step the text is fully translated, and only then is it passed to the DefaultLineBreaker, which does the white space processing and the actual line breaking. This of course doesn't work with non-standard hyphenation. Somehow everything needs to be done simultaneously, which makes things much more complicated.
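For comparison, the current three-step flow can be sketched roughly like this. This is not the real DefaultLineBreaker: the class name PreTranslationSketch, the one-word hyphenation rule and the upper-casing "translation" are stand-ins chosen for the example, and the input is assumed to use single spaces.

```java
import java.util.ArrayList;
import java.util.List;

final class PreTranslationSketch {

    static final char SHY = '\u00AD';  // soft hyphen
    static final char ZWSP = '\u200B'; // zero width space

    // Step 1 (toy): mark all break opportunities up front.
    static String preHyphenate(String text) {
        return text.replace("hyphenation", "hy" + SHY + "phen" + SHY + "ation");
    }

    // Step 2 (toy): translate the whole text; the invisible markers pass through unchanged.
    static String translate(String text) {
        return text.toUpperCase();
    }

    // Step 3: break the translated text, but only at spaces, SHY and ZWSP.
    static List<String> breakLines(String text, int width) {
        // Split into units, remembering which break opportunity follows each unit.
        List<String> units = new ArrayList<>();
        List<Character> breaks = new ArrayList<>();
        StringBuilder unit = new StringBuilder();
        for (char c : (text + " ").toCharArray()) {
            if (c == ' ' || c == SHY || c == ZWSP) {
                units.add(unit.toString());
                breaks.add(c);
                unit.setLength(0);
            } else {
                unit.append(c);
            }
        }
        List<String> rows = new ArrayList<>();
        int i = 0;
        while (i < units.size()) {
            StringBuilder row = new StringBuilder();
            String best = null;
            int bestEnd = i + 1;
            // Find the longest run of units that fits, counting a hyphen when ending at a SHY.
            for (int j = i; j < units.size(); j++) {
                row.append(units.get(j));
                if (row.length() > width) break;
                String candidate = breaks.get(j) == SHY ? row + "-" : row.toString();
                if (candidate.length() <= width) { best = candidate; bestEnd = j + 1; }
                if (breaks.get(j) == ' ') row.append(' ');
            }
            if (best == null) { // a single unit is too long: let it overflow
                best = units.get(i) + (breaks.get(i) == SHY ? "-" : "");
            }
            rows.add(best);
            i = bestEnd;
        }
        return rows;
    }
}
```

Running the three steps on "hyphenation is hard" with a width of 8 gives the rows "HYPHEN-", "ATION IS" and "HARD". The line breaker can only add a hyphen at a marked position. It has no way to turn busstopp into buss-stopp, because the information needed to insert the extra letter is gone by the time it runs.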

So Ammar what I had in mind for you is that you write a proof-of-concept BrailleTranslator that uses my newly proposed Hyphenator interface to perform the hyphenation. I will set up some unit tests you can use as guidance.

But first I should ask: what do you think of my approach? Does it make sense? Does it have any flaws, or do you have some other ideas?

Also before we continue I'd like to look at what Joel has already done w.r.t. non-standard hyphenation in Dotify. @joeha480 Could you give me a pointer? I've looked but can't find it.

usama49 commented 8 years ago

@bertfrees Hi, I agree with your approach. Please add some tests.

bertfrees commented 8 years ago

OK I've added a little test. Note that it is only intended to clarify the chosen approach, so you know where to start. I can add some more tests if things need to be clarified more.

bertfrees commented 8 years ago

Fixed in https://github.com/daisy/pipeline-mod-braille/commit/b33a9cc3cb3ad2f271b8d54690921ac8404c1096, https://github.com/daisy/pipeline-mod-braille/commit/07d73d9dec026ce6a7b3e9bb730959e4a67f7dc4, https://github.com/daisy/pipeline-mod-braille/commit/1d3ee56f1a1c6aa5a2e12e518858edcfcd169931