For e-mails and URLs, use mathematical hyphenation when splitting over two lines

josteinaj commented 7 years ago

From @josteinaj on September 11, 2015 7:25

Also, the normal hyphenation rules should not be applied as those might insert additional characters.

Copied from original issue: nlbdev/pipeline-mod-nlb#5

josteinaj commented 7 years ago

From @bertfrees on September 11, 2015 7:39

Does that mean a different hyphenation character, and only insert it if the words are too long to fit on a line?

josteinaj commented 7 years ago

From @bertfrees on September 11, 2015 7:57

OK. I will need to implement the hyphenate-character property. And maybe also make a new value for hyphens that means "none + insert a hyphenate-character when a word is split because it's too long to fit on a line".
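
A minimal sketch of what that new value could do, assuming a fixed line width and a word that has already been translated. The class and method names (`ForcedBreakSketch`, `breakWord`) are hypothetical and not part of Dotify or mod-braille; the only point is that the hyphenate character appears at forced breaks and nowhere else.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not an existing pipeline class.
class ForcedBreakSketch {

    // Splits a word only when it does not fit on a line. Every line except
    // the last gets the hyphenate character appended, so one cell per line
    // is reserved for it.
    static List<String> breakWord(String word, int width, char hyphenateChar) {
        if (width < 2) {
            throw new IllegalArgumentException("width must be at least 2");
        }
        List<String> lines = new ArrayList<>();
        String rest = word;
        while (rest.length() > width) {
            lines.add(rest.substring(0, width - 1) + hyphenateChar);
            rest = rest.substring(width - 1);
        }
        lines.add(rest);
        return lines;
    }

    public static void main(String[] args) {
        // '\u2820' is the Unicode braille pattern for dot 6, assumed here
        // as the hyphenate character.
        for (String line : breakWord("www.nlb.no/liblouis", 8, '\u2820')) {
            System.out.println(line);
        }
    }
}
```

A word that fits on the line comes back unchanged, which is the "none" part of the behaviour: no hyphenation points are ever suggested, only forced breaks are marked.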

This is another reason to handle e-mails and URLs in a higher level translator (see also https://github.com/snaekobbi/pipeline-mod-nlb/issues/4). It should probably be done in XSLT because styles need to be added (that are processed later in the process, during layout). The alternative is to implement the DotifyTranslator interface and use it for translation while formatting.

josteinaj commented 7 years ago

> Does that mean a different hyphenation character, and only insert it if the words are too long to fit on a line?

Yes.

josteinaj commented 7 years ago

From @bertfrees on October 5, 2015 19:15

Technically, deferring translation of URLs to the formatting phase comes down more or less to the same thing as the hyphenate-character option because we would need to tell Dotify with a style element that a certain text segment is a URL. That is, assuming we don't want to do the URL recognition step twice. For the hyphenate-character option either we need a new dedicated Dotify attribute, or we could use the style element for it as well.

Handling this in XSLT would mean we would also have to move the URL recognition etc. to XSLT, but that had better stay in Java IMO. Therefore we should probably use the forthcoming pf:transform Saxon function that works on trees instead of string sequences.

If it's OK to do the URL recognition step twice, we can just defer and not worry about any of this. Note however that when deferring, the quality of the translation could suffer because of loss of context.

josteinaj commented 7 years ago

Is the context the surrounding text, where in the document the URL occurs, what CSS is applied etc? I don't think the translation would change depending on the context.

@KariRudjord: We need to discuss whether or not mathematical hyphenation is appropriate for URLs. Mathematical hyphenation is dot 6, the same as the upper-case indicator, so it assumes that all URLs are case-insensitive, which they usually are on the web but not necessarily.

josteinaj commented 7 years ago

From @bertfrees on October 6, 2015 9:58

> Is the context the surrounding text, where in the document the URL occurs, what CSS is applied etc? I don't think the translation would change depending on the context.

Yes. I meant it more in general. It could be that Norwegian braille translation is less context dependent.

In some other braille codes you have rules like: depending on the length (in words) of a bold passage it is indicated differently. If in that case you are going to defer or isolate the translation of certain words you get into trouble quickly.

josteinaj commented 7 years ago

Right, I see. Let's assume that Norwegian URLs are context-independent.

@KariRudjord: please correct me if I'm wrong.

josteinaj commented 7 years ago

From @KariRudjord on October 6, 2015 10:44

Yes, the URLs are context-independent.

josteinaj commented 7 years ago

I don't know the status here, should we test this?

josteinaj commented 7 years ago

From @KariRudjord on March 15, 2016 12:3

I think this is too small a thing to test now. There are lots of smaller things that could be tested, but it would take up a lot of time.

josteinaj commented 7 years ago

From @bertfrees on March 15, 2016 12:7

@josteinaj status = to do

josteinaj commented 7 years ago

@KariRudjord Ok, that's fine.

@bertfrees Ok, thanks.

josteinaj commented 7 years ago

From @bertfrees on March 15, 2016 12:14

There's a couple of things that need to happen in mod-braille before you can implement it in mod-nlb. I was planning to do the things in mod-braille this week.

josteinaj commented 7 years ago

Ok, no worries, I was just wondering about the status since there have been no updates on this since October :)

josteinaj commented 7 years ago

From @bertfrees on June 10, 2016 11:55

I'll start by explaining how this can be achieved. It should be relatively easy now after the big change I made to support non-standard hyphenation.

1. Detect in the FromStyledTextToBraille interface of NLBTranslator whether text contains a URL and if so, return the untranslated text. The same principle is used in LiblouisTranslatorJnaImplProvider.java in order to defer translation of non-standard hyphenated words. The untranslated text is already handled fine in block-translate.xsl.
2. In the LineBreakingFromStyledText interface of NLBTranslator, which is invoked to translate untranslated text during the formatting phase, URLs will be detected a second time. This time they can be handled. Instead of using the grade0Translator, you'll have a third sub-translator that you'll use only for URLs. The query for this sub-translator will look something like this: (liblouis-table:'http://www.nlb.no/liblouis/no-no-g0.utb')(hyphenator:none)(hyphenate-character:'x'). Note the new "hyphenate-character" feature and also that "hyphenator" is "none" so that URLs will only be broken if they are too long for the line.
3. Lastly the "hyphenate-character" feature must be implemented in LiblouisTranslatorJnaImplProvider. The feature will be parsed at #L172 and turned into a parameter that should eventually be passed on to DefaultLineBreaker at LiblouisTranslatorJnaImplProvider#L343.
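
The first two steps above could be sketched roughly like this. It is only an illustration of the control flow: `UrlDeferralSketch`, `SubTranslator`, `preTranslate` and `translateWhileFormatting` are hypothetical names, not the real NLBTranslator interfaces, and the regular expression is a deliberately loose assumption rather than the actual URL recognition.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the two-pass idea, not pipeline code.
class UrlDeferralSketch {

    // Assumed, simplified pattern for URLs and e-mail addresses.
    static final Pattern URL_OR_EMAIL = Pattern.compile(
        "(?i)(?:https?://|www\\.)\\S+|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+");

    // Stand-in for a sub-translator (main, grade 0 or URL-specific).
    interface SubTranslator {
        String translate(String text);
    }

    static boolean containsUrl(String text) {
        return URL_OR_EMAIL.matcher(text).find();
    }

    // Pass 1 (pre-translation): text containing a URL is returned
    // untranslated so that it can be dealt with during formatting.
    static String preTranslate(String text, SubTranslator main) {
        return containsUrl(text) ? text : main.translate(text);
    }

    // Pass 2 (formatting phase): the URL is detected a second time and
    // routed to the dedicated URL sub-translator instead of grade 0.
    static String translateWhileFormatting(String text,
                                           SubTranslator grade0,
                                           SubTranslator url) {
        return containsUrl(text) ? url.translate(text) : grade0.translate(text);
    }

    public static void main(String[] args) {
        SubTranslator main = t -> "[main]" + t;
        SubTranslator grade0 = t -> "[g0]" + t;
        SubTranslator url = t -> "[url]" + t;
        String deferred = preTranslate("http://www.nlb.no/", main);
        System.out.println(translateWhileFormatting(deferred, grade0, url));
    }
}
```

The cost of this design is the one mentioned earlier in the thread: the URL recognition runs twice, once in each pass.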
