voikko / corevoikko

Libvoikko and essential linguistic resources

Configurable hyphenation character or string (JavaScript) #35

Closed Ciantic closed 6 years ago

Ciantic commented 6 years ago

I would like to set the string &shy; (soft hyphen) as the hyphenation character. Right now the hyphenate function inserts a regular "-", which is not the wanted character when rendering HTML, for example.

&shy; is supported by most browsers, so this is probably needed mostly on the JS side of the library.

hatapitk commented 6 years ago

Thanks for your good suggestion! I will add this feature to the Python, Java and JavaScript APIs within the next week. (It would be good to have in the .NET API as well, but that one is already out of date in other significant ways.)

hatapitk commented 6 years ago

Actually, this will not be enough for proper hyphenation on web pages. It would cause the word "vaa'an" to be replaced with "vaa&shy;an", which in turn is shown as "vaaan" when there is no line break at that point. This is not what most people would expect. We will need an additional boolean parameter to specify whether hyphenation points that lead to context changes should be included or excluded. At this point it is probably best to move the logic to the libvoikko core. In fact, this complication was one of the reasons why it was not there in the first place. Client code (such as LibreOffice) has a hyphenation API that is better served by what is now getHyphenationPattern in our JavaScript API.
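The effect of such a boolean parameter can be sketched without the library by applying a hand-written pattern string. This is only an illustration: the function name applyPattern is hypothetical, and the pattern format is an assumption (" " for no break, "-" for a break before the character, "=" for a break where the character itself is replaced by the hyphen). The real core implementation may treat some cases differently.

```javascript
// Hypothetical helper, not part of libvoikko.
// Assumed pattern format: one mark per character of the word:
//   " " no hyphenation point
//   "-" hyphenation point before this character
//   "=" hyphenation point where this character is replaced by the hyphen
function applyPattern(word, pattern, separator, allowContextChanges) {
    let result = "";
    for (let i = 0; i < word.length; i++) {
        const mark = pattern[i];
        if (mark === "-") {
            // Plain hyphenation point: keep the character, add the separator before it.
            result += separator + word[i];
        } else if (mark === "=" && allowContextChanges) {
            // Context-changing point: the character is dropped in favor of the separator.
            result += separator;
        } else {
            // No point here, or a context-changing point that is excluded.
            result += word[i];
        }
    }
    return result;
}

const SOFT_HYPHEN = "\u00AD";

// Patterns below are typed by hand for illustration, not produced by Voikko.
const safe = applyPattern("vaa'an", "   =  ", SOFT_HYPHEN, false);
const lossy = applyPattern("vaa'an", "   =  ", SOFT_HYPHEN, true);
console.log(safe === "vaa'an");        // the apostrophe is kept
console.log(lossy === "vaa\u00ADan");  // the apostrophe is lost
```

Excluding context-changing points leaves "vaa'an" without any break opportunity, which is the safer choice for HTML because the unbroken word still renders correctly.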

Ciantic commented 6 years ago

I actually ended up using getHyphenationPattern like this:

// v is an already initialized Voikko instance.
const w = "testi";
const pattern = v.getHyphenationPattern(w);
let newWord = "";
for (let j = 0; j < pattern.length; j++) {
    // "-" in the pattern marks a hyphenation point before character j.
    if (pattern[j] === "-") {
        newWord += "\u00AD"; // soft hyphen (&shy;)
    }
    newWord += w[j];
}
console.log(newWord);

It works for now.

Note that "\u00AD" is the soft hyphen (shy) character. Typed literally between the quotes, it is simply not visible on GitHub.
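Because the soft hyphen is invisible in most places, it can help to swap it for a visible marker when checking results. A minimal sketch follows; the helper name showSoftHyphens and the "|" marker are arbitrary choices for illustration.

```javascript
// Hypothetical debugging helper: make soft hyphens (U+00AD) visible.
function showSoftHyphens(text, marker = "|") {
    return text.replace(/\u00AD/g, marker);
}

const hyphenated = "tes\u00ADti";
console.log(showSoftHyphens(hyphenated)); // prints "tes|ti"
console.log(hyphenated.length);           // prints 6, the soft hyphen counts as a character
```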

I'm not sure whether my approach works with the word "vaa'an", though. I think it's rather rare. If I understood correctly, it is hyphenated as "vaa-an", but when it is not broken it has the extra character: "vaa'an". That logic is not doable with &shy; in HTML, where the hyphenated form would come out as "vaa'-an", but it probably does not matter.

hatapitk commented 6 years ago

@Ciantic your implementation that uses getHyphenationPattern seems correct to me. With the latest changes you can do the same with the following, shorter piece of code:

let w = "testi";
let newWord = v.hyphenate(w, "\u00AD", false); // "\u00AD" is the soft hyphen (&shy;)
console.log(newWord);

Works with C, C++, JavaScript, Java and Python.
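For a whole paragraph rather than a single word, the per-word call can be wrapped in a small helper. The sketch below is an assumption-laden illustration: hyphenateText is a hypothetical name, the word-matching regular expression is a simplification, and the hyphenator is passed in as a callback so that the example runs without Voikko. In real use the callback would be something like (w) => v.hyphenate(w, "\u00AD", false).

```javascript
// Hypothetical helper: apply a per-word hyphenator to every word in a text.
// A "word" here is a run of letters, optionally with internal apostrophes
// (a simplification that keeps "vaa'an" together as one word).
function hyphenateText(text, hyphenateWord) {
    return text.replace(/\p{L}+(?:'\p{L}+)*/gu, (word) => hyphenateWord(word));
}

// Stand-in hyphenator for demonstration only: it inserts a soft hyphen
// after the third character of longer words. It is NOT linguistically correct.
const fakeHyphenate = (w) =>
    w.length > 3 ? w.slice(0, 3) + "\u00AD" + w.slice(3) : w;

const out = hyphenateText("testi on testi.", fakeHyphenate);
console.log(out.replace(/\u00AD/g, "|")); // prints "tes|ti on tes|ti."
```

Spaces and punctuation are left untouched because only the matched words are passed to the callback.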