Make character spacing with hyphenation possible natively

By manually inserting optional (soft) hyphens (\u00AD) into the USFM text with a pre-processing script, it is now possible to have variable character spacing AND hyphenation operating at the same time. However, the current pre-processing script is slower than it should be: it builds and re-reads the hyphenation dictionary on every run, and it applies the changes in a way whose cost grows with the length of the hyphenation file. Concept script: InsertOptionalHyphens(Concept).py.txt

So it would be good to do this natively within PTXprint to speed up the process considerably.

When creating the hyphenation dictionary, it would be good to have an option (ON by default) to "only include hyphenation data that has been approved in Paratext" (approved entries are marked with an asterisk in the hyphenatedWords.txt file). Otherwise we would pick up all the hyphenation SUGGESTIONS, even for small words, which may not be useful.
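
A minimal sketch of how the approved-only option might work. It assumes each entry in hyphenatedWords.txt sits on its own line, marks break points with "=", and carries a leading "*" when approved; it also assumes header or comment lines start with "#". The names load_hyphenation and approved_only are hypothetical and not existing PTXprint API.

```python
def load_hyphenation(lines, approved_only=True):
    """Build {lowercased word: tuple of break offsets} from hyphenation entries.

    Assumed entry format (not verified against every Paratext version):
    "*hy=phen=ate" is approved, "hy=phen=ate" is only a suggestion.
    """
    table = {}
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # assumed comment/header lines
        approved = entry.startswith("*")
        if approved:
            entry = entry[1:]
        if approved_only and not approved:
            continue  # drop unapproved suggestions
        parts = entry.split("=")
        if len(parts) < 2 or not all(parts) or " " in entry:
            continue  # no break points, or a malformed line
        offsets = []
        pos = 0
        for part in parts[:-1]:
            pos += len(part)
            offsets.append(pos)  # character index where a break may occur
        # The key is lowercased because the file is not case-sensitive.
        table["".join(parts).lower()] = tuple(offsets)
    return table


sample = ["# header", "*hy=phen=ate", "un=ap=proved", "*Je=ru=sa=lem", "word"]
approved = load_hyphenation(sample)
everything = load_hyphenation(sample, approved_only=False)
```

Storing break offsets rather than pre-hyphenated strings means the table can be built once and reused, and the breaks can later be applied to a word in whatever case it appears.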

Another complication is that the hyphenatedWords.txt file (which Paratext generates) is not case-sensitive, but adding optional hyphenation points needs to preserve the original case of each word in the target USFM text.

Perhaps the kind of approach below would work, but I'm also concerned about hyphenated words, which can be a mixture of the two cases. It may be best NOT to add any optional hyphenation points to words that are already hyphenated.

import re

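def case_sensitive_replace(replacement, match):
    # Mirror the case pattern of the matched text onto the replacement.
    matched = match.group(0)
    if matched.isupper():
        return replacement.upper()       # HELLO -> HI
    if matched[:1].isupper():
        return replacement.capitalize()  # Hello -> Hi
    return replacement.lower()           # hello -> hi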

def replace_word(original_string, pattern, replacement):
    return re.sub(pattern, lambda match: case_sensitive_replace(replacement, match), original_string, flags=re.IGNORECASE)

original_string = "Hello World. hello world."
pattern = "hello"
replacement = "hi"

result = replace_word(original_string, pattern, replacement)
print(result)
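
An alternative that sidesteps case conversion altogether is to look each word up by its lowercased form and splice the soft hyphens into the original text, so the case is never touched. The sketch below is illustrative only: add_soft_hyphens, TOKEN and the offsets table are hypothetical names, the table is assumed to map a lowercased word to its break offsets, and the marker pattern is a rough approximation of USFM markers rather than a full parser.

```python
import re

SOFT_HYPHEN = "\u00AD"

# Either a USFM-style marker (left untouched), or a run of letters that may
# be joined by hard or soft hyphens so "well-known" is seen as ONE token.
TOKEN = re.compile(r"\\\+?[A-Za-z0-9]+\*?|[^\W\d_]+(?:[-\u00AD][^\W\d_]+)*")


def add_soft_hyphens(text, table):
    """Insert soft hyphens at the offsets in table, keeping original case."""
    def fix(match):
        word = match.group(0)
        if word.startswith("\\"):
            return word  # assumed marker such as \v or \add*
        if "-" in word or SOFT_HYPHEN in word:
            return word  # already hyphenated: leave alone
        offsets = table.get(word.lower())
        if not offsets or offsets[-1] >= len(word):
            return word  # unknown word, or offsets do not fit
        pieces = []
        start = 0
        for off in offsets:
            pieces.append(word[start:off])
            start = off
        pieces.append(word[start:])
        return SOFT_HYPHEN.join(pieces)

    # One pass over the text with a dictionary lookup per word, so the cost
    # does not grow with the size of the hyphenation file.
    return TOKEN.sub(fix, text)


table = {"hyphenate": (2, 6), "jerusalem": (2, 4, 6)}
text = "\\v 1 Jerusalem will HYPHENATE well-known words"
result = add_soft_hyphens(text, table)
```

Because words that already contain a hard or soft hyphen are skipped, running the function a second time over its own output should leave the text unchanged.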

sillsdev / ptx2pdf

Make character spacing with hyphenation possible natively #942