joshy / striprtf

Stripping rtf to plain old text
http://striprtf.dev
BSD 3-Clause "New" or "Revised" License

\- (soft-hyphen) and \_ (non-breaking hyphen)? #44

Closed stevengj closed 1 year ago

stevengj commented 1 year ago

Right now, the code supports \~ for non-breaking space, but the RTF spec (1.9.1) lists a couple of other special characters that could be easily supported in the same way:

[image: the special characters listed in the RTF spec]

In particular, it seems that \- (optional hyphen) should map to U+00AD (soft hyphen), and \_ (non-breaking hyphen) should map to U+2011 (non-breaking hyphen).

For example, https://github.com/joshy/striprtf/blob/751f8eda03afb034039759921af61fa811aca140/striprtf/striprtf.py#L155-L168 could be simplified to something like:

if char in specialchars:
    if not ignorable:
        out += specialchars[char]
elif char == "*":
    ignorable = True

and then just add all the other cases as additional entries in specialchars.
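
To make the suggestion concrete, here is a minimal sketch of what such a table and the simplified branch might look like. It is not the library's actual code: the names `SPECIALCHARS` and `handle_control_symbol` are hypothetical, and it assumes `out` is accumulated as a `str` (if the parser builds `bytes`, the values would need encoding first). The brace and backslash entries are also an assumption about which escapes the existing branch covers.

```python
# Hypothetical lookup table: RTF control symbol -> replacement text.
SPECIALCHARS = {
    "~": "\u00a0",  # \~  non-breaking space
    "-": "\u00ad",  # \-  optional (soft) hyphen
    "_": "\u2011",  # \_  non-breaking hyphen
    "{": "{",       # \{  literal brace (assumed to be handled here)
    "}": "}",       # \}  literal brace (assumed to be handled here)
    "\\": "\\",     # \\  literal backslash (assumed to be handled here)
}


def handle_control_symbol(char, ignorable, out):
    """Process the control symbol \\<char> and return the updated (out, ignorable)."""
    if char in SPECIALCHARS:
        # Emit the replacement only when not inside an ignorable destination.
        if not ignorable:
            out += SPECIALCHARS[char]
    elif char == "*":
        # \* marks the enclosing destination as ignorable.
        ignorable = True
    return out, ignorable


# Usage: "co" followed by \- gains a soft hyphen.
out, ignorable = handle_control_symbol("-", False, "co")
```

With this shape, supporting another control symbol is a one-line addition to the table rather than a new `elif` branch.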