jquast / wcwidth

Python library that measures the width of unicode strings rendered to a terminal

Propose new function, width(control_codes='ignore') #79

Open jquast opened 11 months ago

jquast commented 11 months ago

Problem

As for the need for a "width" function: just about every downstream library has some issue with the POSIX wcwidth() and wcswidth() functions, either in C or in this Python library.

This is mainly because both functions may return -1, and the return value must be checked, but it often is not.

Although using wcswidth() on a string is the most popular use case, it has the possibility to return -1 by POSIX definition, and Markus Kuhn's 2007 implementation returns -1 for control characters.

The return value is often passed unchecked to sum(), slice(), or screen-positioning functions, with surprising results.
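A minimal sketch of the pitfall, using only the standard library. The helpers `char_width` and `string_width` are hypothetical, simplified stand-ins for wcwidth() and wcswidth(); they are not this library's actual tables, and are only meant to reproduce the -1 behaviour described above.

```python
import unicodedata


def char_width(ch):
    """Hypothetical, simplified stand-in for wcwidth()."""
    cp = ord(ch)
    if cp == 0:
        return 0
    if cp < 32 or 0x7F <= cp < 0xA0:
        return -1  # control characters are reported as unmeasurable
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def string_width(text):
    """Hypothetical stand-in for wcswidth(): -1 if any character is -1."""
    total = 0
    for ch in text:
        w = char_width(ch)
        if w < 0:
            return -1
        total += w
    return total


cells = ["name", "コンニチハ", "bad\tcell"]
widths = [string_width(c) for c in cells]  # [4, 10, -1]

# The -1 is an error marker, not a width, but sum() happily adds it in.
unchecked_total = sum(widths)  # 13
```

The total comes out one less than the width of the two measurable cells, and nothing signals that the third cell was never measured at all.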

Solution

Provide a new function, width(), that always returns a "best effort" measurement. It may either ignore or measure control codes. If "catching unexpected control codes" is a desired feature, we can continue to provide it through an optional keyword argument and, rather than returning -1, raise an exception.

Perhaps a new keyword argument, control_codes, with default value 'ignore', in a similar spirit to the 'errors' argument of bytes.decode: https://docs.python.org/3/library/stdtypes.html#bytes.decode
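A rough sketch of what such a function could look like, under stated assumptions. Only the name width() and the control_codes='ignore' default come from this proposal; the 'strict' mode name (borrowed from bytes.decode), the ValueError, and the `char_width` helper are hypothetical, and `char_width` is a simplified stand-in for wcwidth(), not the library's implementation.

```python
import unicodedata


def char_width(ch):
    """Hypothetical, simplified stand-in for wcwidth()."""
    cp = ord(ch)
    if cp == 0:
        return 0
    if cp < 32 or 0x7F <= cp < 0xA0:
        return -1  # control characters
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def width(text, control_codes="ignore"):
    """Best-effort width; never returns -1.

    'ignore' counts control characters as zero cells.
    'strict' (an assumed mode name) raises instead of returning -1.
    """
    if control_codes not in ("ignore", "strict"):
        raise ValueError(f"unknown control_codes mode: {control_codes!r}")
    total = 0
    for ch in text:
        w = char_width(ch)
        if w < 0:
            if control_codes == "strict":
                raise ValueError(f"control character {ch!r} in text")
            continue
        total += w
    return total
```

With this shape, width("abc\tdef") gives 6 by default, while width("abc\tdef", control_codes="strict") raises, so a caller who forgets to check the result can no longer end up with a negative number.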

Workaround

As a workaround, I have suggested calling wcwidth() directly on each individual character and clipping the possible -1 return value to 0. Example: https://github.com/jquast/blessed/blob/a34c6b1869b4dd467c6d1ab6895872bb72db7e0f/blessed/sequences.py#L364

This does the same job as wcswidth() but gives a "best guess". However, this method cannot handle the coming changes to wcswidth() for zero width joiner (ZWJ) sequences.
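The workaround and its limitation can be sketched as follows. `char_width` is again a hypothetical, simplified stand-in for wcwidth() and `clipped_width` is an illustrative name; neither is part of this library.

```python
import unicodedata


def char_width(ch):
    """Hypothetical, simplified stand-in for wcwidth()."""
    cp = ord(ch)
    if cp == 0:
        return 0
    if cp < 32 or 0x7F <= cp < 0xA0:
        return -1  # control characters
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters such as ZWJ
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def clipped_width(text):
    """Sum per-character widths, clipping any -1 to 0."""
    return sum(max(0, char_width(ch)) for ch in text)


# Control characters no longer poison the result.
newline_case = clipped_width("abc\n")  # 3

# A ZWJ sequence: man + ZWJ + woman + ZWJ + girl.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
# Each emoji is measured on its own (2 + 0 + 2 + 0 + 2), because a
# per-character loop cannot know the sequence is joined into one glyph.
family_case = clipped_width(family)  # 6 with this stand-in
```

A terminal that renders the joined sequence as a single emoji would typically use fewer cells than the per-character sum, which is why a string-level function is needed for ZWJ handling.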

GalaxySnail commented 11 months ago

Just out of curiosity, are there any real-world examples where this enhancement would be beneficial?

jquast commented 11 months ago

I have struck through the suggestion about terminal sequences; I will save that for another issue.

As for the need for a "width" function: just about every downstream library has some issue with the POSIX wcwidth() and wcswidth() functions.

This is mainly because both functions may return -1, and the return value must be checked, but it often is not.

And I think all downstream users wish for us to have a single function that makes a "best effort". If a zero width joined emoji sequence also contains a newline or other control character, it is best to just return our best estimate of the measurement rather than -1, as wcswidth() does.

wcswidth()

Although using wcswidth() on a string is the most popular use case, it has the possibility to return -1 by POSIX definition, and Markus Kuhn's 2007 implementation returns -1 for control characters, such as chr(1) through chr(31).

wcwidth()

As a workaround, I have suggested calling wcwidth() directly on each individual character and clipping the possible -1 return value to 0. Example: https://github.com/jquast/blessed/blob/a34c6b1869b4dd467c6d1ab6895872bb72db7e0f/blessed/sequences.py#L364

This does the same job as wcswidth() but gives a "best guess". However, this method cannot handle the coming changes to wcswidth() for zero width joiner (ZWJ) sequences.

jquast commented 11 months ago

Although I am open to changing wcswidth() to never return -1 and make a "best effort", that would deviate from the original 2007 implementation and the POSIX specification. This is why I suggest an entirely new function name, and that the docstrings of wcswidth() and wcwidth() strongly recommend it as the best alternative.

GalaxySnail commented 11 months ago

Thank you for the clarification!

jquast commented 10 months ago

I have created it in a development branch, but I will make a bugfix release first and then open a PR for this: https://github.com/jquast/wcwidth/blob/1f1443b7af38b9e1b36a895b5d998f511021d377/wcwidth/wcwidth.py#L262-L277

jquast commented 10 months ago

I have revised this description and the related issue #92.

And I do think they are closely related. A control character like \b is just as much a terminal sequence as \x1b[0m. Ignoring the '\x1b' alone is not enough: I think we should measure the full sequence \x1b[0m as 0 instead of 3 (per-character widths 0, 1, 1, 1). We should also provide a choice for ambiguous characters like '\b' and/or '\x1b[D', measuring them as either -1 (moving backwards, 'parse') or 0 (ignored).
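One way this could work, as a sketch only. The mode names 'ignore' and 'parse' come from the comment above; the function name `sequence_aware_width`, the CSI regular expression, and the simplification that every other printable character occupies one cell are assumptions for illustration, not the library's behaviour.

```python
import re

# CSI sequences such as "\x1b[0m" or "\x1b[D":
# ESC [, parameter bytes, intermediate bytes, one final byte.
CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def sequence_aware_width(text, control_codes="ignore"):
    """Measure text, treating a whole CSI sequence as one unit.

    Simplification: each non-control character counts as one cell,
    which is only correct for narrow (e.g. ASCII) text.
    """
    total = 0
    pos = 0
    while pos < len(text):
        match = CSI.match(text, pos)
        if match:
            # The full sequence measures 0, not just its ESC byte.
            if control_codes == "parse" and match.group() == "\x1b[D":
                total -= 1  # cursor-left moves one cell backwards
            pos = match.end()
            continue
        ch = text[pos]
        cp = ord(ch)
        if ch == "\b":
            if control_codes == "parse":
                total -= 1  # backspace moves one cell backwards
        elif cp >= 32 and not 0x7F <= cp < 0xA0:
            total += 1
        pos += 1
    return total
```

For example, sequence_aware_width("\x1b[0mhi") gives 2 rather than 5, and "abc\b" measures 3 with 'ignore' but 2 with 'parse'.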