jayqi / spongebob

SPoNgeBOb-CAse cONveRSioN ToOLs
https://jayqi.github.io/spongebob/

spongebobsay incorrectly formats speech bubble when using UTF-8 characters > 1 byte #13

Closed jayqi closed 5 years ago

jayqi commented 5 years ago

When using certain Unicode characters, the spongebobsay family of functions will pad with whitespace incorrectly when forming the speech bubble.

To reproduce:


library(spongebob)
foo <- paste(
    paste0("pokémon", paste(rep(".", 25), collapse = ""))
    , paste0("pokemon", paste(rep(".", 30), collapse = ""))
)
cat(spongebobsay(foo))
#>  --------------------------------------- 
#> | poKémOn.........................     |
#> | pokEMOn.............................. |
#>  --------------------------------------- 
#>   \\
#>    \\    *
#>           *
#>      ----//-------
#>      \..C/--..--/ \   `A
#>       (@ )  ( @) \  \// |w
#>        \          \  \---/
#>         HGGGGGGG    \    /`
#>         V `---------`--'
#>             <<    <<
#>            ###   ###

Created on 2019-02-03 by the reprex package (v0.2.1)

The problem is that we are using sprintf under the hood. For example, the format code %-35s says to print a string and then pad it with whitespace on the right to fill a fixed width of 35.
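As a small illustration of that format code, here is a sketch using a shorter width and a pure-ASCII string, where bytes and characters coincide. The width of 10 and the trailing "|" marker are chosen only for the example.

```r
# "%-10s" left-justifies the string and pads on the right to a width of 10.
# The trailing "|" just makes the padding visible.
padded <- sprintf("%-10s|", "abc")
print(padded)
nchar(padded)  # 10 padded columns plus the "|" marker
```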

Unfortunately, the sprintf documentation says:

Field widths and precisions of %s conversions are interpreted as bytes, not characters, as described in the C standard.

The POSIX standard specifies the same behavior.

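This means that any UTF-8 character that is represented by more than 1 byte will have its width incorrectly counted by sprintf. In the example above, "é" takes 2 bytes, so the first line is treated as one column wider than it appears and ends up padded one space short.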
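The mismatch between bytes and characters can be checked directly with nchar. This is only a sketch. The string is written with a \u escape so that it is UTF-8 encoded regardless of how the source file is saved.

```r
# "pok\u00e9mon" is "pokémon"; the escape yields a UTF-8 encoded string.
x <- "pok\u00e9mon"

n_chars <- nchar(x, type = "chars")  # number of characters
n_bytes <- nchar(x, type = "bytes")  # number of bytes in the UTF-8 encoding

# The two counts differ by one because "é" needs two bytes in UTF-8.
n_bytes - n_chars
```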

We'll need an alternative way to pad strings with whitespace that counts characters rather than bytes, possibly a custom function.
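One possible shape for such a function is sketched below. The name pad_right is hypothetical and not part of spongebob, and this is not necessarily the fix that was adopted. It assumes that counting characters with nchar(type = "chars") is enough, which may not hold for characters that occupy two terminal columns (nchar(type = "width") may be a better fit there).

```r
# Hypothetical helper: right-pad each string with spaces to `width`,
# counting characters rather than bytes.
pad_right <- function(x, width) {
    # Number of spaces still needed for each element; never negative,
    # so strings already longer than `width` are returned unchanged.
    n_missing <- pmax(width - nchar(x, type = "chars"), 0L)
    paste0(x, strrep(" ", n_missing))
}

lines <- c("pok\u00e9mon", "pokemon")
padded <- pad_right(lines, 10)

# Both elements should now have the same number of characters.
nchar(padded, type = "chars")
```

Existing packages also offer character-aware padding, for example stringr::str_pad, at the cost of an extra dependency.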

(Note: for a good primer on character encodings, read Joel Spolsky's seminal article.)