vlang / v

Simple, fast, safe, compiled language for developing maintainable software. Compiles itself in <1s with zero library dependencies. Supports automatic C => V translation. https://vlang.io
MIT License

Support splitting Strings into Unicode Grapheme Cluster #22117

Open peppergrayxyz opened 3 months ago

peppergrayxyz commented 3 months ago

Describe the feature

When working with Unicode, we usually don't care about the bytes, and we usually don't care about the code points (runes) either. What we mostly care about is the characters displayed on screen (grapheme clusters). Unicode provides an algorithm to split strings into grapheme clusters, the user-perceived characters that matter for layout and alignment. This feature request is about adding grapheme cluster splitting to builtin.
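To make the distinction concrete, here is a minimal sketch of what V reports today for a single user-perceived character built from two code points. It uses only `string.len` and `string.runes()`, and the byte count assumes V's usual UTF-8 string encoding.

```v
fn main() {
	// 'n' followed by U+0303 COMBINING TILDE, rendered as a single "ñ".
	s := '\u006E\u0303'
	println(s.len) // number of bytes: 3
	println(s.runes().len) // number of code points (runes): 2
	// Neither count matches the one character a reader sees on screen.
}
```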

Use Case

Anyone working with a UI, who wants to know:

Example:

This text should be right aligned:

examples := [
    '\u006E\u0303',
    '\U0001F3F3\uFE0F\u200D\U0001F308',
    'ห์',
    'ปีเตอร์',
]

println('0123456789abcdefgh')
for text in examples {
    println('${text:10}')
}
0123456789abcdefgh
         ñ
    🏳️‍🌈
        ห์
   ปีเตอร์

But it isn't.

Proposed Solution

Add a feature to split a string into graphemes

hello := 'Hello World 🏳️‍🌈'
hello_graphemes := hello.graphemes() // [`H`, `e`, `l`, `l`, `o`, ` `, `W`, `o`, `r`, `l`, `d`, ` `, `🏳️‍🌈`]
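As a rough illustration of what such a splitter could do, the sketch below groups code points into clusters with a heavily simplified rule set. It is not the Unicode segmentation algorithm (UAX #29): it only recognises a handful of hard-coded ranges (Latin combining marks, Thai marks, variation selectors, the zero width joiner) and joins whatever follows a ZWJ. The names `is_extender`, `simple_graphemes` and `pad_left` are hypothetical and not part of V's builtin; a real implementation would use the full Unicode property tables.

```v
// is_extender reports whether r attaches to the preceding code point.
// ASSUMPTION: a tiny hand-picked subset of ranges, not the UAX #29 tables.
fn is_extender(r rune) bool {
	c := u32(r)
	if c >= 0x0300 && c <= 0x036F {
		return true // Combining Diacritical Marks
	}
	if c == 0x0E31 || (c >= 0x0E34 && c <= 0x0E3A) {
		return true // Thai vowel signs written above or below
	}
	if c >= 0x0E47 && c <= 0x0E4E {
		return true // Thai tone marks and related signs
	}
	if c >= 0xFE00 && c <= 0xFE0F {
		return true // variation selectors
	}
	return c == 0x200D // zero width joiner
}

// simple_graphemes splits s into approximate grapheme clusters.
// A new cluster starts at every code point that is not an extender,
// unless the previous code point was a zero width joiner.
fn simple_graphemes(s string) []string {
	mut clusters := []string{}
	mut current := []rune{}
	mut after_zwj := false
	for r in s.runes() {
		if current.len > 0 && !after_zwj && !is_extender(r) {
			clusters << current.string()
			current = []rune{}
		}
		current << r
		after_zwj = u32(r) == 0x200D
	}
	if current.len > 0 {
		clusters << current.string()
	}
	return clusters
}

// pad_left right-aligns s in a field of `width` clusters.
// ASSUMPTION: every cluster occupies one column, which does not hold
// for wide glyphs such as most emoji or CJK characters.
fn pad_left(s string, width int) string {
	n := simple_graphemes(s).len
	if n >= width {
		return s
	}
	return ' '.repeat(width - n) + s
}

fn main() {
	hello := 'Hello World \U0001F3F3\uFE0F\u200D\U0001F308'
	println(simple_graphemes(hello).len) // 13
	for text in ['\u006E\u0303', 'ห์', 'ปีเตอร์'] {
		println(pad_left(text, 10))
	}
}
```

Counting clusters only fixes the alignment of combining sequences. Correct terminal alignment also needs a per-cluster display width, which is a separate question from splitting.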

Current Behavior

examples := [
    '\u006E\u0303',
    '\U0001F3F3\uFE0F\u200D\U0001F308',
    'ห์',
    'ปีเตอร์',
]

for text in examples {
    println('0123456789abcdefgh')
    println(text)
    println(text.runes())
}
0123456789abcdefgh
ñ
[`n`, `̃`]
0123456789abcdefgh
🏳️‍🌈
[`🏳`, `️`, `‍`, `🌈`]
0123456789abcdefgh
ห์
[`ห`, `์`]
0123456789abcdefgh
ปีเตอร์
[`ป`, `ี`, `เ`, `ต`, `อ`, `ร`, `์`]

Proposed behavior:

examples := [
    '\u006E\u0303',
    '\U0001F3F3\uFE0F\u200D\U0001F308',
    'ห์',
    'ปีเตอร์',
]

for text in examples {
    println('0123456789abcdefgh')
    println(text)
    println(text.graphemes())
}
0123456789abcdefgh
ñ
[`ñ`]
0123456789abcdefgh
🏳️‍🌈
[`🏳️‍🌈`]
0123456789abcdefgh
ห์
[`ห์`]
0123456789abcdefgh
ปีเตอร์
[`ปี`, `เ`, `ต`, `อ`, `ร์`]

Further suggestions

e.g.

string[n] ... access n-th grapheme
string.len ... number of graphemes
string.bytes()[n] ... access n-th byte
string.bytes().len ... number of bytes

Other Information

Unicode Reference and some more info on the background

This feature would also fix this bug:

Version used

0.4.7

Environment details (OS name and version, etc.)

V full version: V 0.4.7 7baff15
OS: linux, "Manjaro Linux"
Processor: 16 cpus, 64bit, little endian, AMD Ryzen 7 7840U w/ Radeon  780M Graphics

getwd: /home/pepper
vexe: /usr/lib/vlang/v
vexe mtime: 2024-08-26 17:34:57

vroot: NOT writable, value: /usr/lib/vlang
VMODULES: OK, value: /home/pepper/.vmodules
VTMP: OK, value: /tmp/v_1000

Git version: git version 2.46.0
Git vroot status: Error: fatal: not a git repository (or any of the parent directories): .git
.git/config present: false

CC version: cc (GCC) 14.2.1 20240805
thirdparty/tcc status: thirdparty-linux-amd64 0134e9b9-dirty

Wajinn commented 2 months ago

You may want to take a look at uniseg, or possibly consult with magic003 if available.