move-language / move

Apache License 2.0
2.26k stars 689 forks source link

std::string Functions Should be Indexed by Char Count, Not Utf8 Bytes #842

Open PaulFidika opened 1 year ago

PaulFidika commented 1 year ago

Unfortunately std::string was written such that its indices refer to bytes of the underlying string, rather than characters of the underlying string. Unlike ascii, these two are not the same thing because std::string uses utf8 encodings, and utf8 encodings can span multiple bytes (1 - 4 bytes). This makes the API confusing and needlessly complex:

    /// Returns a sub-string using the given byte indices, where `i` is the first byte position and `j` is the start
    /// of the first byte not included (or the length of the string). The indices must be at valid utf8 char boundaries,
    /// guaranteeing that the result is valid utf8.
    public fun sub_string(s: &String, i: u64, j: u64): String {

how is an application-developer supposed to know which bytes in a string correspond to utf8 char boundaries? For this to be useable the application-developer would have to write their own custom implementation of utf8 in Move right into their package, which is way too much work for such a rudimentary operation.

Similarly, std::string::length refers to the number of bytes, NOT the number of characters, which is again, confusing. Similarly std::string::insert again refers to indices of the underlying byte-array, NOT the character array.

wrwg commented 1 year ago

The way how string is using byte and not character indices is exactly the same as Rust does it. Its a way to implement utf8 efficiently. There is no need to actually index characters. All indices are obtained by other string functions. You are not supposed to index characters directly. There is no need for this in most practical scenarios. Closing as working as intended.

PaulFidika commented 1 year ago

The way how string is using byte and not character indices is exactly the same as Rust does it. Its a way to implement utf8 efficiently. There is no need to actually index characters. All indices are obtained by other string functions. You are not supposed to index characters directly. There is no need for this in most practical scenarios. Closing as working as intended.

Rust strings have char and char_indices methods, which allow you to enumerate over individual characters within a string, which is very useful. Right now it's not possible to enumerate through a string, unless you build a custom utf8 implementation into your package, because std::string::internal_is_char_boundary isn't a public function.

bytedeveloperr commented 1 year ago

@wrwg

For utf8, the bytes length is not always useful as the characters count because each individual character can be of variable length. This is unlike ASCII where the bytes length is the same as character length.

Most of the time we only just want to know the number of characters that makes up string, the length function will always work for ASCII but not always work for utf8.

How about having a count or chars_count function that returns the characters count? it doesn't affect the existing length function.

wrwg commented 1 year ago

I wonder whether in this case, we should just add fun next_char_boundary(&string, u64): u64. This allows to iterate over characters, and use substring to split those graphemes out into their own strings.

PaulFidika commented 1 year ago

I wonder whether in this case, we should just add fun next_char_boundary(&string, u64): u64. This allows to iterate over characters, and use substring to split those graphemes out into their own strings.

Yeah something like this would be helpful; ideally developers should be able to iterate over characters and manipulate strings without having to worry about the complexity of utf8 character encodings.

wrwg commented 1 year ago

Can you (a) please make a more reasonable title for this issue. The title is pretty bad. And (b) give some more reasons why we need this. What is your use case? I haven't seen many smart contracts which manipulate strings (way too expensive).

Normally, extensions to the stdlib go through really high scrutiny and are discussed in the community meeting first. Also, once we introduce a new native function, there is some extra work to pipe this through the layers correctly with gas, incremental rollout, etc. That is why we rarely take community contributions. There is a bunch of further string functions we want to add (specifically formatting) so instead of #849 this one should go probably in a batch with the other ones.

PaulFidika commented 1 year ago

Can you (a) please make a more reasonable title for this issue. The title is pretty bad. And (b) give some more reasons why we need this. What is your use case? I haven't seen many smart contracts which manipulate strings (way too expensive).

Normally, extensions to the stdlib go through really high scrutiny and are discussed in the community meeting first. Also, once we introduce a new native function, there is some extra work to pipe this through the layers correctly with gas, incremental rollout, etc. That is why we rarely take community contributions. There is a bunch of further string functions we want to add (specifically formatting) so instead of #849 this one should go probably in a batch with the other ones.

I renamed the issue; sorry for the original name! Too aggressive haha.

I've been building out Sui's NFT standard, and what I'm doing is grabbing the type-names (addr::module_name::struct_name) and using that for security in a lot of places, which involves a lot of string manipulation; for example parsing 0x1::coin::Coin<0x2::sui::SUI> into just 0x2::sui::SUI, or parsing user-input like "Option<vector>" into { optional: true, type: "vector" }. This kind of takes away a lot of the static-checking Move was built around, and makes the behavior more dynamic, which is a good thing and a bad thing...

Some examples I wrote here:

https://github.com/capsule-creator/capsules/blob/master/packages/sui_utils/sources/encode.move

I've been using ascii because ascii is easier to work with (1 byte = 1 char), but ascii is less gas-efficient (at least on Sui) than utf8; it might be a gas-mispricing or it might be due to utf8 being a native function while ascii is all written in Move. Either way I'd like to be able to deprecate ascii entirely from my code base and use utf8 exclusively (especially if it means cheaper gas) because it's easier to present one unified string type to users rather than try to have two string-modules and have users pick between ascii or utf8 (devs don't care about string implementation details; they just expect stuff to work).

Although as you've mentioned, I've noticed that string-operations are relatively expensive.

This is an example of a simple task to implement using utf8:

let i = 0;
let reverse = string::utf8(b"hello");
let len = string::char_count(&str);
while (i < len) {
   let char = string::sub_string_char(&str, i, i + 1);
   string::insert_char(&mut reverse, 1, char);
};

This iterates through the characters, and inserts them after the 'h' and before the 'ello' in 'hello', putting them in reverse order. In order to do this I propose 3 functions:

how do you feel about these? They're new functions so they don't break the existing API. Is there an issue / PR for the other string functions that you're working on? Yeah I'd love to bundle them all together; that makes sense.

What is the process for getting a PR included into the Move core? I imagine it's a pretty high bar because of the security risks implied. Thanks for your help!

wrwg commented 1 year ago

Thanks Paul, the use case sounds convincing. I approved the PR but also someone from Sui should look. In order to get the usable from Sui, more work has to be done as outlined in the PR.

wrwg commented 1 year ago

I forgot: yes, feel free to add the char functions, perhaps to the same PR which is already out. You need to negotiate with the Sui team (I added @tnowacki to the PR) on getting this into the Sui system. In Aptos it will likely land a few weeks later together with other new string functions for formatting (which still have to be written).