rust-lang / libs-team

The home of the library team
Apache License 2.0
110 stars 18 forks source link

Add nearly in-place case change methods to String #316

Open krtab opened 7 months ago

krtab commented 7 months ago

Proposal

Problem statement

As of today, changing the case of a String is done by using methods defined on str, which implies that a new buffer is allocated each time.

Yet, in most use cases, upper/lower-casing of UTF-8 strings can be done in place. Indeed, the set of codepoints which needs strictly more or less bytes to encode when their case is changed is small, and even more so when considering commonly used languages (cf https://raw.githubusercontent.com/krtab/utf8caseinplace/master/stats/out.txt).

Hence, this APC proposes that a new API is added to String to change cases and do so efficiently by consuming self and reusing the buffer, not allocating in most cases.

Motivating examples or use cases

Case changes that can consume the String they are working on include:

And likely other I couldn't search for.

Solution sketch

This would add the following methods (names to be determined)

impl String {
    fn into_uppercase(self) -> String;
    fn into_lowercase(self) -> String;
}

The exact implementation remains to be discussed, but the idea would be that in cases where it is possible, the case change is done in place. Once that isn't possible, a auxiliary DE-queue can be used to store bytes temporarily.

Alternatives

This could be done as a crate but people would use it much less and resort to str::to_othercase instead. Moreover, availability on crates.io of an up-to-date version of the Unicode database allowing correct case change for all situations is not guaranteed, and core::unicode methods used in str::to_othercase are not public. Finally, having both implementations in std help keep them in sync an iso-behaved.

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

Second, if there's a concrete solution:

CryZe commented 7 months ago

If everything is done in place, then the methods really only need to take &mut self instead of transferring ownership. Otherwise I really like this proposal.

krtab commented 7 months ago

I was a bit unclear: the idea is that you still take a String to handle the case where you have to allocate more room, but that that case should be uncommon.

Another option though would be to actually work in place and leave characters whose case cannot be changed as is or replace them with a default character. This can make sense in the use case situation where you are changing case to compare to a default value (which is a common use case, even though the proper thing to do would be normalization) for example s.to_uppercase() == "TRUE"

cuviper commented 7 months ago

I was a bit unclear: the idea is that you still take a String to handle the case where you have to allocate more room, but that that case should be uncommon.

You can still use &mut self with that uncommon case -- just create the new string and replace the old when done.

krtab commented 7 months ago

My bad, I was too focused on &mut str and didn't correctly understood you guys were talking about &mut String, you are right.

jdahlstrom commented 5 months ago

String::make_othercase(&mut self) would nicely complement str::make_ascii_othercase(&mut self).