Make Data.String.CodePoints the default

hdgarrood commented 6 years ago

@michaelficarra originally suggested this and I agree; I think Data.String.CodePoints should really be the default. Unless you're certain you won't be working with anything outside the Basic Multilingual Plane, and you've identified string manipulations as a performance bottleneck, you should really be using the functions in Data.String.CodePoints.

For the functions whose type signatures are the same across both modules, like length :: String -> Int, this has the potential to be quite problematic, so I think we need to be quite careful about it. I'd suggest the following:

In the next breaking release:
- we create a module Data.String.CodeUnits, with the exact same exports as the current Data.String,
- we add a notice at the very top of Data.String, detailing that the functions within currently operate on code units, not code points; that this will change in the next breaking release; and that you should very probably be using Data.String.CodePoints instead (unless you are sure you want to operate on code units, in which case you can use Data.String.CodeUnits)
In the breaking release after that one:
- change Data.String so that it re-exports everything from Data.String.CodePoints
- remove the notices
- consider deprecating the Data.String.CodePoints module, for removal in a subsequent breaking release?

garyb commented 6 years ago

I've taken a slightly different approach in the 0.12 branch as it is:

Data.String and Data.String.NonEmpty only export functions that are codepoint/codeunit agnostic
Data.String.CodeUnits now exists, with all the relevant functions moved there (likewise for NES)
the .CodeUnits and .CodePoints modules re-export the agnostic stuff too

This means people will be forced to choose between CodeUnit and CodePoint functions at least, and it avoids the potential future problem of people not noticing the switch if Data.String changed which set of functions it re-exports.

So for most people the migration path now will just be replacing import Data.String with import Data.String.CodeXXX.

Sound good?

kritzcreek commented 6 years ago

I'd prefer it to re-export the functions from Data.String.CodePoints from Data.String to have a sensible default for import Data.String. Codepoints and Codeunits are pretty technical terms that not everyone will know about and we should strive to make it easy to do the right thing (which means using CodePoints in this case).

michaelficarra commented 6 years ago

@garyb While we're breaking things for 0.12, let's also make Data.Char.toUpper and toLower return a string. Sorry for derailing.

garyb commented 6 years ago

Sure! Have you got an example of where that happens? (just curious)

kritzcreek commented 6 years ago

'ß'.toUpperCase()
"SS"

michaelficarra commented 6 years ago

@garyb See https://unicode.org/Public/UNIDATA/SpecialCasing.txt.

garyb commented 6 years ago

Nice, thanks!

garyb commented 6 years ago

I think we can just drop the Char module actually - doing the case alteration can be done in String form, so having a Char -> String version is only a miniscule ergonomic win. There are no other functions in that module now, since fromCharCode / toCharCode became toEnum / fromEnums instead.

kritzcreek commented 6 years ago

I disagree with

fromCharCode -> toEnumWithDefault bottom top

There's just no good way to discover that.

garyb commented 6 years ago

Hmm maybe... although it can be documented somewhere.

fromCharCode was a lie, that function should always have been returning Maybe Char which is how toEnum works at least.

hdgarrood commented 6 years ago

Now would be a good time to address that and have it return a Maybe Char then, surely? I don’t think redundancy is necessarily bad.

kritzcreek commented 6 years ago

I made the change to have Data.String reexport Data.String.CodePoints in: 1fbc4c0cf0fb816870a6841fa83c5cbdcddaaf22

hdgarrood commented 5 years ago

My understanding of what we have now is:

Data.String.Common contains functions which behave in the same way regardless of whether we are considering strings as sequences of code points or code units
Data.String.Code{Points,Units} contain functions whose behaviour differs based on whether we are considering strings as sequences of code points or code units
Data.String re-exports the entirety of Data.String.Common and Data.String.CodePoints

If this is correct, I'm happy and I think we can close this?

garyb commented 5 years ago

Yeah, that's right 👍

purescript / purescript-strings

Make Data.String.CodePoints the default #95