purescript / purescript-strings

String utility functions, Char type, regular expressions.
BSD 3-Clause "New" or "Revised" License
54 stars 71 forks source link

Make Data.String.CodePoints the default #95

Closed hdgarrood closed 5 years ago

hdgarrood commented 6 years ago

@michaelficarra originally suggested this and I agree; I think Data.String.CodePoints should really be the default. Unless you're certain you won't be working with anything outside the Basic Multilingual Plane, and you've identified string manipulations as a performance bottleneck, you should really be using the functions in Data.String.CodePoints.

For the functions whose type signatures are the same across both modules, like length :: String -> Int, this has the potential to be quite problematic, so I think we need to be quite careful about it. I'd suggest the following:

garyb commented 6 years ago

I've taken a slightly different approach in the 0.12 branch as it is:

This means people will be forced to choose between CodeUnit and CodePoint functions at least, and it avoids the potential future problem of people not noticing the switch if Data.String changed which set of functions it re-exports.

So for most people the migration path now will just be replacing import Data.String with import Data.String.CodeXXX.

Sound good?

kritzcreek commented 6 years ago

I'd prefer it to re-export the functions from Data.String.CodePoints from Data.String to have a sensible default for import Data.String. Codepoints and Codeunits are pretty technical terms that not everyone will know about and we should strive to make it easy to do the right thing (which means using CodePoints in this case).

michaelficarra commented 6 years ago

@garyb While we're breaking things for 0.12, let's also make Data.Char.toUpper and toLower return a string. Sorry for derailing.

garyb commented 6 years ago

Sure! Have you got an example of where that happens? (just curious)

kritzcreek commented 6 years ago
'ß'.toUpperCase()
"SS"
michaelficarra commented 6 years ago

@garyb See https://unicode.org/Public/UNIDATA/SpecialCasing.txt.

garyb commented 6 years ago

Nice, thanks!

garyb commented 6 years ago

I think we can just drop the Char module actually - doing the case alteration can be done in String form, so having a Char -> String version is only a miniscule ergonomic win. There are no other functions in that module now, since fromCharCode / toCharCode became toEnum / fromEnums instead.

kritzcreek commented 6 years ago

I disagree with

fromCharCode -> toEnumWithDefault bottom top

There's just no good way to discover that.

garyb commented 6 years ago

Hmm maybe... although it can be documented somewhere.

fromCharCode was a lie, that function should always have been returning Maybe Char which is how toEnum works at least.

hdgarrood commented 6 years ago

Now would be a good time to address that and have it return a Maybe Char then, surely? I don’t think redundancy is necessarily bad.

kritzcreek commented 6 years ago

I made the change to have Data.String reexport Data.String.CodePoints in: 1fbc4c0cf0fb816870a6841fa83c5cbdcddaaf22

hdgarrood commented 5 years ago

My understanding of what we have now is:

If this is correct, I'm happy and I think we can close this?

garyb commented 5 years ago

Yeah, that's right 👍