typelevel / case-insensitive

A case-insensitive string for Scala
Apache License 2.0
37 stars 15 forks source link

Series 2.x.x #247

Open isomarcte opened 2 years ago

isomarcte commented 2 years ago

I'm considering targeting https://github.com/typelevel/case-insensitive/pull/232 to a series/2.x.x branch.

@rossabaker @armanbilge any objections to me creating that branch and moving the code there?

armanbilge commented 2 years ago

👍 thanks for all your work on this. I think a series/2.x is the best path forward for the time being.

Since this is an issue about series/2.x in general, I'm thinking we should consider using a newtype for CIString (at least on Scala 3) now that we are getting a typeclass-based Map in https://github.com/typelevel/cats/pull/4185 which does not depend on universal equals/hashCode.

isomarcte commented 2 years ago

@armanbilge Do you just mean CIString encoded as an opaque type or are you talking about wrapping CIString again? And are you foreseeing us having hashCode differ from Hash?

armanbilge commented 2 years ago

Do you just mean CIString encoded as an opaque type

Yes, this. Which would mean that hashCode would differ from Hash.

isomarcte commented 2 years ago

Yeah, so in #232 that changes a bit. I changed the underlying hashCode/equals/compare to be based on a CanonicalFullCaseFoldedString.

There is still some discussion to have there, but all CanonicalFullCaseFoldedString values have the same representation (for the same input) and thus the same universal hashCode and equality. So CanonicalFullCaseFoldedString could be an opaque type, though it wouldn't need to for hash/equal reasons. And under this encoding CIString is effectively a specialized (String, CanonicalCaseFoldedString) so making it opaque doesn't gain you as much.

FWIW, I'm starting to be of the opinion we should be using CanonicalCaseFoldedString for most use cases where we use CIString. That is, I don't know if we need to hold a reference to the original representation as the normative case.

isomarcte commented 2 years ago

I hope that makes some amount of sense. Rereading my comment makes me feel like I didn't explain that well.

This is what I'm talking about: https://github.com/typelevel/case-insensitive/pull/232/files#diff-90a144a03717314cdaa1151276da76557be0f1e78c890580b57c9d95406f95f6R80

isomarcte commented 2 years ago

@armanbilge can you make series/1.x.x and series/2.x.x branches? I don't have access apparently.

armanbilge commented 2 years ago

Neither do I :)

armanbilge commented 2 years ago

@isomarcte to clarify, my proposal is we replace:

final class CIString private (override val toString: String)

with

opaque type CIString = String

If we do this, then the universal hashCode and equals for CIString will be those of String itself, because the opaque type is erased to String. This is why the instances will be incompatible with them.

armanbilge commented 2 years ago

Wait, sorry, I wasn't looking carefully. I see, CIString can no longer be represented as an opaque type, sorry! 🤔

Edit: wait, nope, I'm confused by all the various implementations 😬

isomarcte commented 2 years ago

Yeah, there are a lot of types in the Unicode standard. For our discussion here we just need to focus on CanonicalFullCaseFoldedString and CIString.

The new encoding of CIString on my branch is similar to this.

final case class CIString private (original: String, caselessString: CanonicalFullCaseFoldedString)

To make a cassless string, in general, you need to case fold it, then normalize it (this is simplified, there can be more steps depending on the type of caseless string).

Thus caselessString is the result of normalize(caseFold(original)).

The reason I changed the representation of CIString is because what we are doing in 1.x.x is incorrect for caseless strings. To get to a caseless string is more complicated than just toUpper.toLower on each char, so it seemed reasonable to hold a reference to the caseless string in CIString, rather than computing it each time we needed to do an operation on the class. That also allows us to use the CanonicalFullCaseFoldedString as the source of hashCode and equals.

There are other ways we could do this. We could have the caselessString be a def,

opaque type CIString = String

extension (inline ciString: CIString) {
  inline def caselessString: CanonicalFullCaseFoldedString = normalize(caseFold(ciString))
}

Then all the methods on CIString would need to convert to a CanonicalFullCaseFoldedString on each operation. We could cache the hash via a null init pattern as is currently done on main, but we couldn't easily do that for equality checks and the other methods.

All of these gymnastics are the result of us keeping around a reference to the original input string in CIString. That's why I'm leaning on making CanonicalFullCaseFoldedString (perhaps with a more succinct name) the default type we use, rather than CIString. CanonicalFullCaseFoldedString can be trivial represented as an opaque type, though we don't need to for hash/eq purposes.

armanbilge commented 2 years ago

Thanks for the detailed explanation, that makes sense to me. Regarding whether to cache or not, it probably makes sense? Not sure if there are situations where lots of CIStrings are allocated but no ops are called on them.

All of these gymnastics are the result of us keeping around a reference to the original input string in CIString.

The FAQ suggests there are Reasons™, but I'm not sure what they are :) https://typelevel.org/case-insensitive/#why-pay-for-a-wrapper-when-you-can-fold-case-before-storing

rossabaker commented 2 years ago

I added @armanbilge and @isomarcte to the team, so anyone should be able to create the necessary branches.

Not keeping the original would be a non-semantic, but highly visible, change to HTTP rendering. For example, all field names would be rendered lower case.

I don't know how much uptake CIString has seen outside http4s to evaluate how bad a binary breakage would be, but we know what immense pain the last source breakage was. I would very much prefer to avoid source-breaking changes to http4s in this.

armanbilge commented 2 years ago

Here are steward PRs for case-insensitive: https://github.com/search?q=case-insensitive+author%3Ascala-steward&type=Issues&ref=advsearch&l=&l=

Besides http4s, I recognize trace4cats.

armanbilge commented 2 years ago

@isomarcte I created a series/2.x branch and re-targeted https://github.com/typelevel/case-insensitive/pull/232 to there.

I think some form of source-breaking changes are inevitable (at least in the form of deprecations) since in https://github.com/typelevel/case-insensitive/pull/232#issuecomment-1031586847 we realized many of the exsting APIs are unsound.

Personally I think the best strategy to is keep pushing for http4s 1.x, with better CI strings, better URIs, better entities, etc.