Open isomarcte opened 2 years ago
👍 thanks for all your work on this. I think a series/2.x is the best path forward for the time being.
Since this is an issue about series/2.x in general, I'm thinking we should consider using a newtype for CIString
(at least on Scala 3) now that we are getting a typeclass-based Map
in https://github.com/typelevel/cats/pull/4185 which does not depend on universal equals/hashCode.
@armanbilge Do you just mean CIString
encoded as an opaque type or are you talking about wrapping CIString
again? And are you foreseeing us having hashCode
differ from Hash
?
Do you just mean
CIString
encoded as an opaque type
Yes, this. Which would mean that hashCode
would differ from Hash
.
Yeah, so in #232 that changes a bit. I changed the underlying hashCode/equals/compare to be based on a CanonicalFullCaseFoldedString
.
There is still some discussion to have there, but all CanonicalFullCaseFoldedString
values have the same representation (for the same input) and thus the same universal hashCode and equality. So CanonicalFullCaseFoldedString
could be an opaque type, though it wouldn't need to for hash/equal reasons. And under this encoding CIString
is effectively a specialized (String, CanonicalCaseFoldedString)
so making it opaque doesn't gain you as much.
FWIW, I'm starting to be of the opinion we should be using CanonicalCaseFoldedString
for most use cases where we use CIString
. That is, I don't know if we need to hold a reference to the original representation as the normative case.
I hope that makes some amount of sense. Rereading my comment makes me feel like I didn't explain that well.
This is what I'm talking about: https://github.com/typelevel/case-insensitive/pull/232/files#diff-90a144a03717314cdaa1151276da76557be0f1e78c890580b57c9d95406f95f6R80
@armanbilge can you make series/1.x.x
and series/2.x.x
branches? I don't have access apparently.
Neither do I :)
@isomarcte to clarify, my proposal is we replace:
final class CIString private (override val toString: String)
with
opaque type CIString = String
If we do this, then the universal hashCode and equals for CIString
will be those of String
itself, because the opaque type
is erased to String
. This is why the instances will be incompatible with them.
Wait, sorry, I wasn't looking carefully. I see, CIString
can no longer be represented as an opaque type, sorry! 🤔
Edit: wait, nope, I'm confused by all the various implementations 😬
Yeah, there are a lot of types in the Unicode standard. For our discussion here we just need to focus on CanonicalFullCaseFoldedString
and CIString
.
The new encoding of CIString
on my branch is similar to this.
final case class CIString private (original: String, caselessString: CanonicalFullCaseFoldedString)
To make a cassless string, in general, you need to case fold it, then normalize it (this is simplified, there can be more steps depending on the type of caseless string).
Thus caselessString
is the result of normalize(caseFold(original))
.
The reason I changed the representation of CIString
is because what we are doing in 1.x.x is incorrect for caseless strings. To get to a caseless string is more complicated than just toUpper.toLower
on each char, so it seemed reasonable to hold a reference to the caseless string in CIString
, rather than computing it each time we needed to do an operation on the class. That also allows us to use the CanonicalFullCaseFoldedString
as the source of hashCode
and equals
.
There are other ways we could do this. We could have the caselessString
be a def,
opaque type CIString = String
extension (inline ciString: CIString) {
inline def caselessString: CanonicalFullCaseFoldedString = normalize(caseFold(ciString))
}
Then all the methods on CIString
would need to convert to a CanonicalFullCaseFoldedString
on each operation. We could cache the hash via a null init pattern as is currently done on main
, but we couldn't easily do that for equality checks and the other methods.
All of these gymnastics are the result of us keeping around a reference to the original input string in CIString
. That's why I'm leaning on making CanonicalFullCaseFoldedString
(perhaps with a more succinct name) the default type we use, rather than CIString
. CanonicalFullCaseFoldedString
can be trivial represented as an opaque type, though we don't need to for hash/eq purposes.
Thanks for the detailed explanation, that makes sense to me. Regarding whether to cache or not, it probably makes sense? Not sure if there are situations where lots of CIString
s are allocated but no ops are called on them.
All of these gymnastics are the result of us keeping around a reference to the original input string in
CIString
.
The FAQ suggests there are Reasons™, but I'm not sure what they are :) https://typelevel.org/case-insensitive/#why-pay-for-a-wrapper-when-you-can-fold-case-before-storing
I added @armanbilge and @isomarcte to the team, so anyone should be able to create the necessary branches.
Not keeping the original would be a non-semantic, but highly visible, change to HTTP rendering. For example, all field names would be rendered lower case.
I don't know how much uptake CIString
has seen outside http4s to evaluate how bad a binary breakage would be, but we know what immense pain the last source breakage was. I would very much prefer to avoid source-breaking changes to http4s in this.
Here are steward PRs for case-insensitive: https://github.com/search?q=case-insensitive+author%3Ascala-steward&type=Issues&ref=advsearch&l=&l=
Besides http4s, I recognize trace4cats.
@isomarcte I created a series/2.x
branch and re-targeted https://github.com/typelevel/case-insensitive/pull/232 to there.
I think some form of source-breaking changes are inevitable (at least in the form of deprecations) since in https://github.com/typelevel/case-insensitive/pull/232#issuecomment-1031586847 we realized many of the exsting APIs are unsound.
Personally I think the best strategy to is keep pushing for http4s 1.x, with better CI strings, better URIs, better entities, etc.
I'm considering targeting https://github.com/typelevel/case-insensitive/pull/232 to a
series/2.x.x
branch.@rossabaker @armanbilge any objections to me creating that branch and moving the code there?