Closed Martoon-00 closed 5 years ago
I tried to figure out why text
couldn't do it by default.
Your rule seems equivalent to T.unpack (T.pack a) = a
. Now, if we look at unpack
and pack
they are defined as
unpack = S.unstreamList . stream
{-# INLINE [1] unpack #-}
pack = unstream . S.map safe . S.streamList
{-# INLINE [1] pack #-}
After they get inlined, the rule seems to be
(S.unstreamList . stream) ((unstream . S.map safe . S.streamList) a)
If we also inline function composition, we get
S.unstreamList (stream (unstream (S.map safe (S.streamList a))))
stream
and unstream
surely cancel out via this rule:
{-# RULES "STREAM stream/unstream fusion" forall s. stream (unstream s) = s #-}
So we are left with
S.unstreamList (S.map safe (S.streamList a))
Now, what's this safe
function? Turns out it's defined as
safe :: Char -> Char
safe c
| ord c .&. 0x1ff800 /= 0xd800 = c
| otherwise = '\xfffd'
Aha, so it's mapping some codepoints to '\xfffd'
! There's a comment on top of it to explain this:
-- UTF-16 surrogate code points are not included in the set of Unicode
-- scalar values, but are unfortunately admitted as valid 'Char'
-- values by Haskell. They cannot be represented in a 'Text'. This
-- function remaps those code points to the Unicode replacement
-- character (U+FFFD, \'�\'), and leaves other code points
-- unchanged.
This logic is lost with your rewrite rule. Not a huge loss, but it does mean that your rewrite rule isn't meaning preserving. I bet you could write this rewrite rule instead:
{-# RULES "pack/unpack" forall s. S.unstreamList (S.map safe (S.streamList a)) = a #-}
Magnificent!
I tried to delve into sources, but stopped at not finding rules for pack/unpack
:unamused:
The nice news is that with your rule I get the same results as with my incorrect approach, at least when applied to string literals. Pushed the fixed version.
My rule does not preserve semantics either, but it should play nicer with other rewrite rules from text
.
https://i.imgur.com/D1ZLGiG.png
Btw, Closes #212
should go to Related issues
and description should go to Description
:)
I had to add
stringToText
andtextToString
functions in order to specify inlining pragma as mentioned
I don't see these functions, did you delete them?
/home/travis/build/serokell/universum/src/Universum/String/Conversion.hs:6:16:
unknown flag in {-# OPTIONS_GHC #-} pragma: -Wno-orphans
Sad story.
I don't see these functions, did you delete them?
Yes, forgot to update commit/issue description.
I'm trying to add a test on a character which is usually replaced by that safe
, and apparently, the new rewrite rule doesn't fire there.
It indeed fires in benchmarks, and if I replace the rule with forall s. T.unpack (T.pack s) = s
, then it also works in tests.
My guess is that, since unstream (stream s) = s
rule is applied at the last phase of the simplifier, probably our rule simply does not fire. I'd like to check for certain with -ddump-simpl
/-ddump-rule-rewrites
/--dump-inlinings
, but their output looks meaningless to me (like, if there were no inlinings or rewrites at all except for inlining toString
and toText
sometimes).
@int-index do you have any thoughts on this?
It indeed fires in benchmarks, and if I replace the rule with
forall s. T.unpack (T.pack s) = s
, then it also works in tests.
I think my rule has a chance to fire after T.unpack
and T.pack
are inlined (and they are marked as {-# INLINE #-}
), but the rule above has a chance to fire before they are inlined. Might as well keep both rules, or figure out what's going on with phases.
Kaef, it works.
Although I have a feeling that our tests are pretty unstable. :roll_eyes:
@Martoon-00 btw, did you consider toText . toString
? It's probably less useful with our usage patterns, but anyway, doesn't it make sense to add it as well? Or does it work out of box?
@Martoon-00 btw, did you consider
toText . toString
? It's probably less useful with our usage patterns, but anyway, doesn't it make sense to add it as well? Or does it work out of box?
Okay, why not add it.
Ok, cool. Just in case: did you run benchmarks and verify that toText . toString
rules actually improves anything?
I assigned it fix
type because I think that probability of breaking something (due to slightly changed semantics) is pretty much negligible.
Ok, cool. Just in case: did you run benchmarks and verify that
toText . toString
rules actually improves anything?
Yes, I got something like 0.2 ms with the rule against ~4 ms without it.
It seems we should do #214 first.
Probably #216 should let us merge this PR.
Rebase on master
and CI will hopefully pass here.
The base branch requires all commits to be signed.
:trollface:
:tada:
Something happened that I completely cannot explain at the moment.
@int-index's observation shows that pack = unstream . S.map safe . S.streamList
. I checked this statement back then, I couldn't miss checking it as in the such case I likely wouldn't be able to import those map
and other methods.
And now when I open text
's sources for that 1.2.3.1
version, instead I see pack = unstream . S.streamList . L.map safe
. In fact, all versions of text
in the version bounds added by this PR (1.0.0.0
-1.2.3.1
) seem to have pack
implemented not how we saw it.
And like, git blame
says the pack
implementation to be the second one forever.
That all feels really weird.
The only explanation I see is that all sufficiently old text
versions on Hackage got new revisions around 2020, and those revisions overwrite the internals. But like, it should be impossible to change a revision when there is no 100% guarantee it will not change the interface? And for such a change of pack
that we witness, it would be nearly impossible to prove with the necessary assurance level. So something wrong seems to be going on under the surface :thinking:
@Martoon-00 You seem to be looking at Data.Text.Lazy
instead of Data.Text
.
Meeh, right :facepalm:. Thanks for saving me from this madness.
Added rewrite rule to eliminate
toString . toText
conversions.Benchmarks without rewrite rule:
And with rewrite rule:
Description
Related issues(s)
Fixed #212.
✓ Checklist for your Pull Request
Ideally a PR has all of the checkmarks set.
If something in this list is irrelevant to your PR, you should still set this checkmark indicating that you are sure it is dealt with (be that by irrelevance).
Related changes (conditional)
Tests
Documentation
I checked whether I should update the docs and did so if necessary:
Record your changes
Stylistic guide (mandatory)
[x] My commit history is clean (only contains changes relating to my issue/pull request and no reverted-my-earlier-commit changes) and commit messages start with identifiers of related issues in square brackets.
Example:
[#42] Short commit description
If necessary both of these can be achieved even after the commits have been made/pushed using rebase and squash.