serokell / universum

:milky_way: Prelude written in @Serokell
MIT License
176 stars 28 forks source link

Add rewrite rules for text conversion #213

Closed Martoon-00 closed 5 years ago

Martoon-00 commented 5 years ago

Added rewrite rule to eliminate toString . toText conversions.

Benchmarks without rewrite rule:

benchmarked toText/toString
time                 1.013 ms   (875.1 μs .. 1.148 ms)
                     0.819 R²   (0.691 R² .. 0.904 R²)
mean                 1.621 ms   (1.268 ms .. 2.619 ms)
std dev              2.134 ms   (331.2 μs .. 4.071 ms)
variance introduced by outliers: 97% (severely inflated)

And with rewrite rule:

benchmarked toText/toString
time                 186.1 μs   (183.3 μs .. 188.4 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 189.4 μs   (187.7 μs .. 191.8 μs)
std dev              7.508 μs   (5.823 μs .. 10.36 μs)
variance introduced by outliers: 20% (moderately inflated)

Description

Related issues(s)

Fixed #212.

✓ Checklist for your Pull Request

Ideally a PR has all of the checkmarks set.

If something in this list is irrelevant to your PR, you should still set this checkmark indicating that you are sure it is dealt with (be that by irrelevance).

Related changes (conditional)

Stylistic guide (mandatory)

int-index commented 5 years ago

I tried to figure out why text couldn't do it by default.

Your rule seems equivalent to T.unpack (T.pack a) = a. Now, if we look at unpack and pack they are defined as

unpack = S.unstreamList . stream
{-# INLINE [1] unpack #-}

pack = unstream . S.map safe . S.streamList
{-# INLINE [1] pack #-}

After they get inlined, the rule seems to be

(S.unstreamList . stream) ((unstream . S.map safe . S.streamList) a)

If we also inline function composition, we get

S.unstreamList (stream (unstream (S.map safe (S.streamList a))))

stream and unstream surely cancel out via this rule:

{-# RULES "STREAM stream/unstream fusion" forall s. stream (unstream s) = s #-}

So we are left with

S.unstreamList (S.map safe (S.streamList a))

Now, what's this safe function? Turns out it's defined as

safe :: Char -> Char
safe c
    | ord c .&. 0x1ff800 /= 0xd800 = c
    | otherwise                    = '\xfffd'

Aha, so it's mapping some codepoints to '\xfffd'! There's a comment on top of it to explain this:

-- UTF-16 surrogate code points are not included in the set of Unicode
-- scalar values, but are unfortunately admitted as valid 'Char'
-- values by Haskell.  They cannot be represented in a 'Text'.  This
-- function remaps those code points to the Unicode replacement
-- character (U+FFFD, \'�\'), and leaves other code points
-- unchanged.

This logic is lost with your rewrite rule. Not a huge loss, but it does mean that your rewrite rule isn't meaning preserving. I bet you could write this rewrite rule instead:

{-# RULES "pack/unpack" forall s. S.unstreamList (S.map safe (S.streamList a)) = a #-}
Martoon-00 commented 5 years ago

Magnificent!

I tried to delve into sources, but stopped at not finding rules for pack/unpack :unamused:

The nice news is that with your rule I get the same results as with my incorrect approach, at least when applied to string literals. Pushed the fixed version.

int-index commented 5 years ago

My rule does not preserve semantics either, but it should play nicer with other rewrite rules from text.

gromakovsky commented 5 years ago

https://i.imgur.com/D1ZLGiG.png

Btw, Closes #212 should go to Related issues and description should go to Description :)

I had to add stringToText and textToString functions in order to specify inlining pragma as mentioned

I don't see these functions, did you delete them?

gromakovsky commented 5 years ago
/home/travis/build/serokell/universum/src/Universum/String/Conversion.hs:6:16:
    unknown flag in  {-# OPTIONS_GHC #-} pragma: -Wno-orphans

Sad story.

Martoon-00 commented 5 years ago

I don't see these functions, did you delete them?

Yes, forgot to update commit/issue description.

Martoon-00 commented 5 years ago

I'm trying to add a test on a character which is usually replaced by that safe, and apparently, the new rewrite rule doesn't fire there. It indeed fires in benchmarks, and if I replace the rule with forall s. T.unpack (T.pack s) = s, then it also works in tests.

My guess is that, since unstream (stream s) = s rule is applied at the last phase of the simplifier, probably our rule simply does not fire. I'd like to check for certain with -ddump-simpl/-ddump-rule-rewrites/--dump-inlinings, but their output looks meaningless to me (like, if there were no inlinings or rewrites at all except for inlining toString and toText sometimes).

@int-index do you have any thoughts on this?

int-index commented 5 years ago

It indeed fires in benchmarks, and if I replace the rule with forall s. T.unpack (T.pack s) = s, then it also works in tests.

I think my rule has a chance to fire after T.unpack and T.pack are inlined (and they are marked as {-# INLINE #-}), but the rule above has a chance to fire before they are inlined. Might as well keep both rules, or figure out what's going on with phases.

Martoon-00 commented 5 years ago

Kaef, it works.

Although I have a feeling that our tests are pretty unstable. :roll_eyes:

gromakovsky commented 5 years ago

@Martoon-00 btw, did you consider toText . toString? It's probably less useful with our usage patterns, but anyway, doesn't it make sense to add it as well? Or does it work out of box?

Martoon-00 commented 5 years ago

@Martoon-00 btw, did you consider toText . toString? It's probably less useful with our usage patterns, but anyway, doesn't it make sense to add it as well? Or does it work out of box?

Okay, why not add it.

gromakovsky commented 5 years ago

Ok, cool. Just in case: did you run benchmarks and verify that toText . toString rules actually improves anything?

gromakovsky commented 5 years ago

I assigned it fix type because I think that probability of breaking something (due to slightly changed semantics) is pretty much negligible.

Martoon-00 commented 5 years ago

Ok, cool. Just in case: did you run benchmarks and verify that toText . toString rules actually improves anything?

Yes, I got something like 0.2 ms with the rule against ~4 ms without it.

gromakovsky commented 5 years ago

It seems we should do #214 first.

gromakovsky commented 5 years ago

Probably #216 should let us merge this PR.

gromakovsky commented 5 years ago

Rebase on master and CI will hopefully pass here.

gromakovsky commented 5 years ago

The base branch requires all commits to be signed.

:trollface:

Martoon-00 commented 5 years ago

:tada:

Martoon-00 commented 1 year ago

Something happened that I completely cannot explain at the moment.

@int-index's observation shows that pack = unstream . S.map safe . S.streamList. I checked this statement back then, I couldn't miss checking it as in the such case I likely wouldn't be able to import those map and other methods.

And now when I open text's sources for that 1.2.3.1 version, instead I see pack = unstream . S.streamList . L.map safe. In fact, all versions of text in the version bounds added by this PR (1.0.0.0-1.2.3.1) seem to have pack implemented not how we saw it.

And like, git blame says the pack implementation to be the second one forever.

That all feels really weird.

The only explanation I see is that all sufficiently old text versions on Hackage got new revisions around 2020, and those revisions overwrite the internals. But like, it should be impossible to change a revision when there is no 100% guarantee it will not change the interface? And for such a change of pack that we witness, it would be nearly impossible to prove with the necessary assurance level. So something wrong seems to be going on under the surface :thinking:

int-index commented 1 year ago

@Martoon-00 You seem to be looking at Data.Text.Lazy instead of Data.Text.

Martoon-00 commented 1 year ago

Meeh, right :facepalm:. Thanks for saving me from this madness.