Open alang9 opened 7 years ago
Why do you think so? It for sure doesn't hold if you have structural equality, but it should hold "statically".
No, you can break associativity in pretty dramatic ways:
*Data.TDigest Data.Monoid> median (tdigest [1..500] <> (tdigest [501..1000] <> tdigest [1001..1500]) :: TDigest 2)
Just 739.1542067899707
*Data.TDigest Data.Monoid> median ((tdigest [1..500] <> tdigest [501..1000]) <> tdigest [1001..1500] :: TDigest 2)
Just 653.3709019894935
The 2 is quite a small parameter: the t-digest is very imprecise whatever you do. Also, monotonic series are bad input for a t-digest.
> median (tdigest [1..500] :: TDigest 2)
Just 239.15420678997074
TL;DR: the linear (monotonic) cases and small compression parameters are bad. One way to mitigate this is to experiment with the compression parameter:
*Data.TDigest Data.Semigroup> median (tdigest [1..500] <> tdigest [501..1000] <> tdigest [1001..1500] :: TDigest 10)
Just 813.5614691299744
*Data.TDigest Data.Semigroup> median ((tdigest [1..500] <> tdigest [501..1000]) <> tdigest [1001..1500] :: TDigest 10)
Just 731.0510477190455
*Data.TDigest Data.Semigroup> tdigest' = forceCompress . tdigest
*Data.TDigest Data.Semigroup> median (tdigest' [1..500] <> tdigest' [501..1000] <> tdigest' [1001..1500] :: TDigest 10)
Just 745.129197936653
*Data.TDigest Data.Semigroup> median ((tdigest' [1..500] <> tdigest' [501..1000]) <> tdigest' [1001..1500] :: TDigest 10)
Just 745.1291979366533
Right, I'm aware that small compression parameters and monotonic inputs are edge cases for the algorithm; that's why I'm testing with them. My point is that the associativity law is broken not just structurally, but also observationally.
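To make the structural/observational distinction concrete, here is a minimal toy semigroup (plain Haskell, nothing from the tdigest package): its two associations build different trees, so structural associativity fails, yet every observation through fringe agrees. The sessions above show TDigest fails the stronger, observational form too.

```haskell
-- Toy: a Semigroup that is non-associative structurally but associative
-- observationally. (a <> b) <> c and a <> (b <> c) are different trees,
-- yet flattening them gives the same list.
data Tree a = Leaf a | Node (Tree a) (Tree a)
  deriving (Show, Eq)

instance Semigroup (Tree a) where
  (<>) = Node

-- the "observation"
fringe :: Tree a -> [a]
fringe (Leaf x)   = [x]
fringe (Node l r) = fringe l ++ fringe r
```

For TDigest even the observation (median) differs between the two associations, which is what the sessions above demonstrate.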
It might make sense to do forceCompress in <>. I'm not 100% sure it's always what you want though (it has a non-negligible performance cost). I'll add a newtype that does (<>) `on` forceCompress.
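As a sketch of why canonicalising in <> helps, here is a toy model (plain lists standing in for digests, sorting standing in for forceCompress — none of these are the real tdigest types): the raw merge is order-sensitive, but canonicalising the operands and the result makes both associations structurally equal. The analogy is imperfect: sorting is lossless, whereas tdigest's compression is lossy, so for the real type this trick can only narrow the gap, not close it.

```haskell
import Data.Function (on)
import Data.List (sort)

-- Toy stand-ins (not the tdigest API):
type Samples = [Double]

-- order-sensitive merge: interleaves the two sides
merge :: Samples -> Samples -> Samples
merge (x:xs) ys = x : merge ys xs
merge []     ys = ys

-- "forceCompress": canonicalise by sorting
compress :: Samples -> Samples
compress = sort

-- the proposed wrapper: canonicalise operands (and the result) in <>
newtype Canonical = Canonical Samples deriving (Show, Eq)

instance Semigroup Canonical where
  Canonical x <> Canonical y =
    Canonical (compress ((merge `on` compress) x y))
```

Here bare merge gives structurally different results depending on association, while the Canonical wrapper always produces the sorted multiset union.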
Unfortunately, you cannot make a t-digest such that you cannot observe the difference when inserting items in a different order:
*Data.TDigest Data.Semigroup> median (tdigest ([1..5000]) :: TDigest 5)
Just 2500.001251754201
*Data.TDigest Data.Semigroup> median (tdigest ([1,3..4999] ++ [2,4..5000]) :: TDigest 5)
Just 2500.3350037978526
And as there is no rigorous analysis of the algorithm, I cannot be 100% sure how big an error is acceptable, especially as the pure algorithm cannot do the random shuffling the paper advises.
> Unfortunately, you cannot make a t-digest such that you cannot observe the difference when inserting items in a different order
Yes, that's my point, and I think that's fine. I just don't think it makes a valid Semigroup instance.
See #8
Even if you do forceCompress in <>, it still doesn't make it satisfy associativity, right?
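A toy analogy (hypothetical types, nothing from tdigest) shows why no normalisation inside <> can repair a merge that is itself lossy: the unweighted running average forgets how many samples each side summarised, and that information cannot be recovered afterwards. Keeping the sufficient statistics (sum and count) instead gives a lawful Semigroup with the same final observation.

```haskell
-- Lossy merge: forgets the counts, so associativity is unrepairable.
newtype Avg = Avg Double deriving (Show, Eq)

instance Semigroup Avg where
  Avg x <> Avg y = Avg ((x + y) / 2)  -- forgets how many samples each side held

-- Lawful alternative: carry sum and count, observe the mean at the end.
data Mean = Mean Double Double deriving (Show, Eq)  -- total, count

instance Semigroup Mean where
  Mean s n <> Mean t m = Mean (s + t) (n + m)

mean :: Mean -> Double
mean (Mean s n) = s / n
```

TDigest sits in between: its compression discards some per-centroid history, which is why compressing harder inside <> shrinks the discrepancy without eliminating it.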
I won't argue about this. Yes, you can "see" the difference, but you shouldn't treat the values as exact.
You can see similar differences with toList on a HashMap k v (insertion order matters), and with Gen from QuickCheck (return x >>= f = f x doesn't hold observably):
*Test.QuickCheck Test.QuickCheck.Random Test.QuickCheck.Gen> unGen (arbitrary >>= \x -> fmap (+ x) arbitrary) (mkQCGen 42) 100 :: Int
12
*Test.QuickCheck Test.QuickCheck.Random Test.QuickCheck.Gen> unGen ((\x -> fmap (+ x) arbitrary) 1) (mkQCGen 42) 100 :: Int
88
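The Gen behaviour can be reproduced with a tiny stand-in (a hypothetical G type, not the real QuickCheck internals): >>= splits the seed, so pure x >>= f and f x run f under different seeds, breaking the left-identity law observationally even though both sides are "the same" generator morally.

```haskell
-- Toy seeded generator in the spirit of QuickCheck's Gen.
newtype G a = G { runG :: Int -> a }

-- arbitrary deterministic seed split (an assumption for illustration)
split :: Int -> (Int, Int)
split s = (2 * s + 1, 3 * s + 7)

instance Functor G where
  fmap h (G g) = G (h . g)

instance Applicative G where
  pure x = G (const x)
  G f <*> G g = G (\s -> let (s1, s2) = split s in f s1 (g s2))

instance Monad G where
  -- the bound continuation runs under half of the split seed, not the
  -- original seed, so  pure x >>= f  and  f x  observe different seeds
  G g >>= k = G (\s -> let (s1, s2) = split s in runG (k (g s1)) s2)
```

Running runG (pure 0 >>= \_ -> G id) and runG (G id) on the same seed gives different results, exactly as in the unGen session above.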
I'll have to reimplement <>, as it seems having forceCompress in tdigest isn't a good idea.
I don't think the Semigroup (TDigest comp) instance follows the Semigroup laws, let alone the Monoid laws.