phadej / tdigest

On-line accumulation of rank-based statistics such as quantiles and trimmed means
30 stars 7 forks source link

SemiGroup laws #7

Open alang9 opened 7 years ago

alang9 commented 7 years ago

I don't think the Semigroup (TDigest comp) instance follows Semigroup laws, let alone Monoid laws

phadej commented 7 years ago

Why you think so. It for sure doesn't hold if you have structural equality, but it should hold "statically".

alang9 commented 7 years ago

No you can break associativity in pretty dramatic ways:

*Data.TDigest Data.Monoid> median (tdigest [1..500] <> (tdigest [501..1000] <> tdigest [1001..1500]) :: TDigest 2)
Just 739.1542067899707
*Data.TDigest Data.Monoid> median ((tdigest [1..500] <> tdigest [501..1000]) <> tdigest [1001..1500] :: TDigest 2)
Just 653.3709019894935
phadej commented 7 years ago

the 2 is quite small parameter: the t-digest is very unprecise whatever you do. Also monotonic series are bad for t-digest.

>  median (tdigest [1..500] :: TDigest 2) 
Just 239.15420678997074

TL;DR the linear cases and small configuration parameters are bad.

One way is to experiment with compression.

*Data.TDigest Data.Semigroup>  median (tdigest [1..500] <> tdigest [501..1000] <> tdigest [1001..1500] :: TDigest 10) 
Just 813.5614691299744
*Data.TDigest Data.Semigroup>  median ((tdigest [1..500] <> tdigest [501..1000]) <> tdigest [1001..1500] :: TDigest 10) 
Just 731.0510477190455
*Data.TDigest Data.Semigroup> tdigest' = forceCompress . tdigest
*Data.TDigest Data.Semigroup>  median (tdigest' [1..500] <> tdigest' [501..1000] <> tdigest' [1001..1500] :: TDigest 10) 
Just 745.129197936653
*Data.TDigest Data.Semigroup>  median ((tdigest' [1..500] <> tdigest' [501..1000]) <> tdigest' [1001..1500] :: TDigest 10) 
Just 745.1291979366533
alang9 commented 7 years ago

Right, I'm aware that small configurations parameters and monotonic cases are edge cases for the algorithm, that's why I'm testing with those.

My point is that the associativity law is broken not just structurally, but also observationally

phadej commented 7 years ago

It might make sense to do forceCompress in <>. I'm not 100% sure it's always what you want though (it has non-negligible performance cost). I'll add a newtype that does (<>) `on` forceCompress

Unfortunately you cannot make t-digest so that you cannot observe the difference in inserting stuff in different order:

*Data.TDigest Data.Semigroup>  median (tdigest ([1..5000]) :: TDigest 5)
Just 2500.001251754201
*Data.TDigest Data.Semigroup>  median (tdigest ([1,3..4999] ++ [2,4..5000]) :: TDigest 5)
Just 2500.3350037978526

and as there is no rigor analysis of the algorithm, I cannot be 100% sure how big errors are acceptable. Especially as the pure algorithm cannot do random shuffling as the paper advices.

alang9 commented 7 years ago

Unfortunately you cannot make t-digest so that you cannot observe the difference in inserting stuff in different order

Yes that's my point, and I think that's fine. I just don't think it makes a valid Semigroup instance

phadej commented 7 years ago

See #8

alang9 commented 7 years ago

Even if you do forceCompress in <>, it still doesn't make it satisfy associativity, right?

phadej commented 7 years ago

I won't argue about this. Yes, you can "see" the difference, but you shouldn't treat the values exactly.

You can see the differences with toList :: HashMap k v (insertion order matters), and Gen from QuickCheck (return x >>= f = f x):

*Test.QuickCheck Test.QuickCheck.Random Test.QuickCheck.Gen> unGen (arbitrary >>= \x -> fmap (+ x) arbitrary) (mkQCGen 42) 100 :: Int
12
*Test.QuickCheck Test.QuickCheck.Random Test.QuickCheck.Gen> unGen ((\x -> fmap (+ x) arbitrary) 1) (mkQCGen 42) 100 :: Int
88
phadej commented 7 years ago

I'll have to reimplement <>, as it seems having forceCompress in tdigest isn't good idea.