Data.Algorithm.Diff does not scale

After adding longer testcase to the testsuite for BNFC I was surprised to see the testsuite hanging. I then discovered that 13GB had been allocated (as much as my machine can offer). I traced the problem down to HTF and further to the diff algorithm used here, Data.Algorithm.Diff. It does not seem to scale to long strings with quite some differences.

However, when a test fails, I would be happy to see only the first couple of differences in long strings; if there are many differences, something fundamental must be wrong anyway.

I wonder whether assertEqual or the test runner could be configured to limit the number of differences shown and speed up the process.

However, I also have doubts in the quality of the library in use Data.Algorithm.Diff. It seems it uses reverse, making it impossible to only get the first n differences without computing all:

-- | A form of 'getDiff' with no 'Eq' constraint. Instead, an equality predicate
-- is taken as the first argument.
getDiffBy :: (a -> b -> Bool) -> [a] -> [b] -> [PolyDiff a b]
getDiffBy eq a b = markup a b . reverse $ lcs eq a b
    where markup ...

To get a feel for the memory consumption of this diff algorithm, here is a protocol:

$ ./Diff 1000 +RTS -s
Comparing strings of lengths 5888896 and 5887107
Containing 15761 differences
  83,067,632,840 bytes allocated in the heap
            4066 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     80176 colls,     0 par   13.886s  14.796s     0.0002s    0.0440s
  Gen  1        31 colls,     0 par   15.410s  17.389s     0.5609s    2.1239s

  MUT     time   18.025s  ( 18.951s elapsed)
  GC      time   29.297s  ( 32.184s elapsed)
  Total   time   47.322s  ( 51.144s elapsed)

That is already 4GB for strings of length 6M, and the memory consumption is linear!

I attach my benchmark program if you want to experiment further:

{-# OPTIONS_GHC -F -pgmF htfpp #-}
{-# LANGUAGE LambdaCase, ScopedTypeVariables #-}

import Data.List (groupBy)
import Data.Functor
import System.Environment

import Data.Algorithm.Diff (getDiff, PolyDiff(Both,First,Second))
import Test.Framework

main = do
  n :: Int <- getArgs <&> \case
    [s] -> read s
    []  -> 14
  let a = concat $ map show $ [1..n*n]
      b = concat $ map show $ concat $ map (\ i -> [n*i + 1  .. n*(i+1) - 1]) [1..n]
  putStrLn $ "Comparing strings of lengths " ++ show (length a) ++ " and " ++ show (length b)
  -- runTest $ assertEqual a b
  -- runTest $ assertEqual (a == b) True  -- without diff
  putStrLn $ "Containing (truncated to 100) "
    ++ show (length $ filter notSame $ getDiff a b) ++ " differences"  -- directly Diff

notSame = \case
  Both{} -> False
  _ -> True

-- Runs of @assertEqual a b@
---------------------------------------------------------------------------

-- $ ./Diff 30
-- Comparing strings of lengths 2592 and 2543
-- Total execution time: 8ms

-- 300/ms  if 2.500 chars

-- $ ./Diff 100
-- Comparing strings of lengths 38894 and 38808
-- Total execution time: 215ms
--   493,509,200 bytes allocated in the heap
--   23 MB total memory in use

-- 200/ms  if 40.000 chars  (0.2s)

-- $ ./Diff 300
-- Comparing strings of lengths 438894 and 438136
-- Total execution time: 2735ms
--   5,921,117,560 bytes allocated in the heap
--   269 MB total memory in use

-- 160/ms  if 440.000 chars  (2.7s)

-- $ ./Diff 500
-- Comparing strings of lengths 1388895 and 1387719
-- Total execution time: 9857ms
--   20,944,605,752 bytes allocated in the heap
--   867 MB total memory in use

-- 140/ms  if 1.400.000 chars  (10s)

-- $ ./Diff 1000
-- Comparing strings of lengths 5888896 and 5887107
-- Total execution time: 59925ms
-- Total execution time: 54541ms
--   98,255,990,936 bytes allocated in the heap
--   3888 MB total memory in use

-- 100/ms  if 6.000.000 chars  (60s)

-- $ ./Diff 1500
-- Comparing strings of lengths 14638896 and 14634738
-- Total execution time: 160551ms
--   248,975,720,128 bytes allocated in the heap
--   9892 MB total memory in use

-- 90/ms  if 15.000.000 chars  (160s)

-- $ ./Diff 2000
-- Comparing strings of lengths 26888896 and 26882552

-- <INTERRUPT>

-- Without diff (@assertEqual (a == b) True@)
---------------------------------------------------------------------------

-- $ ./Diff 1000 +RTS -s
-- Comparing strings of lengths 5888896 and 5887107

-- Total execution time: 0ms
--    1,587,810,664 bytes allocated in the heap
--              391 MB total memory in use

-- $ ./Diff 1500 +RTS -s
-- Comparing strings of lengths 14638896 and 14634738

-- Total execution time: 0ms
--    3,837,623,072 bytes allocated in the heap
--             1358 MB total memory in use (0 MB lost due to fragmentation)

-- $ ./Diff 2000 +RTS -s
-- Comparing strings of lengths 26888896 and 26882552

-- Total execution time: 0ms
--    6,987,537,384 bytes allocated in the heap
--             2630 MB total memory in use

-- Only Data.Algorithm.Diff
---------------------------------------------------------------------------

-- $ ./Diff 100 +RTS -s
-- Comparing strings of lengths 38894 and 38808
-- Containing 1066 differences
--      387,598,992 bytes allocated in the heap
--       72,168,480 bytes copied during GC
--       10,251,528 bytes maximum residency (7 sample(s))
--          185,080 bytes maximum slop
--               25 MB total memory in use (0 MB lost due to fragmentation)

--                                      Tot time (elapsed)  Avg pause  Max pause
--   Gen  0       367 colls,     0 par    0.058s   0.062s     0.0002s    0.0031s
--   Gen  1         7 colls,     0 par    0.026s   0.036s     0.0052s    0.0147s

--   INIT    time    0.000s  (  0.003s elapsed)
--   MUT     time    0.088s  (  0.094s elapsed)
--   GC      time    0.084s  (  0.098s elapsed)
--   EXIT    time    0.000s  (  0.006s elapsed)
--   Total   time    0.172s  (  0.202s elapsed)

--   %GC     time      48.8%  (48.7% elapsed)

-- $ ./Diff 1000 +RTS -s
-- Comparing strings of lengths 5888896 and 5887107
-- Containing 15761 differences
--   83,067,632,840 bytes allocated in the heap
--   33,259,227,112 bytes copied during GC
--    1,518,548,672 bytes maximum residency (31 sample(s))
--        8,911,168 bytes maximum slop
--             4066 MB total memory in use (0 MB lost due to fragmentation)

--                                      Tot time (elapsed)  Avg pause  Max pause
--   Gen  0     80176 colls,     0 par   13.886s  14.796s     0.0002s    0.0440s
--   Gen  1        31 colls,     0 par   15.410s  17.389s     0.5609s    2.1239s

--   INIT    time    0.000s  (  0.002s elapsed)
--   MUT     time   18.025s  ( 18.951s elapsed)
--   GC      time   29.297s  ( 32.184s elapsed)
--   EXIT    time    0.000s  (  0.006s elapsed)
--   Total   time   47.322s  ( 51.144s elapsed)

--   %GC     time      61.9%  (62.9% elapsed)

--   Alloc rate    4,608,394,976 bytes per MUT second

--   Productivity  38.1% of total user, 37.1% of total elapsed

skogsbaer / HTF

Data.Algorithm.Diff does not scale #85