haskell-hvr / missingh

Utility library [Haskell]
https://hackage.haskell.org/package/MissingH

split is inefficient; can be sped up over 2x for single character delimiter #39

Open joeyh opened 7 years ago

joeyh commented 7 years ago

Data.List.Utils.split, if you profile it, does a lot of allocation.

With a 600 MB "data" file containing short words, I profiled this program:

import Data.List.Utils
main = do
        s <- readFile "data"
        let l = split " " s
        print (length l)

Tue Jan 31 19:49 2017 Time and Allocation Profiling Report  (Final)

       foo +RTS -p -RTS

    total time  =      117.11 secs   (117106 ticks @ 1000 us, 1 processor)
    total alloc = 139,419,374,656 bytes  (excludes profiling overheads)

COST CENTRE MODULE           %time %alloc

spanList    Data.List.Utils   46.2   59.3
startswith  Data.List.Utils   28.4   14.1
MAIN        MAIN              14.7   17.8
split       Data.List.Utils    8.3    6.4
breakList   Data.List.Utils    2.5    2.4

Compare with this program using a splitc specialized to a single character delimiter:

main = do
    s <- readFile "data"
    let l = splitc ' ' s
    print (length l)

splitc :: Char -> String -> [String]
splitc _ [] = []
splitc c s = case break (== c) s of
    (i, _:rest) -> i : splitc c rest
    (i, [])     -> i : []

    Tue Jan 31 19:54 2017 Time and Allocation Profiling Report  (Final)

       foo +RTS -p -RTS

    total time  =       54.44 secs   (54435 ticks @ 1000 us, 1 processor)
    total alloc = 99,505,766,960 bytes  (excludes profiling overheads)

So over twice as fast!

A simple improvement then would be to make split use splitc when the delimiter is a single character. But, it might be better to investigate why the helper functions are doing so much allocation.
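As a sketch of what that dispatch could look like (hypothetical names; `splitGeneral` below is a minimal stand-in with MissingH-like semantics, not the library's actual `split`). One wrinkle: `splitc` as written above drops the trailing empty field that `split` produces when the input ends with the delimiter (`splitc ' ' "a "` gives `["a"]`, while `split " " "a "` gives `["a",""]`), so the single-element fast path below is adjusted to keep the two paths consistent.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical dispatch: take a single-element fast path when the
-- delimiter has length 1, otherwise fall back to the general code.
split' :: Eq a => [a] -> [a] -> [[a]]
split' [c] s = splitc1 c s
split' d   s = splitGeneral d s

-- Single-element fast path, adjusted (unlike the splitc above) to
-- preserve a trailing empty field.
splitc1 :: Eq a => a -> [a] -> [[a]]
splitc1 _ [] = []
splitc1 c s  = go s
  where
    go str = case break (== c) str of
        (i, _:rest) -> i : go rest
        (i, [])     -> [i]

-- Stand-in multi-element split for the sketch only: scan forward,
-- consing onto a reversed accumulator for each field.
splitGeneral :: Eq a => [a] -> [a] -> [[a]]
splitGeneral _ [] = []
splitGeneral d s = go [] s
  where
    go acc [] = [reverse acc]
    go acc str
        | d `isPrefixOf` str = reverse acc : go [] (drop (length d) str)
        | otherwise          = go (head str : acc) (tail str)
```

The point is only that the dispatch is a one-line pattern match; the two branches must agree on edge cases (trailing delimiter, empty input) before `split` can use it automatically.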

joeyh commented 7 years ago

(Actually it's more like 3x as fast, since the readFile of 500 MB is responsible for 20 seconds of the runtime.)

joeyh commented 7 years ago

Btw, http://hackage.haskell.org/package/split's splitOn is similarly slow to MissingH's split.

jgoerzen commented 7 years ago

Wow, good catch. It would be easy enough to add an optimized splitc for a single-character delimiter, and have split use it automatically. Does that splitOn perform better than this split?

joeyh commented 7 years ago

split's splitOn does not perform identically, but it's in a very close ballpark to the current MissingH split.

jgoerzen commented 7 years ago

From way back in the MissingH history, dating to 2004, here is another definition of split. I have not verified its correctness, but I wonder how it performs in comparison?

split :: Eq a => [a] -> [a] -> [[a]]
split delim str =
    let splitworker :: Eq a => [a] -> [a] -> [a] -> [[a]]
        splitworker delim [] [] = []
        splitworker delim [] accum = [accum]
        splitworker delim str accum =
            if delim == str
               then accum : [] : []
               else if startswith delim str
                       then accum : splitworker delim (drop (length delim) str) []
                       else splitworker delim (tail str) (accum ++ [head str])
    in splitworker delim str []
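As a quick, non-exhaustive spot-check of that 2004 definition (a sketch; `startswith` is taken to be `Data.List.isPrefixOf`, which matches MissingH's alias), it appears to handle the usual edge cases: leading, trailing, and consecutive delimiters, and the empty input.

```haskell
import Data.List (isPrefixOf)

-- The 2004 definition above, with startswith = isPrefixOf and delim
-- captured by the outer binding instead of threaded through.
split2004 :: Eq a => [a] -> [a] -> [[a]]
split2004 delim str0 = worker str0 []
  where
    worker [] []    = []
    worker [] accum = [accum]
    worker str accum
        | delim == str           = accum : [] : []
        | delim `isPrefixOf` str = accum : worker (drop (length delim) str) []
        | otherwise              = worker (tail str) (accum ++ [head str])
```

This is only a handful of cases, of course, not a proof of equivalence with the current split.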


joeyh commented 7 years ago

This time I used this data file:

perl -e 'for (1..50000000) { print $_." "}' > data

The current MissingH split profile with that:

Wed Feb 15 12:14 2017 Time and Allocation Profiling Report  (Final)

   foo +RTS -p -RTS

total time  =       80.99 secs   (80991 ticks @ 1000 us, 1 processor)
total alloc = 99,733,231,232 bytes  (excludes profiling overheads)

COST CENTRE MODULE           %time %alloc

spanList    Data.List.Utils   47.7   59.3
startswith  Data.List.Utils   27.0   14.1
MAIN        MAIN              14.7   17.8
split       Data.List.Utils    8.0    6.4
breakList   Data.List.Utils    2.6    2.4

Compare with the 2004 version:

Wed Feb 15 12:17 2017 Time and Allocation Profiling Report  (Final)

   foo +RTS -p -RTS

total time  =       52.74 secs   (52739 ticks @ 1000 us, 1 processor)
total alloc = 47,066,563,520 bytes  (excludes profiling overheads)

COST CENTRE MODULE           %time %alloc

MAIN        MAIN              93.1   70.2
startswith  Data.List.Utils    6.9   29.8

Compare with splitc:

Wed Feb 15 12:19 2017 Time and Allocation Profiling Report  (Final)

   foo +RTS -p -RTS

total time  =       38.25 secs   (38251 ticks @ 1000 us, 1 processor)
total alloc = 71,155,452,896 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0  100.0

splitc still wins by a reasonable margin; interestingly, it does more allocation than the 2004 version. The 2004 split is a nice improvement, assuming it's correct.

fyrchik commented 7 years ago

Hello. I think the 2004 split is much cuter, and it could be improved further, to something like this, for example:

split :: Eq a => [a] -> [a] -> [[a]]
split _ [] = []
split [x] str = splitc x str
split delim str =
    let splitworker :: Eq a => Int -> [a] -> [a] -> [a] -> [[a]]
        splitworker _ delim [] [] = []
        splitworker _ delim [] accum = [reverse accum]
        splitworker l delim str accum =
            let (a, t) = prefixLen 0 delim str
            -- accum is built in reverse with (:), so it must be
            -- reversed each time it is emitted as a field
            in if a == l && null t then
               reverse accum : [] : []
            else if a == l then
               reverse accum : splitworker l delim t []
            else splitworker l delim (tail str) (head str : accum)
        prefixLen :: Eq a => Int -> [a] -> [a] -> (Int, [a])
        prefixLen acc (x:xs) l@(y:ys) = if x == y then prefixLen (acc+1) xs ys else (acc, l)
        prefixLen acc _ ys = (acc, ys)
    in splitworker (length delim) delim str []

(Testing for equality and testing with isPrefixOf have much in common, and accumulating with (++) is quadratic, I believe.)
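That remark about (++) can be made concrete: `accum ++ [x]` copies the whole accumulator on every step, so building one field of length n does O(n²) work, while consing onto the front and reversing once at the end is O(n). A minimal illustration (hypothetical helper names, not code from this thread):

```haskell
-- Both functions are the identity on lists; only their cost differs.

appendStyle :: [a] -> [a]
appendStyle = foldl (\acc x -> acc ++ [x]) []  -- re-copies acc each step: O(n^2)

consStyle :: [a] -> [a]
consStyle = reverse . foldl (flip (:)) []      -- cons each step, one reverse: O(n)
```

This is exactly the change from `accum ++ [head str]` in the 2004 version to `head str : accum` plus a `reverse` in the version above.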

Actually, I also tried to specialize split to a 2-character delimiter and didn't see significant improvements (unlike in the splitc case).

On my system the 2004 version runs in ~41 seconds, the version in this post in ~29 seconds, and splitc in ~23 seconds, for the data generated above and a single-character delimiter.

jgoerzen commented 7 years ago

Interesting. I wonder about multi-character delimiters? (Those are one of the reasons for this split.) If it's no worse than the other implementation for those, then this is a slam dunk.

fyrchik commented 7 years ago

I generated a test file like this (be careful: the version with 1000000 eats 3 GB of memory when used with a non-existent delimiter):

perl -e 'for (1..1000000) { print $_."abcdefghij"}' > many10CharDelims

I'll appreciate other tests on which the default version could be the fastest.

And I tested all 3 versions (default, 2004, 2004Opt): https://github.com/fyrchik/randomthings (pushed all the stuff here, with the test script and results).

For the delimiters 123 and abcdefghij the default version is the slowest, and 2004-with-optimizations the fastest. I also tried a non-existent delimiter, and there the 2004 version was the slowest, but it still consumed less memory than the default version. 2004-with-optimizations outperformed both other versions in the tests with all 3 delimiters, in terms of both time and memory (though it sometimes loses to the non-optimized 2004).

It would also be interesting to see whether spanList could be made more efficient.