Open joeyh opened 7 years ago
(Actually, it's more like 3x as fast, since the readFile of 500 MB is responsible for 20 seconds of the runtime.)
Btw, http://hackage.haskell.org/package/split's splitOn is similarly slow to MissingH's split.
Wow, good catch. It would be easy enough to add an optimized splitc for a single-character delimiter, and have split use it automatically. Does that splitOn perform better than this split?
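For reference, a splitc along those lines might look like the following; this is just a minimal sketch (the exact trailing-delimiter behaviour, chosen here to match split's, is my assumption):

```haskell
-- Sketch of a splitc specialized to a single-element delimiter.
-- Intended to match split's behaviour: a trailing delimiter
-- produces a trailing empty chunk, and an empty input gives [].
splitc :: Eq a => a -> [a] -> [[a]]
splitc _ [] = []
splitc d xs = go xs
  where
    go s = case break (== d) s of
        (chunk, [])     -> [chunk]
        (chunk, _:rest) -> chunk : go rest
```

So splitc ',' "a,b," gives ["a","b",""] and splitc ',' "" gives [], which should line up with Data.List.Utils.split "," on the same inputs.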
split's splitOn does not perform identically, but it is in a very close ballpark to the current MissingH split.
From way back in MissingH's history, dating to 2004, here is another definition of split. I have not verified its correctness, but I wonder how it performs in comparison?
```haskell
split :: Eq a => [a] -> [a] -> [[a]]
split delim str =
    let splitworker :: Eq a => [a] -> [a] -> [a] -> [[a]]
        splitworker delim [] [] = []
        splitworker delim [] accum = [accum]
        splitworker delim str accum =
            if delim == str
               then accum : [] : []
               else if startswith delim str
                       then accum : splitworker delim (drop (length delim) str) []
                       else splitworker delim (tail str) (accum ++ [head str])
     in splitworker delim str []
```
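As a quick, unverified sanity check of that 2004 definition, one can transcribe it (using Data.List.isPrefixOf in place of MissingH's startswith, which is the same function) and spot-check a few cases; the name split2004 and the expected results below are my own:

```haskell
import Data.List (isPrefixOf)

-- Transcription of the 2004 definition, renamed to split2004;
-- startswith is replaced by the equivalent Data.List.isPrefixOf.
split2004 :: Eq a => [a] -> [a] -> [[a]]
split2004 delim str = splitworker str []
  where
    splitworker [] []    = []
    splitworker [] accum = [accum]
    splitworker s accum
      | delim == s           = [accum, []]
      | delim `isPrefixOf` s = accum : splitworker (drop (length delim) s) []
      | otherwise            = splitworker (tail s) (accum ++ [head s])

-- Spot checks; each should be True if the definition is correct.
checks :: [Bool]
checks =
  [ split2004 "," "a,b,,c" == ["a","b","","c"]
  , split2004 "," "a,"     == ["a",""]
  , split2004 "ab" "1ab2"  == ["1","2"]
  ]
```

In GHCi, `and checks` comes out True for these cases, so at least on simple inputs it behaves like the current split.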
This time I used this data file:
```shell
perl -e 'for (1..50000000) { print $_." "}' > data
```
The current MissingH split profile with that:
```
        Wed Feb 15 12:14 2017 Time and Allocation Profiling Report  (Final)

           foo +RTS -p -RTS

        total time  =       80.99 secs   (80991 ticks @ 1000 us, 1 processor)
        total alloc = 99,733,231,232 bytes  (excludes profiling overheads)

COST CENTRE  MODULE           %time %alloc

spanList     Data.List.Utils   47.7   59.3
startswith   Data.List.Utils   27.0   14.1
MAIN         MAIN              14.7   17.8
split        Data.List.Utils    8.0    6.4
breakList    Data.List.Utils    2.6    2.4
```
Compare with the 2004 version:
```
        Wed Feb 15 12:17 2017 Time and Allocation Profiling Report  (Final)

           foo +RTS -p -RTS

        total time  =       52.74 secs   (52739 ticks @ 1000 us, 1 processor)
        total alloc = 47,066,563,520 bytes  (excludes profiling overheads)

COST CENTRE  MODULE           %time %alloc

MAIN         MAIN              93.1   70.2
startswith   Data.List.Utils    6.9   29.8
```
Compare with splitc:
```
        Wed Feb 15 12:19 2017 Time and Allocation Profiling Report  (Final)

           foo +RTS -p -RTS

        total time  =       38.25 secs   (38251 ticks @ 1000 us, 1 processor)
        total alloc = 71,155,452,896 bytes  (excludes profiling overheads)

COST CENTRE  MODULE           %time %alloc

MAIN         MAIN             100.0  100.0
```
splitc still wins by a reasonable margin; interestingly, it has more allocations. The 2004 split is a nice improvement, assuming it's correct.
Hello. I think the 2004 split is much cleaner, and it could be further improved to something like this, for example:
```haskell
split :: Eq a => [a] -> [a] -> [[a]]
split _ [] = []
split [x] str = splitc x str
split delim str =
    let splitworker :: Eq a => Int -> [a] -> [a] -> [a] -> [[a]]
        splitworker _ _ [] []    = []
        splitworker _ _ [] accum = [reverse accum]
        splitworker l delim str accum =
            let (a, t) = prefixLen 0 delim str
            in if a == l
                 then if null t
                        -- input ends exactly with the delimiter:
                        -- emit the chunk plus a trailing empty chunk
                        then reverse accum : [[]]
                        else reverse accum : splitworker l delim t []
                 else splitworker l delim (tail str) (head str : accum)
        -- counts how much of delim matches a prefix of the input,
        -- returning the count and the unconsumed input
        prefixLen :: Eq a => Int -> [a] -> [a] -> (Int, [a])
        prefixLen acc (x:xs) l@(y:ys)
            | x == y    = prefixLen (acc + 1) xs ys
            | otherwise = (acc, l)
        prefixLen acc _ ys = (acc, ys)
    in splitworker (length delim) delim str []
```
(Testing for equality and isPrefixOf have much in common, so prefixLen fuses them, and accumulating with (++) is bad, I believe.)
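The accumulator point is worth spelling out: appending one element at a time with (++) re-copies the accumulator on every step, which is quadratic over a whole run, while consing and reversing once at the end is linear. A toy illustration (the names are mine):

```haskell
-- Both build the same list; buildAppend is O(n^2) in total because
-- each (++) copies the accumulator, buildCons is O(n) overall.
buildAppend, buildCons :: [a] -> [a]
buildAppend = foldl (\acc x -> acc ++ [x]) []
buildCons   = reverse . foldl (flip (:)) []
```

This is exactly the difference between the 2004 splitworker's `accum ++ [head str]` and the `head str : accum` plus final `reverse` above.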
Actually, I also tried specializing split to a 2-character delimiter and didn't see significant improvements (unlike the splitc case).
On my system the 2004 version runs in ~41 seconds, the version in this post in ~29 seconds, and splitc in ~23 seconds, for the data generated above and a single-character delimiter.
Interesting. I wonder about multi-character delimiters? (Supporting them is one of the reasons for this split.) If it's no worse than the other implementations for those, then this is a slam dunk.
I generated the test file like this (the version with 1000000 iterations eats 3 GB of memory when used with a non-existent delimiter, so be careful):
```shell
perl -e 'for (1..1000000) { print $_."abcdefghij"}' > many10CharDelims
```
I tested all 3 versions (default, 2004, 2004Opt): https://github.com/fyrchik/randomthings (I pushed everything there, with the test script and results). I'd appreciate other tests to determine which version would make the fastest default.
For the delimiters 123 and abcdefghij, the default version is the slowest and 2004-with-optimizations the fastest. I also tried a non-existent delimiter, and there the 2004 version was the slowest, though it still consumed less memory than the default version. 2004-with-optimizations outperformed both other versions with all 3 delimiters, in terms of both time and memory (though it sometimes loses to the non-optimized 2004 version).
It would also be interesting to see whether spanList could be made more efficient.
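For context, spanList in Data.List.Utils looks roughly like this (quoted from memory, so treat it as approximate): the predicate is applied to every remaining suffix, and the matched prefix is rebuilt cons by cons, which goes some way toward explaining the allocation numbers in the profiles above.

```haskell
-- Approximately the Data.List.Utils definition: like span, but the
-- predicate sees the whole remaining list rather than one element.
spanList :: ([a] -> Bool) -> [a] -> ([a], [a])
spanList _ [] = ([], [])
spanList func list@(x:xs) =
    if func list
       then (x : ys, zs)
       else ([], list)
  where
    (ys, zs) = spanList func xs
```

As I recall, split is built on breakList, which is just spanList with the predicate negated, with startswith delim as the predicate, so every character of the input pays for a prefix comparison plus this reconstruction.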
Data.List.Utils.split, if you profile it, does a lot of allocation.
With a 600 MB "data" file containing short words, I profiled this program:
Compare with this program using a splitc specialized to a single-character delimiter:
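Neither benchmark program survived in this thread, but the splitc variant was presumably something along these lines (a hypothetical reconstruction: the inline splitc definition, the "data" file name, and the space delimiter are all my assumptions, matching the data file generated above):

```haskell
-- Hypothetical reconstruction of the splitc benchmark driver.
-- splitc is defined inline since it is not (yet) in Data.List.Utils.
splitc :: Eq a => a -> [a] -> [[a]]
splitc _ [] = []
splitc d xs = go xs
  where
    go s = case break (== d) s of
        (chunk, [])     -> [chunk]
        (chunk, _:rest) -> chunk : go rest

main :: IO ()
main = do
    s <- readFile "data"
    print (length (splitc ' ' s))
```

Built with profiling (ghc -prof -fprof-auto) and run with +RTS -p, a program of this shape would produce reports like the ones quoted above.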
So over twice as fast!
A simple improvement, then, would be to make split use splitc when the delimiter is a single character. But it might be better to investigate why the helper functions are doing so much allocation.