Closed CrepeGoat closed 3 months ago
Thanks for this nice issue report @CrepeGoat!
One thing that comes to mind is that running with the test command (roc test
) will include the compilation time.
So for a more accurate comparison I would do roc build myfile.roc --optimize
and time ./myfile.roc
. Note that this would require removing expect and switching to Stdout.line
.
Using sortAsc
instead of sortWith
would also make the comparison more equal.
Timing inside roc like we did here may also be useful to gain additional insights.
Thanks for the kind words @Anton-4 :)
You're right, roc test
is probably not the best place to do performance benchmarks; I will keep this in mind for timing comparisons moving forwards. I personally just used roc test
because 1) it felt easier at the time, and 2) I didn't expect it to make a ~50x difference (i.e., the 0.36s Python run vs the 20s Roc run). Specifically, from my rough observation the Roc compilation likely took ~1-2 seconds - I'm not sure where the other 18 seconds went.
I'll try using sortAsc
and direct internal timing, thanks for the tips! as an aside, does |> List.sortAsc
not compile to the same code as |> List.sortWith Num.compare
?
does |> List.sortAsc not compile to the same code as |> List.sortWith Num.compare?
It appears they do :) I did not do a diligent check before.
I'm not sure where the other 18 seconds went.
Uhu, I do expect we do have a real performance problem here but it's good to narrow things down.
It appears they do :) I did not do a diligent check before.
well I did no check before 😂 so thanks for putting in that work to confirm 👍
it's good to narrow things down.
Agreed! I think we're on the same page here 🙂
Rewrote the example (mostly) as suggested:
main.roc
app "test"
packages {
pf: "https://github.com/roc-lang/basic-cli/releases/download/0.7.0/bkGby8jb0tmZYsy2hg1E_B2QrCgcSTxdUlHtETwm5m4.tar.br",
}
imports [
pf.Stdout,
pf.Task,
]
provides [main] to pf
main : Task.Task {} I32
main =
n = 86896
list = List.range { start: At n, end: Before 0 }
listSorted = List.sortAsc list
Stdout.line "hello"
Building and running this code now generates the following timing information:
$ roc build --optimize
...(build stuff)
$ time ./main
hello
./main 16.04s user 0.03s system 99% cpu 16.102 total
Running objdump -d main
generates ~600k lines of instructions, so I'm electing not to look into the exact instructions myself 😅
Thanks for the simplification @CrepeGoat. I don't personally have time to look into this but my recommended next step would be to build with roc build --optimize --profiling
, and make a flamegraph. This will allow you to see what the program is actually spending time on.
hmm is the --profiling
flag new to roc? I get the following error when I run that with my current roc version (github:roc-lang/roc/dc37b7a31d4fa2014ec159eda1d51c27832ddca6
):
$ roc build --optimize --profiling
error: unexpected argument '--profiling' found
tip: to pass '--profiling' as a value, use '-- --profiling'
Usage: roc build --optimize [ROC_FILE]
For more information, try '--help'.
Yes it's pretty new, it was merged in on the fifth of December in PR #6184.
Got it, thanks.
Updated my roc version and ran as instructed, seems to work. However now I can't, for the life of me, find a macOS-compatible gprof 🥲
The instruments app should be able to produce a trace, did you try that?
thanks for the suggestion! I gave instruments a shot:
--profiling
and without; the trace outputs look the same from both.I think the results are accurate, slowness because of memcpy is something we've seen before but I'm not sure why it is doing all this copying here. I'll investigate later today :)
I've started a discussion on zulip with more analysis and to get help in finding out the cause.
This is not yet ready to merge, but on the blitsort
branch (poorly named cause currently it is just quadsort and later will be fluxsort), I have some big sorting perf gains.
This is just python3 -c "n = 10000000; l = list(reversed(range(n))); l.sort(); l == list(range(n))"
in python and in roc. With a final print depending on if the lists are equal. Much faster in Roc now!
Benchmark 1 (16 runs): ./roc-sort-10000000
measurement mean ± σ min … max outliers delta
wall_time 309ms ± 4.40ms 304ms … 317ms 0 ( 0%) 0%
peak_rss 162MB ± 101KB 162MB … 162MB 0 ( 0%) 0%
cpu_cycles 1.11G ± 6.33M 1.09G … 1.12G 0 ( 0%) 0%
instructions 1.73G ± 39.4 1.73G … 1.73G 0 ( 0%) 0%
cache_references 86.2M ± 629K 85.8M … 88.4M 1 ( 6%) 0%
cache_misses 38.8M ± 591K 38.0M … 39.7M 0 ( 0%) 0%
branch_misses 43.6K ± 6.44K 37.4K … 64.1K 1 ( 6%) 0%
Benchmark 2 (7 runs): python3 python-sort-10000000.py
measurement mean ± σ min … max outliers delta
wall_time 752ms ± 36.0ms 729ms … 828ms 0 ( 0%) 💩+143.1% ± 6.0%
peak_rss 810MB ± 74.4KB 810MB … 811MB 0 ( 0%) 💩+400.7% ± 0.1%
cpu_cycles 1.94G ± 26.1M 1.89G … 1.98G 0 ( 0%) 💩+ 75.0% ± 1.3%
instructions 5.11G ± 44.1K 5.11G … 5.11G 0 ( 0%) 💩+195.5% ± 0.0%
cache_references 112M ± 237K 111M … 112M 0 ( 0%) 💩+ 29.7% ± 0.6%
cache_misses 74.0M ± 84.6K 73.9M … 74.1M 0 ( 0%) 💩+ 90.7% ± 1.2%
branch_misses 1.50M ± 524K 730K … 2.19M 0 ( 0%) 💩+3345.2% ± 606.3%
EDIT: Also, this is without special handling for reversed list, but with that enabled, roc takes 75ms here.
Kinda intriguing, the perf bottleneck here is now List.range
, which takes up 70% of the execution time generating the two lists of length 10000000
. To be fair, the sort is now just reversing inplace, but I think we may have room for improvement for how we generate ranges.
system
github:roc-lang/roc/dc37b7a31d4fa2014ec159eda1d51c27832ddca6
setup
main.roc
problem
Running
roc test --optimize
terminates after ~20 seconds with the following correct test output:expected behavior
The given code should run faster relative to other languages, as described on the Roc website.
Specifically, I consider the above code to be roughly equivalent to the following Python code:
Timing this code via the *nix
time
command produces the following output:Note that the above results are for a data input ~300x larger than in the Roc code. (Atm the roc code won't even run for data that large - see #6293.)