roc-lang / roc

A fast, friendly, functional language.
https://roc-lang.org
Universal Permissive License v1.0
4.46k stars 313 forks source link

`List.sortWith` is notably slower than sorting in other non-systems languages #6294

Closed CrepeGoat closed 3 months ago

CrepeGoat commented 11 months ago

system

setup

main.roc

# references
# - https://github.com/roc-lang/examples/tree/main/examples/CommandLineArgs
app "main"
    packages {
        pf: "https://github.com/roc-lang/basic-cli/releases/download/0.7.0/bkGby8jb0tmZYsy2hg1E_B2QrCgcSTxdUlHtETwm5m4.tar.br",
    }
    imports [pf.Stdout, pf.Task]
    provides [main] to pf

main : Task.Task {} I32
main =
    Stdout.line "w/e"

expect
    n = 86896
    list = List.range { start: At n, end: Before 0 }
    List.sortWith list Num.compare == List.range { start: At 1, end: At n }

problem

Running roc test --optimize terminates after ~20 seconds with the following correct test output:

$ roc test --optimize

0 failed and 1 passed in 20094 ms.

expected behavior

The given code should run faster relative to other languages, as described on the Roc website.

Specifically, I consider the above code to be roughly equivalent to the following Python code:

n = 10000000 # a much larger value!
l = list(reversed(range(n)))
l.sort()
l == list(range(n))

Timing this code via the *nix time command produces the following output:

$ time python3 -c "n = 10000000; l = list(reversed(range(n))); l.sort(); l == list(range(n))"
python3 -c   0.36s user 0.10s system 96% cpu 0.475 total

Note that the above results are for a data input ~300x larger than in the Roc code. (Atm the roc code won't even run for data that large - see #6293.)

Anton-4 commented 11 months ago

Thanks for this nice issue report @CrepeGoat! One thing that comes to mind is that running with the test command (roc test) will include the compilation time. So for a more accurate comparison I would do roc build myfile.roc --optimize and time ./myfile.roc. Note that this would require removing expect and switching to Stdout.line.

Anton-4 commented 11 months ago

Using sortAsc instead of sortWith would also make the comparison more equal. Timing inside roc like we did here may also be useful to gain additional insights.

CrepeGoat commented 11 months ago

Thanks for the kind words @Anton-4 :)

You're right, roc test is probably not the best place to do performance benchmarks; I will keep this in mind for timing comparisons moving forwards. I personally just used roc test because 1) it felt easier at the time, and 2) I didn't expect it to make a ~50x difference (i.e., the 0.36s Python run vs the 20s Roc run). Specifically, from my rough observation the Roc compilation likely took ~1-2 seconds - I'm not sure where the other 18 seconds went.

I'll try using sortAsc and direct internal timing, thanks for the tips! as an aside, does |> List.sortAsc not compile to the same code as |> List.sortWith Num.compare?

Anton-4 commented 11 months ago

does |> List.sortAsc not compile to the same code as |> List.sortWith Num.compare?

It appears they do :) I did not do a diligent check before.

I'm not sure where the other 18 seconds went.

Uhu, I do expect we do have a real performance problem here but it's good to narrow things down.

CrepeGoat commented 11 months ago

It appears they do :) I did not do a diligent check before.

well I did no check before 😂 so thanks for putting in that work to confirm 👍

it's good to narrow things down.

Agreed! I think we're on the same page here 🙂

CrepeGoat commented 10 months ago

Rewrote the example (mostly) as suggested:

main.roc

app "test"
    packages {
        pf: "https://github.com/roc-lang/basic-cli/releases/download/0.7.0/bkGby8jb0tmZYsy2hg1E_B2QrCgcSTxdUlHtETwm5m4.tar.br",
    }
    imports [
        pf.Stdout,
        pf.Task,
    ]
    provides [main] to pf

main : Task.Task {} I32
main =
    n = 86896
    list = List.range { start: At n, end: Before 0 }
    listSorted = List.sortAsc list
    Stdout.line "hello"

Building and running this code now generates the following timing information:

$ roc build --optimize
...(build stuff)

$ time ./main
hello
./main  16.04s user 0.03s system 99% cpu 16.102 total

Running objdump -d main generates ~600k lines of instructions, so I'm electing not to look into the exact instructions myself 😅

Anton-4 commented 10 months ago

Thanks for the simplification @CrepeGoat. I don't personally have time to look into this but my recommended next step would be to build with roc build --optimize --profiling, and make a flamegraph. This will allow you to see what the program is actually spending time on.

CrepeGoat commented 10 months ago

hmm is the --profiling flag new to roc? I get the following error when I run that with my current roc version (github:roc-lang/roc/dc37b7a31d4fa2014ec159eda1d51c27832ddca6):

$ roc build --optimize --profiling
error: unexpected argument '--profiling' found

  tip: to pass '--profiling' as a value, use '-- --profiling'

Usage: roc build --optimize [ROC_FILE]

For more information, try '--help'.
Anton-4 commented 10 months ago

Yes it's pretty new, it was merged in on the fifth of December in PR #6184.

CrepeGoat commented 10 months ago

Got it, thanks.

Updated my roc version and ran as instructed, seems to work. However now I can't, for the life of me, find a macOS-compatible gprof 🥲

Anton-4 commented 10 months ago

The instruments app should be able to produce a trace, did you try that?

CrepeGoat commented 10 months ago

thanks for the suggestion! I gave instruments a shot:

Anton-4 commented 10 months ago

I think the results are accurate, slowness because of memcpy is something we've seen before but I'm not sure why it is doing all this copying here. I'll investigate later today :)

Anton-4 commented 10 months ago

I've started a discussion on zulip with more analysis and to get help in finding out the cause.

bhansconnect commented 3 months ago

This is not yet ready to merge, but on the blitsort branch (poorly named cause currently it is just quadsort and later will be fluxsort), I have some big sorting perf gains.

This is just python3 -c "n = 10000000; l = list(reversed(range(n))); l.sort(); l == list(range(n))" in python and in roc. With a final print depending on if the lists are equal. Much faster in Roc now!

Benchmark 1 (16 runs): ./roc-sort-10000000
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           309ms ± 4.40ms     304ms …  317ms          0 ( 0%)        0%
  peak_rss            162MB ±  101KB     162MB …  162MB          0 ( 0%)        0%
  cpu_cycles         1.11G  ± 6.33M     1.09G  … 1.12G           0 ( 0%)        0%
  instructions       1.73G  ± 39.4      1.73G  … 1.73G           0 ( 0%)        0%
  cache_references   86.2M  ±  629K     85.8M  … 88.4M           1 ( 6%)        0%
  cache_misses       38.8M  ±  591K     38.0M  … 39.7M           0 ( 0%)        0%
  branch_misses      43.6K  ± 6.44K     37.4K  … 64.1K           1 ( 6%)        0%
Benchmark 2 (7 runs): python3 python-sort-10000000.py
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           752ms ± 36.0ms     729ms …  828ms          0 ( 0%)        💩+143.1% ±  6.0%
  peak_rss            810MB ± 74.4KB     810MB …  811MB          0 ( 0%)        💩+400.7% ±  0.1%
  cpu_cycles         1.94G  ± 26.1M     1.89G  … 1.98G           0 ( 0%)        💩+ 75.0% ±  1.3%
  instructions       5.11G  ± 44.1K     5.11G  … 5.11G           0 ( 0%)        💩+195.5% ±  0.0%
  cache_references    112M  ±  237K      111M  …  112M           0 ( 0%)        💩+ 29.7% ±  0.6%
  cache_misses       74.0M  ± 84.6K     73.9M  … 74.1M           0 ( 0%)        💩+ 90.7% ±  1.2%
  branch_misses      1.50M  ±  524K      730K  … 2.19M           0 ( 0%)        💩+3345.2% ± 606.3%

EDIT: Also, this is without special handling for reversed list, but with that enabled, roc takes 75ms here.

bhansconnect commented 3 months ago

Kinda intriguing, the perf bottleneck here is now List.range, which takes up 70% of the execution time generating the two lists of length 10000000. To be fair, the sort is now just reversing inplace, but I think we may have room for improvement for how we generate ranges.