[Final Project] Performance Competition (substituting final exam, due: 8th of July)

jeehoonkang commented 4 years ago

It turned out that we cannot physically gather for the final exam, so I decided not to hold it. Instead, as a substitute task, we'll have a performance competition as the final project.

For the competition, you’ll submit your entire compiler. Predefined benchmark programs will be compiled with it and then executed on the HiFive Unleashed (the first Linux-bootable RISC-V development board), sponsored by SemiFive. If any of the results are wrong, you’ll be disqualified. The geometric mean of the number of CPU cycles will be compared among the students’ compilers and clang -O1.
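Because the comparison uses a geometric mean, each benchmark counts by its relative speedup rather than its absolute cycle count: halving a small benchmark improves the score exactly as much as halving a very long one. A minimal Rust sketch of the metric, assuming plain per-benchmark cycle counts as input (the function names are hypothetical, not taken from the grader):

```rust
/// Geometric mean of positive cycle counts, computed as exp(mean(ln x)).
/// Summing logarithms avoids overflowing when multiplying many large counts.
fn geometric_mean(cycles: &[f64]) -> f64 {
    assert!(!cycles.is_empty(), "need at least one measurement");
    assert!(cycles.iter().all(|&c| c > 0.0), "cycle counts must be positive");
    let log_sum: f64 = cycles.iter().map(|c| c.ln()).sum();
    (log_sum / cycles.len() as f64).exp()
}

/// Score of `ours` relative to `baseline` (e.g. clang -O1) on the same benchmarks:
/// a value below 1.0 means fewer cycles in the geometric-mean sense.
fn relative_score(ours: &[f64], baseline: &[f64]) -> f64 {
    assert_eq!(ours.len(), baseline.len(), "same benchmark set expected");
    geometric_mean(ours) / geometric_mean(baseline)
}
```

With n benchmarks, halving any single cycle count multiplies the geometric mean by 2^(-1/n), whichever benchmark it is.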

Please do whatever you can to reduce the number of cycles, e.g., by implementing more optimizations or by improving your asmgen with a better register allocation algorithm.
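One well-known option for the better register allocation mentioned above is linear scan (Poletto and Sarkar). The thread does not prescribe any particular algorithm, so the following is only a Rust sketch of that technique, assuming live intervals have already been computed as inclusive instruction-index ranges; Interval, Location, insert_by_end, and linear_scan are hypothetical names:

```rust
use std::collections::HashMap;

/// Live range of one virtual register, as inclusive instruction indices.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Interval {
    vreg: usize,
    start: usize,
    end: usize,
}

/// Where a virtual register ends up.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Location {
    Reg(usize),   // index into the pool of allocatable physical registers
    Stack(usize), // spill slot number
}

/// Inserts into `active`, keeping it sorted by increasing end point.
fn insert_by_end(active: &mut Vec<(Interval, usize)>, iv: Interval, reg: usize) {
    let pos = active.partition_point(|(a, _)| a.end <= iv.end);
    active.insert(pos, (iv, reg));
}

/// Linear-scan allocation over `num_regs` registers.
fn linear_scan(mut intervals: Vec<Interval>, num_regs: usize) -> HashMap<usize, Location> {
    intervals.sort_by_key(|iv| iv.start);
    let mut free: Vec<usize> = (0..num_regs).rev().collect();
    let mut active: Vec<(Interval, usize)> = Vec::new(); // sorted by end point
    let mut result = HashMap::new();
    let mut next_slot = 0;

    for iv in intervals {
        // Expire intervals that ended before `iv` starts and reclaim their registers.
        active.retain(|&(old, reg)| {
            let expired = old.end < iv.start;
            if expired {
                free.push(reg);
            }
            !expired
        });

        if let Some(reg) = free.pop() {
            result.insert(iv.vreg, Location::Reg(reg));
            insert_by_end(&mut active, iv, reg);
            continue;
        }
        // All registers are busy: spill whichever of `iv` and the active
        // interval ending last has the later end point.
        match active.last().copied() {
            Some((last, reg)) if last.end > iv.end => {
                active.pop();
                result.insert(last.vreg, Location::Stack(next_slot));
                result.insert(iv.vreg, Location::Reg(reg));
                insert_by_end(&mut active, iv, reg);
            }
            _ => {
                result.insert(iv.vreg, Location::Stack(next_slot));
            }
        }
        next_slot += 1;
    }
    result
}
```

Linear scan usually trades some allocation quality for speed compared with graph coloring. Spilled values would still need loads and stores around their uses, which this sketch does not emit.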

If your compiler is better than clang -O1, you'll get A#. If your compiler is better than those of most students, you'll get A+. Depending on the performance of your compiler, you'll also get a bonus.

cyron1259 commented 4 years ago

Would there be some kind of a leaderboard so that we can compare against others' performance?

jeehoonkang commented 4 years ago

@cyron1259 Good idea! We will prepare a leaderboard soon.

hestati63 commented 4 years ago

As the whole compiler will be run in the final competition, I want to fuzz each optimization pass, but the fuzzer does not support this. Could you add options to fuzz the optimization passes?

Also, can you provide the command line argument for the compiler that will be used in the competition?

jeehoonkang commented 4 years ago

@hestati63

On fuzzing optimizations, let's discuss here: #178
On the competition's specification, I will soon prepare a detailed description of the submission procedure, competition rules, leaderboard, etc. For now, let's say (1) you can implement custom optimizations on the IR, and (2) you can optimize the naive asmgen introduced in the lecture videos.

jeehoonkang commented 4 years ago

Clarification: you need to observe the LP64D calling convention: https://github.com/kaist-cp/cs420/issues/209#issuecomment-643937143
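For scalar arguments, the LP64D rules reduce to a small assignment loop: float and double arguments go to fa0-fa7 while any remain and then fall back to the integer convention; integer and pointer arguments go to a0-a7 and then to the stack. A Rust sketch under those assumptions, ignoring structs, variadic calls, and 128-bit types, which the full convention handles separately (ArgKind, ArgLoc, and assign_scalar_args are hypothetical names):

```rust
/// Scalar argument classes: integer/pointer, or float/double.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ArgKind {
    Int,
    Float,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ArgLoc {
    IntReg(usize),   // a0..a7
    FloatReg(usize), // fa0..fa7
    Stack(usize),    // byte offset from sp at the call site
}

/// Assigns scalar arguments left to right (assumption: each stack argument
/// takes one 8-byte slot, as for XLEN-sized scalars on RV64).
fn assign_scalar_args(args: &[ArgKind]) -> Vec<ArgLoc> {
    let (mut next_int, mut next_float, mut stack_offset) = (0usize, 0usize, 0usize);
    args.iter()
        .map(|&kind| {
            if kind == ArgKind::Float && next_float < 8 {
                next_float += 1;
                ArgLoc::FloatReg(next_float - 1)
            } else if next_int < 8 {
                // Integers, and floats once fa0-fa7 are exhausted, use a0-a7.
                next_int += 1;
                ArgLoc::IntReg(next_int - 1)
            } else {
                stack_offset += 8;
                ArgLoc::Stack(stack_offset - 8)
            }
        })
        .collect()
}
```

For example, a ninth double argument lands in a0 rather than on the stack, because floating-point arguments fall back to the integer registers first.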

jeehoonkang commented 4 years ago

IMPORTANT UPDATE on FINAL PROJECT

Benchmark code is uploaded: https://github.com/kaist-cp/kecc-public/commit/114f38cbb6e9037b1d8c706abf814f1518a4c579
In the bench directory, execute make run. It will build your compiler, build the benchmark code, run it, and measure the elapsed CPU cycles. The average is your score (lower is better).
For the time being, it runs on QEMU, so the measurement is not accurate. I will soon provide a gg.kaist.ac.kr submission link so that you can run the benchmark code on the SiFive HiFive Unleashed RISC-V machine.
More benchmarks will be added in the near future.
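To track the score while iterating, the make run output can be post-processed. A Rust sketch, assuming each run prints one "[benchmark_name] cycles" line and a final "[AVERAGE]" summary line, as in the gcc -O results posted in this thread; parse_bench_output and median are hypothetical helpers, not part of the grader:

```rust
use std::collections::BTreeMap;

/// Groups the cycle counts of repeated runs by benchmark name.
/// Expects lines like "[matrix_mul] 373849"; the "[AVERAGE]" line and any
/// line that does not match this shape are skipped.
fn parse_bench_output(output: &str) -> BTreeMap<String, Vec<u64>> {
    let mut runs: BTreeMap<String, Vec<u64>> = BTreeMap::new();
    for line in output.lines() {
        let parsed = line
            .trim()
            .strip_prefix('[')
            .and_then(|rest| rest.split_once(']'))
            .and_then(|(name, value)| value.trim().parse::<u64>().ok().map(|v| (name, v)));
        if let Some((name, cycles)) = parsed {
            if name != "AVERAGE" {
                runs.entry(name.to_string()).or_default().push(cycles);
            }
        }
    }
    runs
}

/// Upper median of a set of runs (middle element after sorting); None if empty.
fn median(values: &[u64]) -> Option<u64> {
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    sorted.get(sorted.len() / 2).copied()
}
```

Comparing the per-benchmark median, rather than a single run, may make noisy hardware measurements easier to compare between versions of your compiler.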

hestati63 commented 4 years ago

Could you announce a specific deadline by which you will finalize the benchmark code?

cmpark0126 commented 4 years ago

IMPORTANT: you should use the la pseudo-instruction instead of a HI20/LO12 pair when obtaining the address of a global variable. To measure the performance of your compiler in the final project, we build a shared object from your assembly code. However, HI20 and LO12 relocations cannot be used when building a shared object, whereas code that uses la links into a shared object normally.

So, please use the la pseudo-instruction instead of a HI20/LO12 pair, as below:

```
# before
lui     a5,%hi(nonce)
lw      a5,%lo(nonce)(a5)

# after
la      a5,nonce       # a5 <- address of nonce
lw      a5,0(a5)       # a5 <- value of nonce
```
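In an asmgen, this amounts to emitting la for the address followed by an ordinary load for the value. A Rust sketch, where IntType and load_int_global are hypothetical names and the load mnemonic is chosen from the C type of the global:

```rust
/// C integer types a global variable might have (hypothetical, for illustration).
#[derive(Clone, Copy, Debug)]
enum IntType {
    I8,
    U8,
    I16,
    U16,
    I32,
    U32,
    I64,
    U64,
}

/// Emits assembly that loads an integer global into `rd` via `la`
/// (no %hi/%lo relocations, so the object can go into a shared library).
fn load_int_global(rd: &str, symbol: &str, ty: IntType) -> Vec<String> {
    let op = match ty {
        IntType::I8 => "lb",
        IntType::U8 => "lbu",
        IntType::I16 => "lh",
        IntType::U16 => "lhu",
        // LP64(D) keeps 32-bit values sign-extended in 64-bit registers,
        // even for unsigned int, so both use lw.
        IntType::I32 | IntType::U32 => "lw",
        IntType::I64 | IntType::U64 => "ld",
    };
    vec![
        format!("{:<8}{},{}", "la", rd, symbol), // rd <- address of symbol
        format!("{:<8}{},0({})", op, rd, rd),    // rd <- value at that address
    ]
}
```

For example, load_int_global("a5", "nonce", IntType::I32) produces the two "after" lines shown above.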

jeehoonkang commented 4 years ago

IMPORTANT:

I just uploaded the final project grader: https://github.com/kaist-cp/kecc-public/commit/542535fbd66c2ce4c0ae12b7b9b7fe135fb79d68 Please do whatever you can to improve the "[AVERAGE]" score reported by make run in the bench directory (lower is better). We recommend reading driver.cpp.

@hestati63 Sorry for uploading the grader late. It's now finalized.
You'll upload your entire src directory. Please run ./scripts/make-submissions.sh; the resulting final.zip is the file you'll upload to gg (TBA).

Medowhill commented 4 years ago

Hi. Could you let us know the scores of some reference compilers (for example, gcc -O0 and gcc -O1)? Currently, it is hard to tell whether my implementation performs well just from the score. Also, for anyone trying to beat gcc/clang -O1, those scores would be good targets.

jeehoonkang commented 4 years ago

@Medowhill make run-gcc evaluates GCC with the optimization flag -O on the same benchmarks. You can easily change the Makefile to evaluate gcc -O0 and gcc -O1 as well.

Medowhill commented 4 years ago

Thank you! I didn't notice that.

hestati63 commented 4 years ago

When will the gg grader be ready? As QEMU uses binary translation, the cycle count seems to depend only on the number of instructions.

jeehoonkang commented 4 years ago

@hestati63 I'm trying to provide the grader by tomorrow. Sorry for the delay.

jeehoonkang commented 4 years ago

You can submit the final project to gg now: https://gg.kaist.ac.kr/assignment/16/
It's running on a RISC-V machine: SiFive HiFive Unleashed running Linux

jeehoonkang commented 4 years ago

FYI, gcc -O's results are as follows:

[exotic_arguments_struct_small] 52
[exotic_arguments_struct_large] 77
[exotic_arguments_struct_small_ugly] 34
[exotic_arguments_struct_large_ugly] 138
[exotic_arguments_float] 18
[exotic_arguments_double] 19
[fibonacci_recursive] 52089252
[fibonacci_loop] 1640
[two_dimension_array] 72229
[matrix_mul] 373849
[matrix_add] 53248
[graph_dijkstra] 78160627
[graph_floyd_warshall] 151599746
[fibonacci_recursive] 52089692
[fibonacci_loop] 1787
[two_dimension_array] 74329
[matrix_mul] 372201
[matrix_add] 57362
[graph_dijkstra] 79328213
[graph_floyd_warshall] 151565586
[fibonacci_recursive] 52048084
[fibonacci_loop] 1754
[two_dimension_array] 72840
[matrix_mul] 377440
[matrix_add] 55333
[graph_dijkstra] 78468837
[graph_floyd_warshall] 151555926
[fibonacci_recursive] 52089934
[fibonacci_loop] 1759
[two_dimension_array] 72798
[matrix_mul] 372444
[matrix_add] 52443
[graph_dijkstra] 75623586
[graph_floyd_warshall] 151648428
[fibonacci_recursive] 52082904
[fibonacci_loop] 1755
[two_dimension_array] 72791
[matrix_mul] 373361
[matrix_add] 54438
[graph_dijkstra] 76326790
[graph_floyd_warshall] 151566425
[fibonacci_recursive] 52048784
[fibonacci_loop] 1782
[two_dimension_array] 72896
[matrix_mul] 379304
[matrix_add] 52175
[graph_dijkstra] 76327618
[graph_floyd_warshall] 151525424
[fibonacci_recursive] 52046427
[fibonacci_loop] 1758
[two_dimension_array] 72529
[matrix_mul] 370371
[matrix_add] 53737
[graph_dijkstra] 77295262
[graph_floyd_warshall] 151636428
[fibonacci_recursive] 52042896
[fibonacci_loop] 1775
[two_dimension_array] 72726
[matrix_mul] 376777
[matrix_add] 55529
[graph_dijkstra] 75582433
[graph_floyd_warshall] 151547499
[fibonacci_recursive] 52043659
[fibonacci_loop] 1896
[two_dimension_array] 72959
[matrix_mul] 370099
[matrix_add] 53497
[graph_dijkstra] 75628374
[graph_floyd_warshall] 151557803
[fibonacci_recursive] 52043419
[fibonacci_loop] 1780
[two_dimension_array] 72684
[matrix_mul] 373757
[matrix_add] 57870
[graph_dijkstra] 79321067
[graph_floyd_warshall] 151603631
[AVERAGE] 1.06947e+06

lomotos10 commented 4 years ago

@cmpark0126 I am currently having trouble understanding the la instruction. Does la return the address of the label, or the data inside that address?

cmpark0126 commented 4 years ago

> @cmpark0126 I am currently having trouble understanding the la instruction. Does la return the address of the label, or the data inside that address?

The address of the label

jesper-amilon commented 4 years ago

Should we use the la instruction only for the nonce object or for all global variables?

cmpark0126 commented 4 years ago

@christofides You need to use the la instruction for all global variables.

jesper-amilon commented 4 years ago

> > @cmpark0126 I am currently having trouble understanding the la instruction. Does la return the address of the label, or the data inside that address?
>
> The address of the label

So if it loads the address, do we also need to add lw to actually load the value of the variable? I.e.:

```
la     a5, nonce
lw     a5, 0(a5)
```

Edit: Another question: can la also be used to get the address of floating-point variables? (I assume so, but want to make sure.)

cmpark0126 commented 4 years ago

@christofides

Yes, you need to add a load instruction to get the value of the global variable, as below:
```
la     a5, nonce
lw     a5,  0(a5)
```
Yes, you can use the la instruction for a floating-point global variable as well; load its value with flw or fld instead of lw.
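For a floating-point global, la still places the address in an integer register, and flw or fld then loads the value into a floating-point register. A Rust sketch with hypothetical names, where tmp is an integer scratch register chosen by the caller:

```rust
/// Emits assembly that loads a float (flw) or double (fld) global into `fd`.
/// `la` can only write an integer register, so `tmp` holds the address.
fn load_float_global(fd: &str, tmp: &str, symbol: &str, is_double: bool) -> Vec<String> {
    let op = if is_double { "fld" } else { "flw" };
    vec![
        format!("{:<8}{},{}", "la", tmp, symbol), // tmp <- address of symbol
        format!("{:<8}{},0({})", op, fd, tmp),    // fd  <- value at that address
    ]
}
```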

kaist-cp / cs420

[Final Project] Performance Competition (substituting final exam, due: 8th of July) #168