tonyyxliu / CUHKSZ-CSC4005

Project Materials for CUHK(SZ) Course CSC4005: Parallel Programming
MIT License

Optimization about PartB Sequential Part #27

Closed: szjiozi closed 1 year ago

szjiozi commented 1 year ago

I have unrolled the nested loops and replaced them with the 9 lines of code the hints suggest, but the performance is still far slower than the baseline. Why is that?

noah822 commented 1 year ago

A follow-up question: is there any hard requirement that we must follow when we implement this, or is anything OK as long as we achieve the performance?

Say, can I change the memory layout of the image? If so, does the time spent reconfiguring the image count toward the reported time? Also, can I use -O3 instead of -O2? My current implementation performs much better under -O3, but if I leave the provided source code unchanged, switching to -O3 gives no noticeable benefit.

Unrolling the nested loop does not give me any improvement either.
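To make the layout question concrete, here is a minimal sketch of one possible layout change: splitting an interleaved RGBRGB... buffer into three contiguous channel arrays and timing that step on its own. `PlanarImage`, `to_planar`, and `time_ms` are hypothetical names, not part of the provided course code, and the interleaved input layout is an assumption. The thread does not settle whether conversion time counts toward the reported time, so measuring it separately lets a report state both figures.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical planar ("structure of arrays") image; not a type from the course code.
struct PlanarImage {
    std::vector<unsigned char> r, g, b;
};

// Split an interleaved RGBRGB... buffer (assumed layout) into three
// contiguous channel arrays. Trailing bytes that do not form a full
// pixel are ignored.
PlanarImage to_planar(const std::vector<unsigned char>& rgb) {
    const std::size_t n = rgb.size() / 3;
    PlanarImage p;
    p.r.resize(n);
    p.g.resize(n);
    p.b.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        p.r[i] = rgb[3 * i];
        p.g[i] = rgb[3 * i + 1];
        p.b[i] = rgb[3 * i + 2];
    }
    return p;
}

// Run a callable once and return the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Wrapping the conversion and the filter in separate `time_ms` calls keeps the two costs apart, so either total can be reported once the policy is known.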

tonyyxliu commented 1 year ago

Hi @noah822

I don't think there is any hard requirement of the kind you describe. Feel free to use anything you can think of to speed up your parallel programs. You can compile with -O3, but your report needs to explain how -O3 helped speed up your program, for example which optimization option took effect.

Improving the performance is an endless job. Please complete the basic parallel implementation first, and make sure that you know how to solve a parallel computing problem and how to write parallel programs with acceptable performance. After that, if you still have time and want to take one step further, you may consider some advanced optimization techniques for extra speedup and extra credits. After all, this course is for parallel programming, not for high-performance computing.

tonyyxliu commented 1 year ago

Hi @szjiozi

I think replacing the nested loop over the size-3 filter should be enough to reach the baseline performance. Maybe the memory access order is not optimized? If you still cannot reach the baseline, don't get stuck on the sequential program; focus on implementing the six parallel programs instead. After all, the sequential implementation will not be marked.
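For illustration only, a minimal sketch of what an unrolled size-3 filter with a cache-friendly access order might look like. It assumes a single channel stored row-major in a flat buffer; `smooth3x3` and its parameters are hypothetical and are not the course's reference code. The outer loop walks rows and the inner loop walks columns, so consecutive iterations read consecutive addresses, and the nine products are written out explicitly instead of looping over the filter.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: applies a 3x3 filter `k` to one channel stored
// row-major in `in` (width * height bytes). Border pixels are left at 0
// in the output for brevity.
std::vector<unsigned char> smooth3x3(const std::vector<unsigned char>& in,
                                     int width, int height,
                                     const float k[3][3]) {
    std::vector<unsigned char> out(in.size(), 0);
    const std::size_t w = static_cast<std::size_t>(width);
    // Rows in the outer loop, columns in the inner loop: the inner loop
    // moves through memory with stride 1 in a row-major image.
    for (int y = 1; y < height - 1; ++y) {
        const std::size_t row = static_cast<std::size_t>(y) * w;
        const unsigned char* up   = in.data() + row - w;
        const unsigned char* mid  = in.data() + row;
        const unsigned char* down = in.data() + row + w;
        for (int x = 1; x < width - 1; ++x) {
            // The 3x3 filter loop, unrolled into nine explicit terms.
            const float sum =
                up[x - 1]   * k[0][0] + up[x]   * k[0][1] + up[x + 1]   * k[0][2] +
                mid[x - 1]  * k[1][0] + mid[x]  * k[1][1] + mid[x + 1]  * k[1][2] +
                down[x - 1] * k[2][0] + down[x] * k[2][1] + down[x + 1] * k[2][2];
            out[row + static_cast<std::size_t>(x)] = static_cast<unsigned char>(sum);
        }
    }
    return out;
}
```

If the loops were swapped so that `x` is the outer index, each inner step would jump `width` bytes, which is one common reason an unrolled version still trails a baseline.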