Optimization for eliminating redundant memory operations

yyzdtccjdtc commented 2 years ago

Optimization for eliminating redundant memory operations related to this issue. https://github.com/sharkdp/pastel/issues/163#issue-1223375579

sharkdp commented 2 years ago

Nice. Can you tell us a bit more about how you benchmarked this?

yyzdtccjdtc commented 2 years ago

Nice. Can you tell us a bit more about how you benchmarked this?

I tested it with the command line 'pastel distinct 2000 >/dev/null' and then took the average execution time. According to my test results, the running time was reduced from 9.54 seconds to 8.41 seconds, which has a 13.4% speedup.

Also I compiled it with the opt-level = 3.

sharkdp commented 2 years ago

Unfortunately, I can not reproduce these benchmark results:

Command	Mean [s]	Min [s]	Max [s]	Relative
`./pastel-master distinct 200`	1.166 ± 0.023	1.141	1.208	1.00
`./pastel-165 distinct 200`	1.252 ± 0.023	1.215	1.298	1.07 ± 0.03

These results were created using a normal cargo build --release build. Benchmarking was done with hyperfine:

hyperfine -L version master,165 --warmup 3 './pastel-{version} distinct 200' --export-markdown results.md

yyzdtccjdtc commented 2 years ago

Unfortunately, I can not reproduce these benchmark results:

Command Mean [s] Min [s] Max [s] Relative ./pastel-master distinct 200 1.166 ± 0.023 1.141 1.208 1.00 ./pastel-165 distinct 200 1.252 ± 0.023 1.215 1.298 1.07 ± 0.03 These results were created using a normal cargo build --release build. Benchmarking was done with hyperfine:
hyperfine -L version master,165 --warmup 3 './pastel-{version} distinct 200' --export-markdown results.md

I believe if you try to change it to distinct 2000, you can see the difference.

sharkdp commented 2 years ago

pastel distinct 2000 is not really a realistic use-case, to be honest. Everything up to 100... maybe. But who wants to generate 2000 "visually distinct" colors?

yyzdtccjdtc commented 2 years ago

By carefully looking into the code I found the problem. The Lab structures in the Vec<(Color, Lab)> are generated here.

https://github.com/sharkdp/pastel/blob/47f9ddd59ea14df8d9d1f1073d14501751a49fd0/src/distinct.rs#L82-L85

So my get_labs function to get the Lab structure out is a redundant operation.

    pub fn get_labs(&self) -> Vec<Lab> {
        self.colors.iter().map(|(_, l)| l.clone()).collect()
    }

My current approach is to generate two separate labs and colors vectors directly at the time of this SimulatedAnnealing structure creation.

According to my tests with ./pastel distinct 200 >results.md, the average execution time decreased from 0.7158s to 0.6389s, which is a 12% speedup. Hopefully, your test will achieve the same results as mine :)

sharkdp commented 2 years ago

Now I can reproduce those results - thank you!

yyzdtccjdtc commented 2 years ago

I changed all labs to lab_values and fixed the clippy warning. I also replaced the derive copy with clone() function at the same time.

The struct contains 4 64bit floats, so it's probably cheaper to usually pass it by reference, not by value, right?

I also looked into this issue a bit. I want to state in advance that this is a compiler sensitive issue.According to the output file of objdump, I found that if we use the reference like this let at_lab = &lab_values[color], it will load the address of lab_values[color] first, then push this address one the stack, and finally it will first load the address from the stack and then load the three values from this address (because cie76 and ciede2000 functions only need the first three elements of Lab struct to do the calculation). So a total of 6 memory operations were performed.

But if we use copy() or clone() let at_lab = lab_values[color].clone();, it will load the values directly with two movupd instruction which can load 128 bits at a time. It then does the same three load operations as the reference to do the calculation. There are a total of 5 memory operations here, one less than reference.

And the execution time also proves this, the average time of reference is 0.68s (1.06x speedup), while the average time of clone is 0.64s (1.12x speedup).

If you think this modification is not so generic, I can also remove it and keep only the code related to the optimization of the separation of colors and lab_values vectors.

yyzdtccjdtc commented 2 years ago

@sharkdp Please tell me if there are any other changes I need to make?

sharkdp commented 2 years ago

No, looks good. Thank you very much!

sharkdp / pastel

Optimization for eliminating redundant memory operations #165