verilog-to-routing / vtr-verilog-to-routing

Verilog to Routing -- Open Source CAD Flow for FPGA Research
https://verilogtorouting.org

std::tie is slower than expected for symbiflow benchmark tests #1225

Open xuqinziyue opened 4 years ago

xuqinziyue commented 4 years ago

The comparison using std::tie in operator<(const t_override& lhs, const t_override& rhs) for struct t_override in place_delay_model.h seems to be slower than expected.
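For reference, the std::tie idiom in question builds tuples of references and compares them lexicographically. A minimal, self-contained sketch of the pattern (the struct layout is taken from hzeller's later comment in this thread, not copied from place_delay_model.h):

```cpp
#include <tuple>

// Sketch of the t_override comparator pattern under discussion.
struct t_override {
    short from_type, to_type, from_class, to_class, delta_x, delta_y;
};

bool operator<(const t_override& lhs, const t_override& rhs) {
    // std::tie builds a tuple of references; tuple's operator< then
    // compares the fields lexicographically, left to right.
    return std::tie(lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y)
         < std::tie(rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y);
}
```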

Expected Behaviour

The function is only supposed to do a few comparisons between short numbers, so it is not expected to consume much execution time.

Current Behaviour

gprof shows the function is consuming more than 8% of place time on picosoc benchmark test for symbiflow

Possible Solution

I replaced the std::tie with the following code, and there seems to be around a 9% reduction in place time for the xc7 tests I measured:

        unsigned long long temp_left = (unsigned long long)lhs.from_type<<48 | (unsigned long long)lhs.to_type<<32 | (unsigned long long)lhs.from_class << 16 | (unsigned long long)lhs.to_class;
        unsigned long long temp_right = (unsigned long long)rhs.from_type<<48 | (unsigned long long)rhs.to_type<<32 | (unsigned long long)rhs.from_class << 16 | (unsigned long long)rhs.to_class;
        if (temp_left < temp_right) {
            return true;
        }
        else if (temp_left == temp_right) {
            temp_left = ((unsigned)lhs.delta_x << 16 | (unsigned)lhs.delta_y);
            temp_right = ((unsigned)rhs.delta_x << 16 | (unsigned)rhs.delta_y);
            return temp_left < temp_right;
        }
        else {
            return false;
        }

|  | Original place time | New place time |
| --- | --- | --- |
| top_bram_n1 | 2.39 | 2.15 |
| top_bram_n2 | 3.97 | 3.68 |
| top_bram_n3 | 6.09 | 5.51 |
| murax_basys | 6.3 | 5.69 |
| picosoc_basys3 | 38.7 | 35.11 |

Your Environment

vaughnbetz commented 4 years ago

Adding @kmurray to comment. Easton, are other QoR metrics for Symbiflow unaffected? (They should be, but it's best to check.) The next step is to turn on the placement delay override for the VTR and Titan benchmarks (see https://docs.verilogtorouting.org/en/latest/vpr/command_line_usage/), as it is a non-default option, and see what impact there is on place time (hopefully a reduction) and on other metrics (there should be none).

Finally, the code is a bit nasty looking and difficult to comprehend. No action until you get the QoR data, but if it shows good QoR you'd need to comment it and make it as readable as you can. Could it be wrapped in a function and still maintain speed?

xuqinziyue commented 4 years ago

I have compared the vtr_stdout.log before and after the change, and I did not see any QoR change there. I will run the QoR and Titan benchmark tests and see what the QoR result is. I will create some inline functions and wrap the change in them.

mithro commented 4 years ago

@hzeller - Could you help @xuqinziyue find a solution which doesn't require the nasty code?

FYI - @litghost

hzeller commented 4 years ago

How does a primitive, but slightly more 'boring to read', solution like this measure, performance-wise:

bool operator<(const t_override& lhs, const t_override& rhs) {
  std::initializer_list<short> left  = { lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y };
  std::initializer_list<short> right = { rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y };
  return std::lexicographical_compare(left.begin(), left.end(),
                                      right.begin(), right.end());
}

It would also make it easier to extend later for anyone adding a field without having to bitfiddle.

xuqinziyue commented 4 years ago

@hzeller I have just done a quick measurement with picosoc and here's the result:

placement time of the original code: 37.86
placement time using initializer_list: 37.03
placement time with the code I posted above: 34.95

so it seems changing std::tie to std::lexicographical_compare with an initializer_list does not provide much performance benefit, based on my measurement.

hzeller commented 4 years ago

Mmh, a thought worth exploring: if the struct contains only these values and no holes (which it does):

struct t_override {
        short from_type;
        short to_type;
        short from_class;
        short to_class;
        short delta_x;
        short delta_y;
};

Maybe we can treat it as an array of 6 int16_t, cast it to some int16_t*, and do some trickery with some relative of memcmp() that can deal with short? (I'd make the struct packed then, to make sure the layout is as we expect.)

(Maybe wmemcmp()? Though it might be worthwhile to closely test the assumptions w.r.t. signedness and sizeof(wchar_t) == sizeof(short), which might not hold on typical platforms these days.)

CPUs/CPU vendors might have int16 vector-comparison intrinsics (there I only see int32, but maybe there is int16 somewhere?).

And if there are only intrinsics available on some platforms, we can fall back to some 'slow' version with std::tie.

Does signedness matter? (In your code you essentially treat them as unsigned short.) If not, then pretending this is an array of three uint32_t might work?
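The signedness concern can be shown with a tiny standalone repro (a hypothetical helper, not VPR code): comparing the raw 16-bit patterns, as the bit-packing trick effectively does, sorts negative values above all non-negative ones.

```cpp
#include <cstdint>

// Compare two shorts by their raw bit patterns, as the 64-bit packing
// trick effectively does; -1 becomes 0xFFFF and sorts above 1.
bool packed_less(short a, short b) {
    return static_cast<uint16_t>(a) < static_cast<uint16_t>(b);
}
```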

mithro commented 4 years ago

I would suggest you get out https://godbolt.org/ and see why std::tie and the faster proposed method don't end up with a similar set of assembly instructions.

hzeller commented 4 years ago

Oh yes, godbolt is fun. I suspect the std::tie and initializer-list versions create all kinds of branches, while the data-assembly version does better in this regard even though it has to slurp in all the data vs. only up to the first mismatch in the other case.

xuqinziyue commented 4 years ago

@hzeller Thanks for pointing out the signedness issue. My code above only works if the two numbers have the same sign (both positive or both negative), which means in some cases the implementation won't work. Thanks also for your other suggestions; I will do some measurements to see which option works better.

xuqinziyue commented 4 years ago

@hzeller I tried some simple solutions like this:

      short left[6] = {lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y};
      short right[6] = {rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y};
      unsigned int i = 0;
      while (i < 6){
        if (left[i] == right[i]){
          i++;
        }
        else if (left[i] < right[i]){
          return true;
        }
        else{
          return false;
        }
      }
      return false;

With an implementation like this, the place time of picosoc is reduced from around 38s to around 36s, while gprof also reports around a 5% reduction in placement time. Do you think a solution like this is acceptable? I can also try more complicated solutions if necessary.

hzeller commented 4 years ago

Let's try an even simpler solution first (the simpler, the easier to maintain in the future).

Can you performance compare your manual loop (while leaving the array assignment the same) with a call to

std::lexicographical_compare(left, left+6, right, right+6)

? It should do the same thing, but that way we give the standard library an opportunity to employ optimized versions that might be available on the system. So I suspect it will not be slower than the hand-written compare (but possibly faster), and it is at least shorter to write (and read).

Another thing I'd try: we don't even have to copy the elements out of the struct; we could just treat the struct as the array itself (if the struct is packed, so maybe compile-assert sizeof(t_override) == 6 * sizeof(short); if not, we actually want to make it a packed struct. We also assume the compiler lays the fields out in exactly the sequence we give them; it is not strictly required to do so, but since they are all the same type it probably will):

   const short *left = reinterpret_cast<const short*>(&lhs);
   const short *right = reinterpret_cast<const short*>(&rhs);
   return std::lexicographical_compare(left, left+6, right, right+6);
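The compile-assert mentioned above could look like this (a sketch, assuming the six-short t_override layout quoted earlier; the static_assert guards the no-padding assumption behind the cast):

```cpp
#include <algorithm>

struct t_override {
    short from_type, to_type, from_class, to_class, delta_x, delta_y;
};

// Guard the reinterpret_cast: the comparison is only valid if the six
// shorts are laid out contiguously with no padding.
static_assert(sizeof(t_override) == 6 * sizeof(short),
              "t_override must be exactly six packed shorts");

bool operator<(const t_override& lhs, const t_override& rhs) {
    const short* left = reinterpret_cast<const short*>(&lhs);
    const short* right = reinterpret_cast<const short*>(&rhs);
    return std::lexicographical_compare(left, left + 6, right, right + 6);
}
```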
hzeller commented 4 years ago

Looks like using std::lexicographical_compare vs. a manual compare is very similar; in both cases the loop is unrolled with -O3.

And casting the struct to pretend it is an array of short is better still. So I'd go with reinterpret_cast<const short*>(&...) and std::lexicographical_compare()

https://godbolt.org/z/xVc53j

xuqinziyue commented 4 years ago

Based on https://godbolt.org/z/xVc53j the results should be very similar, but when I actually run the test, the manually implemented code is ~2s faster than lexicographical_compare.

I have checked the assembly of vpr for the manual implementation and for lexicographical_compare, and it seems lexicographical_compare is not inlined properly even with the -O3 optimization flag.

I found a function call like this within the operator<: callq 5455c0 <_ZSt30__lexicographical_compare_implIPKsS1_N9__gnu_cxx5__ops15_Iter_less_iterEEbT_S5_T0_S6_T1_.isra.96.lto_priv.4077>

and within lexicographical_compare_impl, the loop is not unrolled.

Using lexicographical_compare still reduces the place time for picosoc by around 2s, though, so I am not sure what the preference will be.

hzeller commented 4 years ago

Did you also try casting the address of the struct to const short*?

xuqinziyue commented 4 years ago

The code I dumped the assembly for is:

const short *left = reinterpret_cast<const short*>(&lhs);
const short *right = reinterpret_cast<const short*>(&rhs);
return std::lexicographical_compare(left, left+6, right, right+6);

and the place time for this version is around 36s (compared with 38s)

hzeller commented 4 years ago

So I'd say choose the fastest version you came across: if the manual written version is faster, then use that (and write a comment that std::lexicographical_compare was considered but slightly slower). If const short* casting is better than copy-to-array, use that.

xuqinziyue commented 4 years ago

Based on my experiment, const short* casting is faster than copy-to-array, so the fastest version so far is const short* casting + manual comparison.

Thank you so much for your help! I will create a pull request and run some qor benchmark tests later today!

mithro commented 4 years ago

@xuqinziyue

> I have checked the assembly of vpr for the manual implementation and for lexicographical_compare, and it seems lexicographical_compare is not inlined properly even with the -O3 optimization flag.

I think we should figure out why the use of the operator is not being inlined and unrolled correctly. If it is not happening here, then there are probably other places where it should be happening that are getting missed.

My naive guess, with no evidence, is that it is likely something to do with compilation units. BTW, are we using LTO?

xuqinziyue commented 4 years ago

> I think we should figure out why the use of the operator is not being inlined and unrolled correctly. If it is not happening here, then there are probably other places where it should be happening that are getting missed.

> My naive guess, with no evidence, is that it is likely something to do with compilation units. BTW, are we using LTO?

I think LTO is currently used, since -flto is there, but I am not familiar with LTO. Here is the sample command for linking vpr:

cd /scratch/local/xuqinziy/vtr-verilog-to-routing/build/vpr && /tools/cmake/install/3.9.0/bin/cmake -E cmake_link_script CMakeFiles/vpr.dir/link.txt --verbose=1 /usr/bin/c++ -O3 -DNDEBUG -flto -fno-fat-lto-objects CMakeFiles/vpr.dir/src/main.cpp.o CMakeFiles/vpr.dir/resources.C.o -o vpr libvpr.a ../libs/libarchfpga/libarchfpga.a ../libs/libpugiutil/libpugiutil.a ../libs/EXTERNAL/libsdcparse/libsdcparse.a ../libs/EXTERNAL/libblifparse/libblifparse.a ../libs/EXTERNAL/libtatum/libtatum/libtatum.a ../libs/EXTERNAL/libargparse/libargparse.a ../libs/EXTERNAL/libpugixml/libpugixml.a ../libs/EXTERNAL/libezgl/libezgl.a -lgtk-3 -lgdk-3 -latk-1.0 -lgio-2.0 -lpangocairo-1.0 -lgdk_pixbuf-2.0 -lcairo-gobject -lpango-1.0 -lcairo -lgobject-2.0 -lglib-2.0 -lX11 ../libs/libvtrcapnproto/liblibvtrcapnproto.a ../libs/libvtrutil/libvtrutil.a ../libs/liblog/liblog.a ../libs/EXTERNAL/capnproto/c++/src/capnp/libcapnp.a ../libs/EXTERNAL/capnproto/c++/src/kj/libkj.a -lpthread

mithro commented 4 years ago

@xuqinziyue If the comparison method ends up in a separate compilation unit from where it is used, then the compiler will be unable to inline it.

xuqinziyue commented 4 years ago

@mithro The function that is not inlined is std::lexicographical_compare, and I am not sure what rules apply to a standard-library function like this.

mithro commented 4 years ago

https://en.cppreference.com/w/cpp/language/inline and https://gcc.gnu.org/onlinedocs/gcc/Inline.html might help.

mithro commented 4 years ago

Just FYI - that function call is the following according to c++filt <bool std::__lexicographical_compare_impl<short const*, short const*, __gnu_cxx::__ops::_Iter_less_iter>(short const*, short const*, short const*, short const*, __gnu_cxx::__ops::_Iter_less_iter) [clone .isra.96] [clone .lto_priv.4077]>

xuqinziyue commented 4 years ago

Just for more clarification: I don't think there is a way to force the compiler to always inline a function in a library. I haven't found anything suggesting std::lexicographical_compare is defined as an inline function, and even if a function is defined as inline within the translation unit, the compiler can still decline to inline it. So isn't it the compiler's choice whether to inline the function or not?

mithro commented 4 years ago
/* Prototype.  */
inline void foo (const char) __attribute__((always_inline));
vaughnbetz commented 4 years ago

Easton: just discussed this in a meeting. There is still confusion about why this isn't inlined properly by the compiler; ideally the fix would be to get it inlined. Maybe there is a bigger gain to be had if we figure out why inlining isn't happening (if it is a general issue)?

  1. Does Tim's always_inline work?
  2. Try make build_type = release_pgo to see if that also captures the cpu gains.
  3. Is this also an issue on bigger designs (VTR, Titan). Need QoR data for them.
xuqinziyue commented 4 years ago

I have done some experiment with always_inline:

When operator< is implemented as follows:

 friend inline bool operator<(const t_override& lhs, const t_override& rhs) __attribute__((always_inline)) {
     const short* left = reinterpret_cast<const short*>(&lhs);
     const short* right = reinterpret_cast<const short*>(&rhs);
     return std::lexicographical_compare(left, left + 6, right, right + 6);
 }

The operator< and the lexicographical_compare seem to be inlined properly by the compiler, and the performance is similar to the manual implementation.

However when I change the implementation to

 friend inline bool operator<(const t_override& lhs, const t_override& rhs) __attribute__((always_inline)) {
     return std::tie(lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y)
          < std::tie(rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y);
 }

The operator< is again not inlined, and there is no reduction in execution time.

I have collected QoR for the manual implementation and it looks like this:

|  | original_results.txt | modified_results.txt |
| --- | --- | --- |
| vtr_flow_elapsed_time | 1 | 0.997897361 |
| odin_synth_time | 1 | 0.939841239 |
| abc_depth | 1 | 1 |
| abc_synth_time | 1 | 1.058311689 |
| num_clb | 1 | 1 |
| num_memories | 1 | 1 |
| num_mult | 1 | 1 |
| max_vpr_mem | 1 | 1.001470174 |
| num_pre_packed_blocks | 1 | 1 |
| num_post_packed_blocks | 1 | 1 |
| device_grid_tiles | 1 | 1 |
| pack_time | 1 | 0.981313116 |
| placed_wirelength_est | 1 | 1 |
| place_time | 1 | 0.950503869 |
| placed_CPD_est | 1 | 1 |
| min_chan_width | 1 | 1 |
| routed_wirelength | 1 | 1 |
| min_chan_width_route_time | 1 | 1.01539243 |
| crit_path_routed_wirelength | 1 | 1 |
| critical_path_delay | 1 | 1 |
| crit_path_route_time | 1 | 0.991334014 |

I can collect QoR and Titan results for the inline change, and try the build types to see how it goes.

vaughnbetz commented 4 years ago

Thanks Easton. That code is significantly easier to read (it would still need a clear comment explaining what it does and why it is coded that way). No idea why it doesn't inline as a friend function, but as long as it inlines with one option we hopefully get the CPU improvement. Getting full QoR data makes sense.

xuqinziyue commented 4 years ago

Updated QoR results for the inline change, with the vtr master branch as of March 18. It seems the modified code generally runs faster this time, maybe due to the machine (there should be no change in pack time and route time), but the change in place time is larger, which indicates the effect of function inlining.

|  | original_results.txt | modified_results.txt |
| --- | --- | --- |
| vtr_flow_elapsed_time | 1 | 0.98708214 |
| odin_synth_time | 1 | 1.078524821 |
| abc_depth | 1 | 1 |
| abc_synth_time | 1 | 0.986797326 |
| num_clb | 1 | 1 |
| num_memories | 1 | 1 |
| num_mult | 1 | 1 |
| max_vpr_mem | 1 | 1.000129628 |
| num_pre_packed_blocks | 1 | 1 |
| num_post_packed_blocks | 1 | 1 |
| device_grid_tiles | 1 | 1 |
| pack_time | 1 | 0.974220108 |
| placed_wirelength_est | 1 | 1 |
| place_time | 1 | 0.94762915 |
| placed_CPD_est | 1 | 1 |
| min_chan_width | 1 | 1 |
| routed_wirelength | 1 | 1 |
| min_chan_width_route_time | 1 | 1.01647566 |
| crit_path_routed_wirelength | 1 | 1 |
| critical_path_delay | 1 | 1 |
| crit_path_route_time | 1 | 0.969925067 |

xuqinziyue commented 4 years ago

An update for Titan results for the inline change:

The following 3 tests did not run due to running out of disk space, so the data collected is based on the other tests: segmentation_stratixiv_arch_timing.blif, SLAM_spheric_stratixiv_arch_timing.blif, des90_stratixiv_arch_timing.blif

|  | titan_original_results.txt | titan_modified_results.txt |
| --- | --- | --- |
| vtr_flow_elapsed_time | 1 | 0.975597952 |
| num_LAB | 1 | 1 |
| num_DSP | 1 | 1 |
| num_M9K | 1 | 1 |
| num_M144K | 1 | 1 |
| max_vpr_mem | 1 | 1.000010433 |
| num_pre_packed_blocks | 1 | 1 |
| num_post_packed_blocks | 1 | 1 |
| device_grid_tiles | 1 | 1 |
| pack_time | 1 | 1.01225225 |
| placed_wirelength_est | 1 | 1 |
| place_time | 1 | 0.954590719 |
| placed_CPD_est | 1 | 1 |
| routed_wirelength | 1 | 1 |
| critical_path_delay | 1 | 1 |
| crit_path_route_time | 1 | 1.027508016 |

The Titan results also suggest around a 5% reduction in place time.

vaughnbetz commented 4 years ago

Looks good to me. I think you just have to write some nice comments and commit.

mithro commented 4 years ago

Let's land your improvement.

I do think we should investigate further what is going on with std::tie stuff you mentioned here -> https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/1225#issuecomment-606211485 but that is a task for someone in the future.

BTW, for a way to use the always_inline attribute in a portable fashion, see the example in the Chromium source code at https://chromium.googlesource.com/chromium/src/base/+/refs/heads/master/compiler_specific.h#42

It looks like it should also be supported by modern clang: https://clang.llvm.org/docs/LanguageExtensions.html#has-attribute
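A portable wrapper in the spirit of the Chromium header might look like this (a sketch; the macro name is illustrative and not an existing VTR macro):

```cpp
// Portable always-inline hint, modeled on Chromium's compiler_specific.h.
#if defined(__GNUC__) || defined(__clang__)
#define VTR_ALWAYS_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define VTR_ALWAYS_INLINE __forceinline
#else
#define VTR_ALWAYS_INLINE inline
#endif

// Example use: ask the compiler to inline a small helper at call sites.
VTR_ALWAYS_INLINE int add_one(int x) { return x + 1; }
```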