intel / qpl

Intel® Query Processing Library (Intel® QPL)
https://intel.github.io/qpl/
MIT License

[fix] fix dispatch table copy #24

Open Jonas-Heinrich opened 1 year ago

Jonas-Heinrich commented 1 year ago

Hi,

While comparing QPL to the related functionality in TUM's Umbra, I noticed that QPL has significant overhead for small numbers of tuples. After investigating with VTune and the microbenchmark below, I believe I found a typo that causes an accidental copy of the function pointer table in input_stream_t::initialize_sw_kernels. Here's the benchmark:

#include "qpl/qpl.h"
#include <benchmark/benchmark.h>

#include "../DataDistribution.hpp"
#include "../Utils.hpp"

int main() {
   // pointer wrapper with destructor
   ExecutionContext fixture(qpl_path_software);
   qpl_job* job = reinterpret_cast<qpl_job*>(fixture.getExecutionContext());

   uint32_t numTuples = 1;
   double selectivity = 0.5;
   uint32_t inBitWidth = 8;
   qpl_out_format outFormat = qpl_out_format::qpl_ow_32;

   // private utilities; they do what you would expect
   LittleEndianBufferBuilder builder(inBitWidth, numTuples);
   Dataset dataset = UniformDistribution::generateDataset(
      builder,
      inBitWidth,
      (1ul << inBitWidth) - 1,
      numTuples,
      umbra::VectorizedFunctions::Mode::Eq,
      selectivity);
   std::vector<uint8_t> destination;
   destination.resize(divceil(32 * numTuples, 8));

   for (size_t i = 0; i < 1'000'000'000; i++) {
      if (i % 1'000'000 == 0) {
         std::cout << i << std::endl;
      }

      // Parameterize jobs.
      {
         job->parser = builder.getQPLParser();
         job->next_in_ptr = dataset.data.data();
         job->available_in = dataset.data.size();
         job->next_out_ptr = destination.data();
         job->available_out = static_cast<uint32_t>(destination.size());
         job->op = map_umbra_to_qpl_op(umbra::VectorizedFunctions::Mode::Eq);
         job->src1_bit_width = inBitWidth;
         job->num_input_elements = numTuples;
         job->out_bit_width = outFormat;
         auto [param_low, param_high] = dataset.predicateArguments->template getValues<uint32_t, 2>();
         job->param_low = param_low;
         job->param_high = param_high;
         job->flags = static_cast<uint32_t>(QPL_FLAG_OMIT_CHECKSUMS | QPL_FLAG_OMIT_AGGREGATES);
      }

      qpl_status status = qpl_execute_job(job);
      if (status != QPL_STS_OK) {
         ERROR("An error occurred during job execution: " << status);
      }

      const auto indicesByteSize = job->total_out;
      const auto bytesPerHit = divceil(32, 8);
      const auto qplHits = indicesByteSize / bytesPerHit;
      if (outFormat != qpl_ow_nom && qplHits != dataset.predicateHits) {
         ERROR("Result does not fit expectations: " << qplHits << " != " << dataset.predicateHits);
      } else if (outFormat == qpl_ow_nom && job->total_out != divceil(numTuples, 8)) {
         ERROR("Result does not fit expectations: " << job->total_out << " != " << divceil(numTuples, 8));
      }

      benchmark::DoNotOptimize(job->total_out);
      benchmark::DoNotOptimize(status);
      benchmark::DoNotOptimize(destination);
      benchmark::ClobberMemory();
   }
}

The benchmark was run for 15 s on an i9-13900K. Screenshot of VTune summary before the PR:

[screenshot: VTune summary before the PR]

after the PR:

[screenshot: VTune summary after the PR]

Jonas-Heinrich commented 1 year ago

After further investigation, I noticed that the issue is also present in other locations. The second force-push now covers those as well (found by searching for core_sw::dispatcher::kernels_dispatcher::get_instance()).

mzhukova commented 1 year ago

Hi @Jonas-Heinrich, thank you for the contribution! The team will review and do thorough testing on our side in the upcoming weeks.