microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Data Race when Predictor::Predictor is invoked for a shared Booster between multiple threads #6142

Open stonebrakert6 opened 1 year ago

stonebrakert6 commented 1 year ago

Description

When Predictor::Predictor is invoked concurrently from multiple threads for a shared Booster, it causes data races on shared data in the Boosting object (GBDT).

One affected API is LGBM_BoosterPredictForMatSingleRowFastInit(): calling it concurrently from two separate threads with the same Booster causes a data race (see the fully reproducible code below). The call chain is:

LGBM_BoosterPredictForMatSingleRowFastInit() -> Booster::SetSingleRowPredictor -> SingleRowPredictor::SingleRowPredictor() -> Predictor::Predictor() -> GBDT::InitPredict()

This path writes concurrently (i.e. a data race) to at least the variables num_iteration_for_pred_ and start_iteration_for_pred_ of the Boosting object (concretely GBDT), at src/boosting/gbdt.h:422.
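To make the pattern concrete, here is a minimal sketch of an initializer that stores a clamped iteration range on a shared object, with the writes serialized by a mutex. ToyBoosting and RunConcurrentInit are hypothetical names for illustration only; this is not LightGBM's GBDT code, and the clamping logic is an assumption modelled loosely on the std::min call visible in the trace.

```cpp
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a boosting object shared across threads.
// It is NOT LightGBM's GBDT; only the "write shared members during
// predictor setup" pattern is modelled here.
class ToyBoosting {
 public:
  explicit ToyBoosting(int total_iterations) : total_(total_iterations) {}

  // Writes shared members, so concurrent callers are serialized by mu_.
  // Without the lock_guard, two threads calling this would race.
  void InitPredict(int start_iteration, int num_iteration) {
    std::lock_guard<std::mutex> guard(mu_);
    start_iteration_for_pred_ = std::min(std::max(start_iteration, 0), total_);
    const int remaining = total_ - start_iteration_for_pred_;
    num_iteration_for_pred_ =
        num_iteration > 0 ? std::min(num_iteration, remaining) : remaining;
  }

  int NumIterationForPred() const {
    std::lock_guard<std::mutex> guard(mu_);
    return num_iteration_for_pred_;
  }

 private:
  mutable std::mutex mu_;
  const int total_;
  int start_iteration_for_pred_ = 0;
  int num_iteration_for_pred_ = 0;
};

// Calls InitPredict with identical arguments from several threads and
// returns the resulting iteration count.
int RunConcurrentInit(int nthreads, int total_iterations) {
  ToyBoosting boosting(total_iterations);
  std::vector<std::thread> threads;
  threads.reserve(nthreads);
  for (int i = 0; i < nthreads; ++i) {
    threads.emplace_back([&boosting] { boosting.InitPredict(0, 0); });
  }
  for (std::thread& t : threads) {
    t.join();
  }
  return boosting.NumIterationForPred();
}
```

A lock only makes the writes well-defined. If different threads request different iteration ranges, the last writer still wins, so keeping the range per predictor rather than on the shared object may be the more robust design.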

Another affected API is LGBM_BoosterPredictForMat(), which also causes a data race when invoked concurrently for the same Booster:

LGBM_BoosterPredictForMat() -> Booster::Predict() -> Booster::CreatePredictor() -> Predictor::Predictor()

See the related comments on #6024.

The code below, when run under ThreadSanitizer, should reproduce the race. I am trying to share a BoosterHandle between multiple threads only for inference/prediction. I intend to use the API LGBM_BoosterPredictForMatSingleRowFast and hence need to call LGBM_BoosterPredictForMatSingleRowFastInit to create/initialize a FastConfigHandle.

Reproducible example

#include <array>
#include <fstream>
#include <iostream>
#include <sstream>
#include <thread>
#include <vector>

#include "LightGBM/c_api.h"

const int kFeatures = 13;

std::vector<std::array<double, kFeatures>> readFile(const std::string& file) {
  std::vector<std::array<double, kFeatures>> ans;
  std::ifstream f(file);
  if (!f.is_open()) {
    std::cout << "Could not open file " << file << '\n';
    return ans;
  }
  bool is_header = true;
  std::string temp;
  int nline = 0;
  while (std::getline(f, temp)) {
    ++nline;
    if (is_header) {
      is_header = false;
      continue;
    }
    std::istringstream s(temp);
    std::string field;
    int idx = 0;
    std::array<double, kFeatures> row;
    while (std::getline(s, field, ',')) {
      row[idx++] = std::stod(field);
    }
    if (idx != kFeatures) {
      ans.clear();
      std::cout << "Incorrect # of cols in line " << nline << '\n';
      return ans;
    }
    ans.emplace_back(row);
  }
  return ans;
}

// shared booster handle for all threads
BoosterHandle handle;
// Input data
std::vector<std::array<double, kFeatures>> data;
// Final result or all predictions
std::vector<double> result;

void predict(ssize_t beg, ssize_t end) {
  FastConfigHandle config;
  int rc = LGBM_BoosterPredictForMatSingleRowFastInit(
      handle, C_API_PREDICT_NORMAL, 0, 0, C_API_DTYPE_FLOAT64, kFeatures, "",
      &config);
  if (rc != 0) {
    abort();
  }
  for (ssize_t i = beg; i < end; i++) {
    int64_t len = 0;
    rc = LGBM_BoosterPredictForMatSingleRowFast(config, &data[i], &len,
                                                &result[i]);
    if (rc != 0) {
      abort();
    }
  }
  rc = LGBM_FastConfigFree(config);
  if (rc != 0) {
    abort();
  }
}

int main(int argc, char* argv[]) {
  if (argc != 4) {
    std::cout
        << "Usage a.out <model_file> <input_file> <nworkers> ...exiting\n";
    return 1;
  }
  int nworkers = std::stoi(argv[3]);
  int num_iterations;
  std::cout << "Loading the Model from file\n";
  int rc = LGBM_BoosterCreateFromModelfile(argv[1], &num_iterations, &handle);
  if (rc != 0) {
    std::cout << "LGBM_BoosterCreateFromModelfile() returned " << rc << '\n';
    return 1;
  }
  data = readFile(argv[2]);
  ssize_t nrows = ssize(data);
  result.resize(nrows);
  std::vector<std::thread> workers(nworkers);
  ssize_t rows_per_thread = nrows / nworkers;
  for (ssize_t i = 0; i < nworkers; i++) {
    if (i != nworkers - 1) {
      workers[i] =
          std::thread(predict, rows_per_thread * i, rows_per_thread * (i + 1));
    } else {
      workers[i] = std::thread(predict, rows_per_thread * i, nrows);
    }
  }
  for (std::thread& t : workers) {
    t.join();
  }
  rc = LGBM_BoosterFree(handle);
  if (rc != 0) {
    abort();
  }
  return 0;
}
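One possible caller-side arrangement that avoids concurrent initialization in a program shaped like the one above is to create every per-thread handle sequentially on the main thread and only then start the workers. The sketch below is generic and does not call LightGBM: InitThenRun, init and work are hypothetical names. It assumes the per-row prediction calls only read the shared booster state, which this report does not establish, and it does nothing for the LGBM_BoosterPredictForMat() path.

```cpp
#include <thread>
#include <vector>

// Runs init(i) for every worker sequentially on the calling thread, then
// runs work(i, handle_i) in parallel, one thread per handle.
// Handle must be named explicitly; Init and Work are deduced.
template <typename Handle, typename Init, typename Work>
std::vector<Handle> InitThenRun(int nworkers, Init init, Work work) {
  std::vector<Handle> handles;
  handles.reserve(nworkers);
  // Phase 1: no other thread exists yet, so setup cannot race with itself.
  for (int i = 0; i < nworkers; ++i) {
    handles.push_back(init(i));
  }
  // Phase 2: each worker touches only its own element of `handles`.
  std::vector<std::thread> threads;
  threads.reserve(nworkers);
  for (int i = 0; i < nworkers; ++i) {
    threads.emplace_back([&handles, &work, i] { work(i, handles[i]); });
  }
  for (std::thread& t : threads) {
    t.join();
  }
  return handles;
}
```

In the reproducer, init would correspond to the LGBM_BoosterPredictForMatSingleRowFastInit call and work to the per-row prediction loop; whether that removes every report under ThreadSanitizer has not been verified here.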

Environment info

LightGBM version or commit hash:

git log --oneline

8ed371ce (HEAD -> master, origin/master, origin/HEAD) set explicit number of threads in every OpenMP parallel region (#6135)

Command(s) you used to install LightGBM

# this is part of a Makefile
mkdir -p LightGBM/build
env CC=$(CC) CXX=$(CXX) cmake -DUSE_DEBUG=ON -DUSE_SANITIZER=ON -DENABLED_SANITIZERS="thread" -DUSE_OPENMP=OFF -S LightGBM -B LightGBM/build
env CC=$(CC) CXX=$(CXX) VERBOSE=1 $(MAKE) -C LightGBM/build

Additional Comments

TSAN_OPTIONS="halt_on_error=1" ./builds/debug/d.out ~/Downloads/model.txt ~/Downloads/input_1k.txt 2
Loading the Model from file
==================
WARNING: ThreadSanitizer: data race (pid=15113)
  Read of size 4 at 0x7b5400000164 by thread T2:
    #0 int const& std::min<int>(int const&, int const&) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/stl_algobase.h:235:11 (lib_lightgbm.so+0x58dcd5)
    #1 LightGBM::GBDT::InitPredict(int, int, bool) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/boosting/gbdt.h:424:23 (lib_lightgbm.so+0x587276)
    #2 LightGBM::Predictor::Predictor(LightGBM::Boosting*, int, int, bool, bool, bool, bool, int, double) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/application/predictor.hpp:61:15 (lib_lightgbm.so+0x54125e)
    #3 LightGBM::SingleRowPredictorInner::SingleRowPredictorInner(int, LightGBM::Boosting*, LightGBM::Config const&, int, int) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/c_api.cpp:81:26 (lib_lightgbm.so+0x564367)
    #4 LightGBM::SingleRowPredictor::SingleRowPredictor(yamc::alternate::basic_shared_mutex<yamc::rwlock::ReaderPrefer>*, char const*, int, int, int, LightGBM::Boosting*, int, int) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/c_api.cpp:124:109 (lib_lightgbm.so+0x564d12)
    #5 LightGBM::Booster::InitSingleRowPredictor(int, int, int, int, int, char const*) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/c_api.cpp:440:52 (lib_lightgbm.so+0x51f55f)
    #6 LGBM_BoosterPredictForMatSingleRowFastInit /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/c_api.cpp:2443:18 (lib_lightgbm.so+0x50bb41)
    #7 predict(long, long) /home/kartik/codeberg/bug_lightgbm/main.cc:54:12 (d.out+0x12c76a)
    #8 decltype(std::declval<void (*)(long, long)>()(std::declval<long>(), std::declval<long>())) std::__1::__invoke[abi:v170000]<void (*)(long, long), long, long>(void (*&&)(long, long), long&&, long&&) /home/kartik/build/git_llvm/bin/../include/c++/v1/__type_traits/invoke.h:340:25 (d.out+0x13e602)
    #9 void std::__1::__thread_execute[abi:v170000]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(long, long), long, long, 2ul, 3ul>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(long, long), long, long>&, std::__1::__tuple_indices<2ul, 3ul>) /home/kartik/build/git_llvm/bin/../include/c++/v1/__thread/thread.h:221:5 (d.out+0x13e4ef)
    #10 void* std::__1::__thread_proxy[abi:v170000]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(long, long), long, long>>(void*) /home/kartik/build/git_llvm/bin/../include/c++/v1/__thread/thread.h:232:5 (d.out+0x13dbb2)

  Previous write of size 4 at 0x7b5400000164 by thread T1:
    #0 LightGBM::GBDT::InitPredict(int, int, bool) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/boosting/gbdt.h:422:29 (lib_lightgbm.so+0x587214)
    #1 LightGBM::Predictor::Predictor(LightGBM::Boosting*, int, int, bool, bool, bool, bool, int, double) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/application/predictor.hpp:61:15 (lib_lightgbm.so+0x54125e)
    #2 LightGBM::SingleRowPredictorInner::SingleRowPredictorInner(int, LightGBM::Boosting*, LightGBM::Config const&, int, int) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/c_api.cpp:81:26 (lib_lightgbm.so+0x564367)
    #3 LightGBM::SingleRowPredictor::SingleRowPredictor(yamc::alternate::basic_shared_mutex<yamc::rwlock::ReaderPrefer>*, char const*, int, int, int, LightGBM::Boosting*, int, int) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/c_api.cpp:124:109 (lib_lightgbm.so+0x564d12)
    #4 LightGBM::Booster::InitSingleRowPredictor(int, int, int, int, int, char const*) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/c_api.cpp:440:52 (lib_lightgbm.so+0x51f55f)
    #5 LGBM_BoosterPredictForMatSingleRowFastInit /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/c_api.cpp:2443:18 (lib_lightgbm.so+0x50bb41)
    #6 predict(long, long) /home/kartik/codeberg/bug_lightgbm/main.cc:54:12 (d.out+0x12c76a)
    #7 decltype(std::declval<void (*)(long, long)>()(std::declval<long>(), std::declval<long>())) std::__1::__invoke[abi:v170000]<void (*)(long, long), long, long>(void (*&&)(long, long), long&&, long&&) /home/kartik/build/git_llvm/bin/../include/c++/v1/__type_traits/invoke.h:340:25 (d.out+0x13e602)
    #8 void std::__1::__thread_execute[abi:v170000]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(long, long), long, long, 2ul, 3ul>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(long, long), long, long>&, std::__1::__tuple_indices<2ul, 3ul>) /home/kartik/build/git_llvm/bin/../include/c++/v1/__thread/thread.h:221:5 (d.out+0x13e4ef)
    #9 void* std::__1::__thread_proxy[abi:v170000]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(long, long), long, long>>(void*) /home/kartik/build/git_llvm/bin/../include/c++/v1/__thread/thread.h:232:5 (d.out+0x13dbb2)

  Location is heap block of size 584 at 0x7b5400000000 allocated by main thread:
    #0 operator new(unsigned long) /home/kartik/llvm-project/compiler-rt/lib/tsan/rtl/tsan_new_delete.cpp:64:3 (d.out+0x12b377)
    #1 LightGBM::Boosting::CreateBoosting(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, char const*) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/boosting/boosting.cpp:51:19 (lib_lightgbm.so+0x582aa4)
    #2 LightGBM::Booster::Booster(char const*) /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/c_api.cpp:164:21 (lib_lightgbm.so+0x519834)
    #3 LGBM_BoosterCreateFromModelfile /home/kartik/codeberg/lightgbm_bm/third_party/LightGBM/src/c_api.cpp:1843:43 (lib_lightgbm.so+0x503843)
    #4 main /home/kartik/codeberg/bug_lightgbm/main.cc:83:12 (d.out+0x12c98c)

  Thread T2 (tid=15117, running) created by main thread at:
    #0 pthread_create /home/kartik/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1020:3 (d.out+0xa43db)
    #1 std::__1::__libcpp_thread_create[abi:v170000](unsigned long*, void* (*)(void*), void*) /home/kartik/build/git_llvm/bin/../include/c++/v1/__threading_support:371:10 (d.out+0x13db29)
    #2 std::__1::thread::thread<void (&)(long, long), long, long&, void>(void (&)(long, long), long&&, long&) /home/kartik/build/git_llvm/bin/../include/c++/v1/__thread/thread.h:248:16 (d.out+0x12e35e)
    #3 main /home/kartik/codeberg/bug_lightgbm/main.cc:98:20 (d.out+0x12ccaa)

  Thread T1 (tid=15116, running) created by main thread at:
    #0 pthread_create /home/kartik/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1020:3 (d.out+0xa43db)
    #1 std::__1::__libcpp_thread_create[abi:v170000](unsigned long*, void* (*)(void*), void*) /home/kartik/build/git_llvm/bin/../include/c++/v1/__threading_support:371:10 (d.out+0x13db29)
    #2 std::__1::thread::thread<void (&)(long, long), long, long, void>(void (&)(long, long), long&&, long&&) /home/kartik/build/git_llvm/bin/../include/c++/v1/__thread/thread.h:248:16 (d.out+0x12e088)
    #3 main /home/kartik/codeberg/bug_lightgbm/main.cc:96:11 (d.out+0x12cbe3)

SUMMARY: ThreadSanitizer: data race /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/stl_algobase.h:235:11 in int const& std::min<int>(int const&, int const&)
==================

Ten0 commented 1 year ago

LGBM_BoosterPredictForMatSingleRowFastInit()-> Booster::SetSingleRowPredictor

I think the second example can indeed race, but I'm not sure how the first one currently races (before #6024), because there's a unique lock here: https://github.com/microsoft/LightGBM/blob/0f7983b6c3443154441cecaa342462d4567760b7/src/c_api.cpp#L377
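For readers unfamiliar with the locking scheme being referred to, the general idea of a unique (exclusive) lock for setup and a shared lock for prediction can be sketched with std::shared_mutex from C++17. ToySharedBooster and ConfigureAndPredictConcurrently are hypothetical names; this is an illustration of the technique, not the code at the linked line, and LightGBM's own mutex type (a yamc shared mutex, per the trace above) may behave differently.

```cpp
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical model of a booster guarded by a reader/writer lock.
class ToySharedBooster {
 public:
  // Setup mutates shared state, so it takes the lock exclusively.
  void Configure(int num_iteration) {
    std::unique_lock<std::shared_mutex> lock(mu_);
    num_iteration_ = num_iteration;
  }

  // Prediction only reads, so many threads may hold the lock together.
  int Predict(int x) const {
    std::shared_lock<std::shared_mutex> lock(mu_);
    return x * num_iteration_;
  }

 private:
  mutable std::shared_mutex mu_;
  int num_iteration_ = 1;
};

// Each thread configures the shared booster and then predicts once.
// Returns the sum of all predictions.
int ConfigureAndPredictConcurrently(int nthreads) {
  ToySharedBooster booster;
  std::vector<int> out(nthreads, 0);
  std::vector<std::thread> threads;
  threads.reserve(nthreads);
  for (int i = 0; i < nthreads; ++i) {
    threads.emplace_back([&booster, &out, i] {
      booster.Configure(3);
      out[i] = booster.Predict(2);
    });
  }
  for (std::thread& t : threads) {
    t.join();
  }
  int sum = 0;
  for (int v : out) {
    sum += v;
  }
  return sum;
}
```

The protection only holds if every write to the shared state happens while the exclusive lock is held; a write reached through a path that bypasses the lock would still race.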