
Issue with call to c_api Predict functions in multi threaded way from golang #3751

Closed tarunreddy1018 closed 3 years ago

tarunreddy1018 commented 3 years ago

How we are using LightGBM:

We are using the LightGBM c_api in our service. But this c_api is being called from golang. Basically we have a golang wrapper around the c_api. And we call the c_api functions from golang using cgo.

We get the “lib_lightgbm.so” library file from the Github release section and use them.

Version of LightGBM being used:

3.1.1 (Observed with all versions >= 3.0.0)

LightGBM component:

C++ Library

Environment info:

Operating System: Observed on both Linux and MacOS

Architecture: x86_64

CPU model: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz

C++ compiler version: gcc version 7.2.0 (Debian 7.2.0-1)

CMake version: 3.9.0

GLIBC version: ldd (Debian GLIBC 2.28-10) 2.28

Context:

Starting with LightGBM 3.0.0, the Predict functions were updated so that only the parts that mutate state take a "unique_lock", while the parts that only do computation take a "shared_lock".

In earlier versions such as 2.3.1, a "lock_guard" was taken at the beginning of the predict function itself, so only one thread could execute the predict function at a time. This did not scale well in production environments where we run many predictions in parallel.

So we decided to upgrade to the newer LightGBM 3.1.1, which really improved performance, since multiple threads can now execute the predict function in parallel thanks to the "shared_lock".
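
For context, a minimal sketch of the two locking modes being discussed, written against standard C++ primitives rather than LightGBM's internal lock macros (which wrap equivalent functionality):

#include <shared_mutex>

std::shared_mutex mutex_;

// Read-only path: any number of threads may hold the shared lock at once,
// which is what allows parallel prediction calls in 3.x.
void ReadOnlyPredictPath() {
  std::shared_lock<std::shared_mutex> lock(mutex_);
  // ... computation that must not mutate shared state ...
}

// Mutating path: the unique lock excludes all other holders, serializing
// callers the way the 2.3.1 lock_guard did.
void MutatingPath() {
  std::unique_lock<std::shared_mutex> lock(mutex_);
  // ... updates to shared state ...
}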

Issue:

But now we saw that the predictions were not consistent. That is, if many threads invoke the predict function in parallel at large scale, and we send the same input to the service multiple times, we get different predictions on each call.

This was not an issue with the earlier LightGBM version we used, 2.3.1. All predictions were consistent with that version, probably because of the lock it had in the predict function, which is why we did not notice this behaviour.

To test this quickly, we put a lock around the call to the predict function, and all predictions were then consistent and as expected.
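
Our actual workaround lives in the golang wrapper, but conceptually it is just a mutex around the C API call; a hedged C++ sketch of the same idea (assuming the 3.1.x LGBM_BoosterPredictForMatSingleRow signature with start_iteration, a single-output model, and with error handling omitted):

#include <cstdint>
#include <mutex>
#include <vector>
#include <LightGBM/c_api.h>

// One global mutex serializing every prediction call, mirroring the
// workaround described above.
std::mutex predict_mutex;

// booster: handle obtained earlier from LGBM_BoosterCreateFromModelfile.
// features: one dense row with one value per model feature.
double PredictOneRow(BoosterHandle booster, const std::vector<double>& features) {
  double score = 0.0;
  int64_t out_len = 0;
  std::lock_guard<std::mutex> guard(predict_mutex);  // only one caller at a time
  LGBM_BoosterPredictForMatSingleRow(
      booster, features.data(), C_API_DTYPE_FLOAT64,
      static_cast<int>(features.size()), /*is_row_major=*/1,
      C_API_PREDICT_NORMAL, /*start_iteration=*/0, /*num_iteration=*/-1,
      "", &out_len, &score);
  return score;
}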

So we thought there could either be a race condition in the code, or that LightGBM was not meant to run predictions in parallel. To investigate further, we started looking into the predict function in the c_api and found that it has a vector of vectors from which each thread picks its own vector using the thread id (tid), copies the input data into that vector, and passes it to the predict function. Once the predict call is complete, it clears the vector.

Definition of Vector of Vectors:
std::vector<std::vector<double, Common::AlignmentAllocator<double, kAlignedSize>>> predict_buf_;

Code where Vector of Vectors is used:

predict_fun_ = [=](const std::vector<std::pair<int, double>>& features,
                                             double* output)  {
          int tid = omp_get_thread_num();
          if (num_feature_ > kFeatureThreshold &&
              features.size() < KSparseThreshold) {
            auto buf = CopyToPredictMap(features);
            boosting_->PredictByMap(buf, output, &early_stop_);
          } else {
            CopyToPredictBuffer(predict_buf_[tid].data(), features);
            boosting_->Predict(predict_buf_[tid].data(), output, &early_stop_);
            ClearPredictBuffer(predict_buf_[tid].data(),
                               predict_buf_[tid].size(), features);
          }
        };

So we suspected that the problem could be arising from this vector of vectors, and tried logging the tid into a file. Surprisingly, we saw that all of the calls were logging "tid" as "0".

This means that all the threads were using the same vector, the one for "tid" "0", which led to a race condition.

Could you please let us know whether the latest update was not meant to run predictions in parallel, or whether this could be an issue with our golang code calling the c_api (probably with OpenMP)?

StrikerRUS commented 3 years ago

Linking #2760.

cc @JoanFM

JoanFM commented 3 years ago

I will try to take a look when I have some time, but I am not sure I will be able to soon.

JoanFM commented 3 years ago

From the description it seems the thread_id is the source of the problem then? So could it be that this is more of a pre-existing issue that #2760 just exposed?

JoanFM commented 3 years ago

Could this be related to https://forum.openmp.org//viewtopic.php?f=3&t=653&start=0 ?

tarunreddy1018 commented 3 years ago

@JoanFM Let me check with some sample C++ code whether my issue is the same as the one described in the link. I will get back to you on this soon.
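
For reference, a standalone check along those lines (not LightGBM-specific) could look like the following; omp_get_thread_num() reports the id within the current OpenMP team, so threads created outside of OpenMP all see 0:

#include <cstdio>
#include <thread>
#include <vector>
#include <omp.h>

int main() {
  // Threads created outside OpenMP are not OpenMP worker threads,
  // so omp_get_thread_num() returns 0 in each of them.
  std::vector<std::thread> pool;
  for (int i = 0; i < 4; ++i) {
    pool.emplace_back([i] {
      std::printf("std::thread %d sees omp_get_thread_num() = %d\n",
                  i, omp_get_thread_num());
    });
  }
  for (auto& t : pool) t.join();

  // Inside an OpenMP parallel region the ids are distinct.
  #pragma omp parallel num_threads(4)
  {
    std::printf("omp worker id = %d\n", omp_get_thread_num());
  }
  return 0;
}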

tarunreddy1018 commented 3 years ago

Hi @AlbertoEAF, can you check this thread? The other one was an older thread which I had posted before I made any of these observations.

JoanFM commented 3 years ago

Did you manage to reproduce the issue as in the thread, @tarunreddy1018?

tarunreddy1018 commented 3 years ago

Hi @JoanFM , I have not yet tried reproducing the issue. I will first try running some sample C++ code as mentioned in the link, which should give a clearer picture. If the thread_id (tid) still comes out as 0, then the issue is related to our OpenMP library linking and not really to the c_api. I will check that and get back to you on this same thread.

Meanwhile, can you let me know: starting from v3.0.0 there were changes in the locking strategy, right, introducing "shared_locks" and "unique_locks"? Seeing "shared_locks" in the predict function call made me believe that we can call the predict functions in parallel, which should give a performance improvement. So, were the predict functions written with the intent of being called in parallel?

I mean, was there any test that calls the predict function in a multi-threaded way and ensures thread safety? Has anyone else faced this issue, or is it just me? There are 2 things that could be the potential issues here:

  1. Issue with OpenMP linking and its usage when golang code calls the c_api.
  2. Issue in the c_api (issue with thread safety).

JoanFM commented 3 years ago

Hi @JoanFM , I have not yet tried reproducing the issue. I will first try running some sample C++ code as mentioned in the link, which should give a clearer picture. If the thread_id (tid) still comes out as 0, then the issue is related to our OpenMP library linking and not really to the c_api. I will check that and get back to you on this same thread.

Meanwhile, can you let me know: starting from v3.0.0 there were changes in the locking strategy, right, introducing "shared_locks" and "unique_locks"? Seeing "shared_locks" in the predict function call made me believe that we can call the predict functions in parallel, which should give a performance improvement. So, were the predict functions written with the intent of being called in parallel?

Hey @tarunreddy1018 ,

I added the shared and unique locks under the assumption that const functions carry no risk of changing the object, while keeping unique locks for non-const functions. So from this point of view I do not see why it would be unsafe to run it in a multithreaded way; however, I am not so sure about the design and the robustness that was put into it from the beginning.

When I opened the PR I had the intention to use this in a project, but I did not get to the point of using it in a multithreaded way, so I did not fully test the behavior.

AlbertoEAF commented 3 years ago

Thanks for the very detailed report @tarunreddy1018 ;)

Linking #2760.

cc @JoanFM

Great catch @StrikerRUS! I was looking for that

From the description it seems the thread_id is the source of the problem then? So could it be that this is more of a pre-existing issue that #2760 just exposed?

I think @JoanFM might be dead right: in my experience with lightgbmlib 2.3.150 (which corresponds to some commit between v2.3.1 and v2.3.2), these random scores already existed when using Java. Such is the case that we had to use synchronization in our Java provider code.

Here is the proof, we had to wrap our provider's Predict calls (line 147) with a single-access region (no parallelism): https://github.com/feedzai/feedzai-openml-java/blob/24a62dca18ddd5cf538256f1acc621cb3b092220/openml-lightgbm/src/main/java/com/feedzai/openml/provider/lightgbm/LightGBMSWIG.java#L138

and we used at the time lightgbmlib 2.3.150 (https://github.com/feedzai/feedzai-openml-java/blob/24a62dca18ddd5cf538256f1acc621cb3b092220/openml-lightgbm/pom.xml#L26)

If we didn't do the synchronization (no parallelism) we would have random scores.

I think @tarunreddy1018 's observations regarding those thread ids are dead right too and could be a good lead for investigation. I just don't understand why he didn't have such issues with the old code when I did, but clearly the issues were already there.


I never looked into LightGBM's multi-threading implementation, so here are my 2 cents - take them with a grain of salt:

In the end it means that, as a stop-gap until the issue is properly fixed, we should probably go back to unique locks (if that fixes it) and use shared locks only on regions that truly do read-only calls. This stop-gap might be useless if there are pre-existing issues, as 2.3.150 seemed to show, though.

This thread gives more info that further proves the issue with predictions and multi-threading: https://github.com/microsoft/LightGBM/issues/3675#issuecomment-759514268.

AlbertoEAF commented 3 years ago

I added the shared and unique locks under the assumption that const functions carry no risk of changing the object, while keeping unique locks for non-const functions. So from this point of view I do not see why it would be unsafe to run it in a multithreaded way; however, I am not so sure about the design and the robustness that was put into it from the beginning.

A method being const in C++ can still manipulate a lot of data. In LightGBM on top of that we do a lot of pointer manipulation so all guarantees of constness go out the door. We probably need to do a much more thorough review of the code.

AlbertoEAF commented 3 years ago

@tarunreddy1018 can you just reformat your original question to have the code block encased by a line starting with 3 backquotes and cpp (i.e. "```cpp"), followed by the code block and then terminated with a line with 3 backquotes: "```"?

Thanks :)

JoanFM commented 3 years ago

I added the shared and unique locks under the assumption that const functions carry no risk of changing the object, while keeping unique locks for non-const functions. So from this point of view I do not see why it would be unsafe to run it in a multithreaded way; however, I am not so sure about the design and the robustness that was put into it from the beginning.

I wonder if, with so many macros, pointer manipulations and shared variables, a method being const is any guarantee of constness? Most probably it's not, and the code needs to be reviewed more carefully.

I remember looking at the code and it seemed safe, but sure, I may have missed something.

tarunreddy1018 commented 3 years ago

@AlbertoEAF reformatted

AlbertoEAF commented 3 years ago

I remember looking at the code and it seemed safe, but sure, I may have missed something.

Yeah and from my point of view, the problems start before that even. I don't have time to look into this but at least @tarunreddy1018 's thread id observations might be a hint for someone to explore.

In the meantime I'd suggest that @tarunreddy1018 put a lock in your code, at least for now, if you have some urgency, as I don't think this can easily be fixed soon. It's still faster to use v3.1 than v2.3.1 due to other improvements behind the scenes and the new APIs such as the *Fast() prediction methods.

tarunreddy1018 commented 3 years ago

@AlbertoEAF So you would suggest that I change the c_api myself to add a lock in the predict call, build the library from that, and use that library, right? This should still give some performance improvement compared to 2.3.1.

tarunreddy1018 commented 3 years ago

@AlbertoEAF Sure, I will check with thread id thing as soon as I get some time.

AlbertoEAF commented 3 years ago

@AlbertoEAF So you would suggest that I change the c_api myself to add a lock in the predict call, build the library from that, and use that library, right? This should still give some performance improvement compared to 2.3.1.

@tarunreddy1018 I was suggesting doing the lock on the golang side around predicts, much easier :)

Of course you can always modify the code and re-compile LightGBM, but I don't think that will get you more performance than the solution above, and the solution above will let you later try out different versions with and without your lock, meaning that when a fixed version comes out you only need to remove your lock in golang ;)

tarunreddy1018 commented 3 years ago

Yes Thanks, Will keep things posted as soon as I find any observations. That should help @AlbertoEAF :)

JoanFM commented 3 years ago

I had a little closer look at the code, and it seems that this predict_buf_ is resized to the size of OMP_NUM_THREADS(); it would be interesting to know which value you get there.

My question is how this multithreading using OMP was being used if there was a lock guard protecting the c_api.

I understand you are calling this function, am I right?

  void Predict(int start_iteration, int num_iteration, int predict_type, int nrow, int ncol,
               std::function<std::vector<std::pair<int, double>>(int row_idx)> get_row_fun,
               const Config& config,
               double* out_result, int64_t* out_len) const {
    SHARED_LOCK(mutex_);
    auto predictor = CreatePredictor(start_iteration, num_iteration, predict_type, ncol, config);
    bool is_predict_leaf = false;
    bool predict_contrib = false;
    if (predict_type == C_API_PREDICT_LEAF_INDEX) {
      is_predict_leaf = true;
    } else if (predict_type == C_API_PREDICT_CONTRIB) {
      predict_contrib = true;
    }
    int64_t num_pred_in_one_row = boosting_->NumPredictOneRow(start_iteration, num_iteration, is_predict_leaf, predict_contrib);
    auto pred_fun = predictor.GetPredictFunction();
    OMP_INIT_EX();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrow; ++i) {
      OMP_LOOP_EX_BEGIN();
      auto one_row = get_row_fun(i);
      auto pred_wrt_ptr = out_result + static_cast<size_t>(num_pred_in_one_row) * i;
      pred_fun(one_row, pred_wrt_ptr);
      OMP_LOOP_EX_END();
    }
    OMP_THROW_EX();
    *out_len = num_pred_in_one_row * nrow;
  }

And the code snippet you shared is coming from the predictor. Since this predictor seems to live on the stack of every thread, I do not see why predict_buf_ would be affected by different threads.

I am sure I must be missing something ...

Could it be that somehow 'regular' multithreading and 'OpenMP' multithreading do not play well together?

I am sorry @tarunreddy1018, but in the end I did not implement and apply this feature in production after all.

JoanFM commented 3 years ago

Another observation I found:

It feels that what OpenMP is trying to do and what the unique/shared locks are trying to do are two different parallelization models.

OpenMP is trying to parallelize a single prediction as much as possible across multiple trees, thus I guess optimizing for the speed of a single prediction, while the shared/unique lock tries to enable using the same predictor from multiple threads, which may not optimize a single prediction but could improve throughput.

Could it be a hint of the possible problem?

tarunreddy1018 commented 3 years ago

Hi @JoanFM , the methods that we have been trying are LGBM_BoosterPredictForMatSingleRow and LGBM_BoosterPredictForMatSingleRowFast, and it looks like these two methods do not create a new Predictor object for each thread. Rather, they take a "unique_lock" and only let one thread create a new Predictor object, and all the other threads then use this Predictor object once it is created. This in turn makes the predict_buf_ vector living inside this Predictor object shared by all the other threads, which makes it unsafe. Moreover, the size of this predict_buf_ is set to OMP_NUM_THREADS(), which I have checked is equal to the num_threads parameter value that we pass in the config when making the predict call. This is strange, because if the size of predict_buf_ equals OMP_NUM_THREADS(), it means the buffer is meant to parallelize a single row prediction, not to be shared across multiple predictions, right?

Based on your observation it seems correct that each thread creates a new Predictor object, as seen in this code: CreatePredictor(start_iteration, num_iteration, predict_type, ncol, config);, and as you said this predictor seems to live on the stack of every thread. But the Predict function that you posted above is not used by the two predict methods I mentioned above. This makes me believe that somehow the Predictor object is not unique per call but rather shared, and this leads to the thread-safety problem.

This is the call stack showing how the Predictor object is created by the two methods mentioned above:

fastConfig_ptr->booster->SetSingleRowPredictor(start_iteration, num_iteration, predict_type, fastConfig_ptr->config);  (called inside LGBM_BoosterPredictForMatSingleRow)

single_row_predictor_[predict_type].reset(new SingleRowPredictor(predict_type, boosting_.get(), config, start_iteration, num_iteration));  (called inside SetSingleRowPredictor)

predictor_.reset(new Predictor(boosting, start_iter, iter_, is_raw_score, is_predict_leaf, predict_contrib, early_stop_, early_stop_freq_, early_stop_margin_));  (called inside the constructor of SingleRowPredictor)

void SetSingleRowPredictor(int start_iteration, int num_iteration, int predict_type, const Config& config) {
      UNIQUE_LOCK(mutex_)
      if (single_row_predictor_[predict_type].get() == nullptr ||
          !single_row_predictor_[predict_type]->IsPredictorEqual(config, num_iteration, boosting_.get())) {
        single_row_predictor_[predict_type].reset(new SingleRowPredictor(predict_type, boosting_.get(),
                                                                         config, start_iteration, num_iteration));
      }
  }

This clearly shows that the new SingleRowPredictor is created only once, because in the code it is only created if it has not been created before.

This makes all threads, across different prediction calls, use the same Predictor object, which in turn makes predict_buf_ unsafe.
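
To make the consequence concrete, here is a hypothetical, LightGBM-free reproduction of the kind of race being described: two caller threads share one buffer slot (as if both resolved tid to 0), and reads occasionally observe the other thread's writes:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  std::vector<double> shared_buf(4, 0.0);   // stands in for predict_buf_[0]
  std::atomic<long> mismatches{0};

  auto fake_predict = [&](double value) {
    for (auto& x : shared_buf) x = value;          // like CopyToPredictBuffer
    double sum = 0.0;
    for (double x : shared_buf) sum += x;          // the "prediction" reads the buffer
    if (sum != value * shared_buf.size()) ++mismatches;
    for (auto& x : shared_buf) x = 0.0;            // like ClearPredictBuffer
  };

  std::thread t1([&] { for (int i = 0; i < 100000; ++i) fake_predict(1.0); });
  std::thread t2([&] { for (int i = 0; i < 100000; ++i) fake_predict(2.0); });
  t1.join();
  t2.join();

  // With a shared buffer this count is routinely non-zero; give each thread
  // its own buffer (or serialize the calls) and it drops to zero.
  std::printf("inconsistent reads: %ld\n", mismatches.load());
  return 0;
}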

JoanFM commented 3 years ago

So maybe the problem is that some of the prediction functions can be thread safe and others can't.

I do think that the OMP num_threads is supposed to apply to a single prediction, but I don't have much expertise on that.

tarunreddy1018 commented 3 years ago

I was thinking: can we use the same logic as in the other predict functions, which are thread-safe? That is, can the other 2 predict functions also create a new Predictor object for each call, as the others do? I believe this would make those two thread safe as well. Right now the state of those 2 predict functions is such that we can only call one at a time. I am not completely sure whether this is a bug or the intent from the beginning, because if the other predict functions create a new Predictor object per call, why wouldn't these 2 do the same?

JoanFM commented 3 years ago

I would agree, but I am not sure whether allocating the buffer on each call might degrade single-threaded performance too much. But anyhow it would benefit from more consistency.

But like you, I am not sure whether there was some specific intention behind it.

AlbertoEAF commented 3 years ago

Because if the other predict functions create a new Predictor object per call, why wouldn't these 2 do the same?

Which ones create a Predictor every time? The versions that are not "SingleRow"? If so it makes sense, as creating a Predictor per single prediction can generate a large overhead. Remember that in low-latency scenarios memory allocation is one of the biggest costs.

I don't mean it can't be tried, but it would have to be benchmarked, including the single-thread scenario. Maybe it's really the case that the SingleRow versions are not meant for thread-safety, as it would require too many memory allocations per unit of work.

@guolinke can we have your thoughts on the matter or of someone who has dealt with this in the past?

tarunreddy1018 commented 3 years ago

@AlbertoEAF yes, for example LGBM_BoosterPredictForMats, LGBM_BoosterPredictForCSR and LGBM_BoosterPredictForMat all make a call to this line:

auto predictor = CreatePredictor(start_iteration, num_iteration, predict_type, ncol, config);

This makes me suspect that all these predict functions create a new Predictor object per call. This Predictor object has the predict_buf_ vector of vectors as one of its fields, and each call to these predict functions parallelizes the work among multiple OMP threads, where the number of threads is specified by the num_threads parameter that we pass when making the predict call.

What I believe is that these methods are meant to be used for larger datasets with many rows of input. Let's say we specify 8 threads: the call creates 8 OMP threads and all the rows are divided across these 8 threads to speed up the computation.

So the performance gain we get by dividing a large set of rows across a limited number of threads is larger than calling LGBM_BoosterPredictForMatSingleRow for each row individually and creating a Predictor object for each call. So I believe the single row methods were optimized with the assumption that each row is scored independently, one after the other, rather than in a multi-threaded way.

Also, I think that if we want LGBM_BoosterPredictForMatSingleRow to create a Predictor object for each call in order to make it thread safe, we might as well use LGBM_BoosterPredictForMat for single row prediction with nrow = 1. In fact the single row predict functions are just a special case of the Mat predict function where we have only one row, but they are optimized for single row predictions.
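
A hedged sketch of that alternative: calling the batch entry point with a single row, so each call builds its own Predictor (assuming the 3.1.x LGBM_BoosterPredictForMat signature with start_iteration, a single-output model, and error handling omitted):

#include <cstdint>
#include <vector>
#include <LightGBM/c_api.h>

// Per-call prediction via the batch API with nrow = 1; safe to invoke from
// several caller threads at once since each call creates its own Predictor.
double PredictOneRowViaMat(BoosterHandle booster, const std::vector<double>& features) {
  double score = 0.0;
  int64_t out_len = 0;
  LGBM_BoosterPredictForMat(
      booster, features.data(), C_API_DTYPE_FLOAT64,
      /*nrow=*/1, /*ncol=*/static_cast<int32_t>(features.size()),
      /*is_row_major=*/1, C_API_PREDICT_NORMAL,
      /*start_iteration=*/0, /*num_iteration=*/-1,
      "num_threads=1", &out_len, &score);
  return score;
}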

tarunreddy1018 commented 3 years ago

Hi, I have just verified that the predictions are consistent and as expected in a multi-threaded setting with the predict functions other than LGBM_BoosterPredictForMatSingleRow and LGBM_BoosterPredictForMatSingleRowFast. So my guess was right: these single row predict functions are not thread safe, since all calls share the same Predictor object.

AlbertoEAF commented 3 years ago

Yes exactly @tarunreddy1018 :)

Also, I think that if we want LGBM_BoosterPredictForMatSingleRow to create a Predictor object for each call in order to make it thread safe, we might as well use LGBM_BoosterPredictForMat for single row prediction with nrow = 1. In fact the single row predict functions are just a special case of the Mat predict function where we have only one row, but they are optimized for single row predictions.

True you can always use the "batch" prediction methods with a single instance if you want to do parallel calls ;). And yes, the single instance versions are optimized variants that minimize allocation overheads.

I think it would be nice to have confirmation from @guolinke, @StrikerRUS, or another of the main devs that this is not a bug. If it's really the case that these functions were never meant to be thread-safe, we should add that to the C API documentation, as you're not the first person to hit this issue ;)

JoanFM commented 3 years ago

Hi, I have just verified that the predictions are consistent and as expected in a multi-threaded setting with the predict functions other than LGBM_BoosterPredictForMatSingleRow and LGBM_BoosterPredictForMatSingleRowFast. So my guess was right: these single row predict functions are not thread safe, since all calls share the same Predictor object.

I think the first action should be to add UNIQUE_LOCK to the unsafe ones and add some documentation capturing the learning from this thread?

AlbertoEAF commented 3 years ago

I think the first action should be to add UNIQUE_LOCK to the unsafe ones and add some documentation capturing the learning from this thread?

If you guys can make a PR with the unique lock and test that it no longer creates random outputs, that would be great ;)

Just take care to do it at the innermost scope possible to minimize parallel contention so we get the best performance ;)

What I mean is that LGBM_BoosterPredictForMatSingleRowFast probably doesn't need a lock, but its inner calls, which are shared with LGBM_BoosterPredictForMatSingleRow, do @JoanFM :)

I would also be interested in using that version so I could remove locks from my external provider code, so if you do open a PR please ping me there ;).

JoanFM commented 3 years ago

I think the first action should be to add UNIQUE_LOCK to the unsafe ones and add some documentation capturing the learning from this thread?

If you guys can make a PR with the unique lock and test that it no longer creates random outputs, that would be great ;)

Just take care to do it at the innermost scope possible to minimize parallel contention so we get the best performance ;)

What I mean is that LGBM_BoosterPredictForMatSingleRowFast probably doesn't need a lock, but its inner calls, which are shared with LGBM_BoosterPredictForMatSingleRow, do @JoanFM :)

I would also be interested in using that version so I could remove locks from my external provider code, so if you do open a PR please ping me there ;).

I am very short on available time, to be honest. @tarunreddy1018, since you now have the setup for the test, maybe you would be able to test that?

tarunreddy1018 commented 3 years ago

@JoanFM Yes, I can put a "unique_lock" at the beginning of the predict call, build the code, test whether it makes things thread safe, and report back here. Regarding performance I may not be able to do a full evaluation, since my setup goes through golang and may not be a true benchmark, but I can definitely test whether having a unique lock makes these two predict functions thread safe. If yes, then I believe this change can be added in an upcoming update whenever you get time to work on it. I believe this should not be a huge change; it just needs to be verified.

StrikerRUS commented 3 years ago

@AlbertoEAF

I think it would be nice to have confirmation from @guolinke, @StrikerRUS, or another of the main devs that this is not a bug.

Absolutely agree! But I don't have any serious experience with cpp code, sorry. So I believe only @guolinke @chivee @shiyu1994 @btrotta can help.

AlbertoEAF commented 3 years ago

Hello @tarunreddy1018, I did a small C++ benchmark with 2 threads, 2 different input events and 500k instances each, and I can confirm your and @JoanFM 's observations. I changed the shared lock to a unique lock and confirmed that the scores on the C++ side stop having errors. Given that there is a lock there, this can only be classified as a bug, so I can submit a patch.

Even so, I am getting performance issues in C++; more threads take more time to execute, most likely due to lock contention:

  1 thread ~ 41s
  2 threads ~ 47s
  4 threads ~ 1m

I'll now try the same benchmark (scoring 1M events) but with the Fast variant and see how timings go, as I believe there is more possibility of optimization there, even at the level of PredictSingleRow call by changing order of statements and when locks are acquired.

I wonder if our locking library is performant, though, given the above timings. It might be excusable due to very small work units (I'm using a very basic LGBM model with only 4 features). If I had a more complex model, more threads would probably give more gains. Hard to say without more complex workloads.

AlbertoEAF commented 3 years ago

Yup, with the Fast variant I get the same pattern.

I changed the parallelization of the benchmark so each thread would fill N consecutive positions of the output array to minimize cache line collisions, but in the end I still get the same behaviour: worse performance with more threads.

@tarunreddy1018 maybe you can test later on your own with 1 thread or more and report back if you do see speed-ups for your implementation (as this workload is probably too light) ;)

I'll open a PR to switch to the unique_lock though as it gives very similar timings to the shared lock and it scores properly :)

tarunreddy1018 commented 3 years ago

Hi, @AlbertoEAF , are all your benchmarks based on putting a unique_lock in the predict function?

I had 2 suggestions that we can try:

  1. Either have a unique_lock in the predict function (giving up on performance, but good in terms of allocation overhead).
  2. Or still have a shared_lock, but have each call use its own vector, thus avoiding race conditions (better performance, but a slight memory overhead).

I believe the 2nd approach is good because, as per the c_api code, RowFunctionFromDenseMatric_helper creates a new vector for each call anyhow, and then RowPairFunctionFromDenseMatric creates a vector of pairs from it, which is what gets passed to the Predictor object. I believe that instead of getting a vector from predict_buf_ inside the Predictor object, into which we copy our feature values, we can afford to create a new vector. Or can we skip the intermediate step of building a vector of pairs in RowPairFunctionFromDenseMatric and directly use the vector returned by RowFunctionFromDenseMatric_helper? I could not find any use for that step, so I believe we can do without it.

These are just my thoughts, because the performance boost that we get by having a shared lock and making it thread safe looks more promising.

template<typename T>
std::function<std::vector<double>(int row_idx)>
RowFunctionFromDenseMatric_helper(const void* data, int num_row, int num_col, int is_row_major) {
  const T* data_ptr = reinterpret_cast<const T*>(data);
  if (is_row_major) {
    return [=] (int row_idx) {
      std::vector<double> ret(num_col);
      auto tmp_ptr = data_ptr + static_cast<size_t>(num_col) * row_idx;
      for (int i = 0; i < num_col; ++i) {
        ret[i] = static_cast<double>(*(tmp_ptr + i));
      }
      return ret;
    };
  } else {
    return [=] (int row_idx) {
      std::vector<double> ret(num_col);
      for (int i = 0; i < num_col; ++i) {
        ret[i] = static_cast<double>(*(data_ptr + static_cast<size_t>(num_row) * i + row_idx));
      }
      return ret;
    };
  }
}
std::function<std::vector<std::pair<int, double>>(int row_idx)>
RowPairFunctionFromDenseMatric(const void* data, int num_row, int num_col, int data_type, int is_row_major) {
  auto inner_function = RowFunctionFromDenseMatric(data, num_row, num_col, data_type, is_row_major);
  if (inner_function != nullptr) {
    return [inner_function] (int row_idx) {
      auto raw_values = inner_function(row_idx);
      std::vector<std::pair<int, double>> ret;
      ret.reserve(raw_values.size());
      for (int i = 0; i < static_cast<int>(raw_values.size()); ++i) {
        if (std::fabs(raw_values[i]) > kZeroThreshold || std::isnan(raw_values[i])) {
          ret.emplace_back(i, raw_values[i]);
        }
      }
      return ret;
    };
  }
  return nullptr;
}
JoanFM commented 3 years ago

Maybe the best way is to add a unique lock for now as a quick patch, and then maybe we can add a set of functions for multithreaded usage? Or would that be too much maintenance overhead?

tarunreddy1018 commented 3 years ago

@JoanFM Yes, agree. We can make it a shared_lock, or maybe, as you said, have a separate function for multi-threaded usage, if doing so gives good performance sometime later. For now I will try the shared_lock approach with a new vector for each call and test it for my use case. If it looks good I will report back later, so that some thought can be given to that direction.

AlbertoEAF commented 3 years ago

Hi @tarunreddy1018 @JoanFM :) As you can see from the PR https://github.com/microsoft/LightGBM/pull/3771 description and the benchmark code the change to unique lock at the PredictSingleRow level is enough to make it fully thread-safe.

Notice that neither RowFunctionFromDenseMatric_helper nor RowPairFunctionFromDenseMatric require locks as they allocate new memory without sharing internal variables.

The lock is only at the PredictSingleRow call and I even tested putting the lock for the Fast case only before the line https://github.com/microsoft/LightGBM/blob/master/src/c_api.cpp#L398 as it is guaranteed it calls a different function (get_row_fun), but the timing was the same.

So a shared_lock wouldn't help us here, unless we go down one extra level of lock granularity, but that is already pretty tricky, as we have virtual functions etc., and in terms of maintenance it would be pretty hard to keep right.

Does it make sense? Or did I miss something in your points?


Regarding @tarunreddy1018 's suggestion of combining RowFunctionFromDenseMatric_helper + RowPairFunctionFromDenseMatric: it could make sense if we really have a useless intermediate representation (I didn't look at it yet).

tarunreddy1018 commented 3 years ago

So a shared_lock wouldn't help us here, unless we go down one extra level of lock granularity, but that is already pretty tricky, as we have virtual functions etc., and in terms of maintenance it would be pretty hard to keep right.

@AlbertoEAF In the above statement, you meant that the performance you got with the shared lock and the unique lock is almost the same, right? So the shared lock might not help here, right?

JoanFM commented 3 years ago

I think that the measure that needs to be counted is the number of predictions we can do per unit of time, not the speed of a single prediction, right?

AlbertoEAF commented 3 years ago

@AlbertoEAF In the above statement, you meant to say that the performance gains that you got with shared lock and unique lock are almost similar right. So, shared lock might not help here right?

I meant doing locking only inside the predictor exactly where needed in the inner calls of SingleRowPredict, so every statement that is parallelizable could be parallelized. But yes, I also got the same execution time with unique and shared locks on that PR benchmark (though the shared locks produced wrong scores obviously).

I think that the measure that needs to be counted is the number of predictions we can do per unit of time, not the speed of a single prediction, right?

Exactly; since I do 3M predictions, if it had a higher throughput the program would finish sooner, so measuring total wall time is what we want here.

JoanFM commented 3 years ago

Ok I see @AlbertoEAF thanks!

tarunreddy1018 commented 3 years ago

Thanks @AlbertoEAF @JoanFM

AlbertoEAF commented 3 years ago

Hi again @tarunreddy1018 :)

I tried your idea of merging those two functions. Approach 1 was the laziest: turning the ..._helper() into a function that returns a function (basically computing the offset from the data pointer). This allows us to have a single representation (pairs of (int, value)): https://github.com/microsoft/LightGBM/compare/706f2af7badc26f6ec68729469ec6ec79a66d802...feedzai:single-row-predict-fix-lock-faster-RowPairFunctionFromDenseMatric2

You can check the code above.

I got exactly the same timings as with the two intermediate vectors and the conversion from the 1st vector to the pairs vector.

I don't think there's much chance of improvement here anymore, the bottleneck is somewhere else.

In fact, when I was designing the Fast() API, initially you could only use a fixed data pointer for prediction, set at the ...FastInit() call, and you replaced the *data values in place. Even then, that alternative, which didn't allocate memory per call, ran with very little performance difference from the current version, which allows using different data addresses per prediction (< 5% execution time savings).


True @tarunreddy1018 , we can still try the other way around, getting rid of the pairs by changing code inside the Predictor but I'm afraid of that because that changes a lot of code, and I don't want to go into all that hassle :) But if you're up for it I can help you get the benchmark going and you can try changing it yourself ;)

AlbertoEAF commented 3 years ago

Ok, I just profiled the benchmark, and it's normal that those changes have little or no effect: all the time is spent in internal calls already. Taking the part of the code with the C API predict call LGBM_BoosterPredictForMatSingleRowFast as 100% of execution time, the Predict method accounted for >97% of it.

I used valgrind, so I am not accounting for the mutex contention time, which we've seen is not negligible as we increase the number of threads.

AlbertoEAF commented 3 years ago

Ok, after reading the code over and over I can say 2 things:

  1. The code is highly coupled in the back-end between Booster & Predictor. There's no sane way of changing that. So, if you want full multi-threaded performance you should instead REMOVE the unique lock from the single-row predictor (thread-unsafe) and build, on your side, a thread pool with N threads and 1 Booster and FastConfigHandle PER thread (see the sketch after this list). That guarantees you 100% parallelism :) The biggest overhead is the memory of the duplicated booster, which honestly is not critical on modern machines, as it's usually < 10MB.
  2. If your features are not sparse, we pay 3 data copies per prediction on MatSingleRow: 1 -> from the input dataset "row" to a vector, 2 -> to pairs (int, value), 3 -> back to the buffer/vector inside the Predictor (the Predictor's buffer). Indeed, it could be optimized.
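
A minimal sketch of option 1, assuming one Booster per worker thread loaded from the same model file (the file name "model.txt" is hypothetical, the 3.1.x single-row signature with start_iteration is assumed, error handling is omitted, and the Fast*Init/FastConfigHandle variants would be set up per thread in the same way):

#include <cstdint>
#include <thread>
#include <vector>
#include <LightGBM/c_api.h>

constexpr int kThreads = 4;       // size of the caller-side thread pool
constexpr int kNumFeatures = 4;   // toy model with 4 features (hypothetical)

int main() {
  std::vector<std::thread> workers;
  for (int t = 0; t < kThreads; ++t) {
    workers.emplace_back([] {
      // Each worker owns its own Booster, so no lock is needed around predictions.
      BoosterHandle booster = nullptr;
      int num_iterations = 0;
      LGBM_BoosterCreateFromModelfile("model.txt", &num_iterations, &booster);

      std::vector<double> row(kNumFeatures, 0.5);
      double score = 0.0;
      int64_t out_len = 0;
      for (int i = 0; i < 100000; ++i) {
        LGBM_BoosterPredictForMatSingleRow(
            booster, row.data(), C_API_DTYPE_FLOAT64, kNumFeatures,
            /*is_row_major=*/1, C_API_PREDICT_NORMAL,
            /*start_iteration=*/0, /*num_iteration=*/-1,
            "num_threads=1", &out_len, &score);
      }
      LGBM_BoosterFree(booster);
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}
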
tarunreddy1018 commented 3 years ago

@AlbertoEAF Right, I agree with that. I will try out as you suggested. Thanks for the information. I guess this is fine for now.

tarunreddy1018 commented 3 years ago

@AlbertoEAF Hi, I have tried LGBM_BoosterPredictForMatSingleRowFast with a separate vector for each call, instead of using the same vector from predict_buf_ for all calls. This made sure that my predictions were consistent and correct.

And by doing so, the performance is really great for us. Creating one extra vector for each call did not have much memory overhead either. This was the only change I had to make; everything else, the SingleRowPredictor, Predictor and FastConfigHandle objects, remained the same, with just 1 object for all the calls. It's just that all the parallel calls to LGBM_BoosterPredictForMatSingleRowFast use their own separate vector for processing.

This is the code that I tweaked in predictor.hpp:

Before Change:

predict_fun_ = [=](const std::vector<std::pair<int, double>>& features,
                           double* output) {
          int tid = omp_get_thread_num();
          if (num_feature_ > kFeatureThreshold &&
              features.size() < KSparseThreshold) {
            auto buf = CopyToPredictMap(features);
            boosting_->PredictByMap(buf, output, &early_stop_);
          } else {
            CopyToPredictBuffer(predict_buf_[tid].data(), features);
            boosting_->Predict(predict_buf_[tid].data(), output, &early_stop_);
            ClearPredictBuffer(predict_buf_[tid].data(),
                               predict_buf_[tid].size(), features);
          }
        };

After Change:

predict_fun_ = [=](const std::vector<std::pair<int, double>>& features,
                           double* output) {
          if (num_feature_ > kFeatureThreshold &&
              features.size() < KSparseThreshold) {
            auto buf = CopyToPredictMap(features);
            boosting_->PredictByMap(buf, output, &early_stop_);
          } else {
            std::vector<double> predict_buf_;
            predict_buf_.resize(num_feature_, 0.0f);
            CopyToPredictBuffer(predict_buf_.data(), features);
            boosting_->Predict(predict_buf_.data(), output, &early_stop_);
            //Removes all elements in vector
            predict_buf_.clear();
            //Frees the memory which is not used by the vector
            predict_buf_.shrink_to_fit();
          }
        };

Probably we can try this approach: it does not alter any other parts of the code, is a minimal change, and gives good performance without much memory overhead.

Each of these parallel calls uses num_threads=1.