Hello @tarunreddy1018,
First of all, interesting that you only need a new predict_buf_ and can share everything else; that might be useful :)
But did you try using the regular implementation with the lock and without setting num_threads=1?
I ask this because we might have different "optimal" implementations & use cases. For instance, if the model is not trivially simple, unless I'm mistaken a single prediction is already parallelized internally, so the comparison is between running several predictions in parallel with 1 thread each versus running 1 prediction at a time with multiple threads. I'm not sure which is better; it probably varies with the number of trees, the average tree depth, and the number of features.
In this implementation you're removing that opportunity for parallelism inside the predict call, right?
Also take into account that LightGBM is actually optimized for batch prediction, which relies on this same code, so this might degrade its performance and probably won't be acceptable to the core devs.
By the way, due to C++'s RAII, predict_buf_ will be deleted automatically after the closing brace, so there is no need to call clear & shrink_to_fit. ;)
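To illustrate the RAII point, here is a minimal sketch; the function and buffer names are made up for illustration and are not the actual LightGBM code:

```cpp
#include <vector>

void predict_single_row(const double* features, int num_features) {
  {
    // Local buffer: an illustrative stand-in for a per-call predict buffer.
    std::vector<double> buf(features, features + num_features);
    // ... run the prediction reading from buf ...
  }  // buf is destroyed here (RAII); no explicit clear() / shrink_to_fit() needed.
}
```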
I agree with @AlbertoEAF. I think these new predict functions are suited to running multiple predictions on different threads, while OpenMP tries to run one prediction with multiple threads.
Maybe these 2 distinctions can be made when choosing which predict function to use, but the maintenance overhead may be a little too much.
@AlbertoEAF yes, I agree with that. In my use case setting num_threads=1 happened to work well, and my change is biased towards that: it removes the possibility of parallelizing individual predictions with OpenMP, which might not be ideal for other scenarios. So the change is not general and is inclined towards one specific case.
@tarunreddy1018 can we close this issue?
If you want to see improvements regarding threading, maybe open a new issue (a feature request for threading support with no locking) and mention this issue and https://github.com/microsoft/LightGBM/issues/3675 so we keep all the details we discussed.
@AlbertoEAF Sure, you can close this issue for now. Thanks
@tarunreddy1018 I don't have permissions, must be you closing it :)
Fixed via #3771.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
How we are using LightGBM:
We are using the LightGBM c_api in our service. The c_api is called from Go: we have a Go wrapper around the c_api and call its functions via cgo.
We get the "lib_lightgbm.so" library file from the GitHub release section and use it.
Version of LightGBM being used:
3.1.1 (Observed with all versions >= 3.0.0)
LightGBM component:
C++ Library
Environment info:
Operating System: Observed on both Linux and MacOS
Architecture: x86_64
CPU model: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
C++ compiler version: gcc version 7.2.0 (Debian 7.2.0-1)
CMake version: 3.9.0
GLIBC version: ldd (Debian GLIBC 2.28-10) 2.28
Context:
Starting with LightGBM 3.0.0, the predict functions were updated so that only the parts that modify shared state take a "unique_lock", while the parts that just do computation take a "shared_lock".
This is unlike previous versions such as 2.3.1, where a "lock_guard" was taken at the very beginning of the predict function itself, so only one thread could execute the predict function at a time. That did not scale well in production environments where we run many predictions in parallel.
So we decided to upgrade to the newer LightGBM 3.1.1, which really improved performance, since multiple threads can now execute the predict function in parallel thanks to the "shared_lock".
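For context, here is a minimal sketch of the two locking patterns being contrasted; the mutex and function names are illustrative, not LightGBM's actual code:

```cpp
#include <mutex>
#include <shared_mutex>

std::shared_mutex mtx;  // illustrative, not the actual LightGBM member name

// 2.3.1-style: an exclusive lock held for the whole call,
// so only one thread can be inside predict() at a time.
void predict_exclusive() {
  std::lock_guard<std::shared_mutex> guard(mtx);
  // ... prepare buffers and compute the prediction ...
}

// 3.0.0-style: an exclusive lock only around code that mutates shared state,
// and a shared (reader) lock around the pure computation, so multiple
// threads can run predictions concurrently.
void predict_concurrent() {
  {
    std::unique_lock<std::shared_mutex> writer(mtx);
    // ... modify shared state, e.g. (re)size internal buffers ...
  }
  std::shared_lock<std::shared_mutex> reader(mtx);
  // ... compute the prediction without modifying shared state ...
}
```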
Issue:
However, we then saw that the predictions were not consistent: with many threads invoking the predict function in parallel at large scale, sending the same input to the service multiple times returns different predictions on different calls.
This was not an issue with the earlier LightGBM version we used (2.3.1); all predictions were consistent there, probably because of the lock it held in the predict function, which would have hidden this issue.
To test this quickly, we put a lock around the call to the predict function, and then all predictions were consistent and as expected.
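As an illustration, the workaround amounts to something like the following, shown in C++ rather than in our Go wrapper; the names and the commented-out predict call are placeholders, not the actual code we use:

```cpp
#include <mutex>

std::mutex predict_mutex;  // serializes every predict call (workaround only)

// Hypothetical wrapper: take a global lock before the C API predict call.
void predict_serialized(/* model handle, input row, output buffer */) {
  std::lock_guard<std::mutex> guard(predict_mutex);
  // LGBM_BoosterPredictFor...(...);  // placeholder for the actual C API call
}
```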
So we suspected either a race condition in the code or that LightGBM was not meant to run predictions in parallel. To investigate further, we looked into the c_api predict path and found that it keeps a vector of vectors: each thread picks its own vector by thread id (tid), copies the input data into it, and passes it to the predict function. Once the predict call completes, the vector is cleared.
Definition of Vector of Vectors:
std::vector<std::vector<double, Common::AlignmentAllocator<double, kAlignedSize>>> predict_buf_
Code where Vector of Vectors is used:
So we suspected the problem could be coming from this vector of vectors, and we tried logging the tid to a file. Surprisingly, every call logged a tid of 0.
That means all the threads were using the same vector (the one at index 0), which leads to a race condition.
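To make the suspected failure mode concrete, here is a simplified sketch, under the assumption that the per-thread buffer is selected by the OpenMP thread id; the names are illustrative, not the exact LightGBM code:

```cpp
#include <omp.h>
#include <vector>

// One scratch buffer per OpenMP thread (simplified illustration).
std::vector<std::vector<double>> predict_buf(omp_get_max_threads());

void predict_row(const double* features, int num_features) {
  const int tid = omp_get_thread_num();  // 0 when called outside an OpenMP parallel region
  predict_buf[tid].assign(features, features + num_features);
  // ... run the prediction reading from predict_buf[tid] ...
  // If every application thread observes tid == 0, they all share
  // predict_buf[0] and overwrite each other's input: a data race.
}
```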
Could you please let us know whether the latest update was not meant to support parallel predictions, or whether this could be an issue with our Go code calling the c_api (probably related to OpenMP)?