guolinke opened this issue 4 years ago
There’s a reference to minimum variance sampling here:
https://catboost.ai/docs/concepts/algorithm-main-stages_bootstrap-options.html
Although I think it just speeds up training rather than providing out-of-core training.
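For reference, a minimal sketch of how MVS is selected in CatBoost via the bootstrap options described at the linked page. `bootstrap_type` and `subsample` are CatBoost parameter names from that documentation; the subsample value is purely illustrative.

```python
# Parameter dict selecting CatBoost's minimum variance sampling (MVS) bootstrap.
# "bootstrap_type" and "subsample" come from the linked CatBoost docs;
# the subsample fraction below is an illustrative placeholder.
params = {
    "bootstrap_type": "MVS",  # sample rows by minimum variance sampling
    "subsample": 0.8,         # fraction of objects sampled per tree
}
```

As the comment above notes, this controls row sampling during training (a speed/quality trade-off) rather than enabling out-of-core training.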
I would like to tackle the following issues in the Python package. Could we discuss a plan to fix them? Also, where can we discuss that? IMHO, they could be resolved by improving the lightgbm.cv() function.
I want to reopen the above issues, but I cannot do that; maybe I don't have permission.
@momijiame Thank you for your interest! I've unlocked those issues for commenting. Let's continue the discussion there.
We would like to call a vote here to prioritize these requests. If a feature request is very important to you, you can vote for it by the following process:
Let me start.
It was proposed by me, so I'm a little bit biased.
Decouple boosting types #3128
GPU binaries release #2263
Enhance parameter tuning guide with more params #2617
Subsampling rows with replacement #1038
Piece-wise linear tree #1315 (also see PR https://github.com/microsoft/LightGBM/pull/3299)
Multi-output regression #524
Cox Proportional Hazard Regression #1837
Based on https://github.com/microsoft/LightGBM/issues/2983#issuecomment-722630931, I've updated this issue's description:
Note to maintainers: All feature requests should be consolidated on this page. When new feature request issues are opened, close them and create new entries here, with links back to the original issues. The one exception is issues marked `good first issue`; these should be left open so they are discoverable by new contributors.
I think that we should keep `good first issue` issues open, so it's easy for new contributors to find them.
Read from multiple files #2031
Parquet file support #1286
Register custom objective / loss function #3244
Object importance #1460
Read from multiple zipped LibSVM-format text files
Multiple GPU support (#620) (From my experience, XGBoost with GPU seems faster than LightGBM with GPU.)
For everyone who was voting for multi-gpu support, please try our new experimental CUDA version which was kindly contributed by our friends from IBM. This version supports multi-GPU training. We will really appreciate any early feedback on this experimental feature (please create new issues, do not comment here).
How to install: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version-experimental.
Argument to specify number of GPUs: https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_gpu.
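A minimal sketch of what multi-GPU training parameters could look like with the experimental CUDA version. `device_type` and `num_gpu` are the parameter names from the docs linked above; the objective and GPU count here are illustrative placeholders, not recommendations.

```python
# Illustrative parameters for the experimental CUDA tree learner.
# "device_type" and "num_gpu" are documented LightGBM parameters;
# the remaining values are placeholders for this sketch.
params = {
    "objective": "binary",
    "device_type": "cuda",  # select the experimental CUDA implementation
    "num_gpu": 2,           # number of GPUs to use (CUDA version only)
}
```

With a `lightgbm.Dataset` in hand, a dict like this would be passed to `lightgbm.train()` as usual.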
Support ignoring some features during training on constructed dataset #4317
Spike and slab feature sampling priors (feature weighted sampling) #2542
Quantile LightGBM: ensure monotonic #3447
SHAP feature contribution for linear trees #4002
Create dataset from pyarrow tables: #3369
Add support for CRLF line endings or improve documentation and error message #5508
Add parameter to control maximum group size for Lambdarank #5053
Allow training without loading full dataset into memory #5094
Support different data types (when load data from Python) #3459
Add support for early stopping in Dask interface #3712
Add Earth Mover Distance as objective metric to be optimized (maximized) #1256
Apache Arrow seems to be gaining a lot of traction in the dataframe space. We use polars, and it would be great to be able to directly create a dataset from the Arrow format. Also, pandas 2.0 will have Arrow as a backend later this month.
Conan installation support #5770
Add support for Multi-output regression #524
Provide access to the bin ids and bin upper bounds of the constructed dataset #5191
Consider implementation of the SketchBoost algorithm for the multi-output/multiclass setting. The current multiclass approach is highly inefficient, as a separate tree structure is required for each class. SketchBoost significantly improves training time and model size by allowing a single tree structure to handle many classes.
This is already implemented in the Py-Boost library.
I am currently working on Apache Arrow support and will likely open a PR next week :)
Update: Implementation in https://github.com/microsoft/LightGBM/pull/6022
WebAssembly support (https://github.com/microsoft/LightGBM/issues/5372)
Support monotone constraints with quantile objective #3371
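To make the request in #3371 concrete, here is a sketch of the parameter combination it asks for, using existing LightGBM parameter names (`objective`, `alpha`, `monotone_constraints`); the point of the issue is that this combination is not yet properly supported under the quantile objective.

```python
# Hypothetical parameter combination requested in #3371: monotone
# constraints enforced while optimizing the quantile objective.
# All names are existing LightGBM parameters; the request is for this
# combination to be honored.
params = {
    "objective": "quantile",
    "alpha": 0.9,                        # target quantile (here, the 90th percentile)
    "monotone_constraints": [1, 0, -1],  # per-feature: +1 increasing, 0 none, -1 decreasing
}
```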
Recalculate feature importance during the update process of a tree model / Calculate Gain Importance on Test Data (#2413)
Add R-package support for an early-stopping `min_delta`, as implemented in Python (#4580) and referenced in #2526.
This issue is to maintain all feature requests on one page.
Note to contributors: If you want to work on a requested feature, re-open the linked issue. Everyone is welcome to work on any of the issues below.
Note to maintainers: All feature requests should be consolidated on this page. When new feature request issues are opened, close them and create new entries here, with links back to the original issues. The one exception is issues marked `good first issue`; these should be left open so they are discoverable by new contributors.

Call for Voting

We would like to call a vote here to prioritize these requests. If a feature request is very important to you, you can vote for it by the following process:
Discussions

Efficiency related

Effectiveness related

label (#4483)

Distributed platform and GPU (OpenCL-based and CUDA)

Maintenance

`CMakeLists.txt` so that it will be possible to build cpp tests with different options, e.g. with OpenMP support (#4125)
`LGBM_BoosterDumpModel` and `LGBM_BoosterSaveModel` (#2604)
`lib_lightgbm.dll` symbols to Microsoft Symbols Server (#1725)

Python package:

(`HistGradientBoosting`) (#2966, #2628)
`staged_predict()` in the scikit-learn API (#5031)
`Dataset` pickleable (#5098)
`polars` input (#6204)
`feature_names_in_` and related APIs to scikit-learn estimators (#6279)
`parametrize_with_checks` for scikit-learn integration tests (#2947)

R package:

`lgb.convert_with_rules()` should validate rules (#2682)
`save_model` to Booster object (#2613)
`rchk` (#4400)
`commandArgs` instead of hardcoded stuff in the installation script (#2441)
`lgb.convert` functions should convert columns of type 'logical' (#2678)
`lgb.convert` functions should warn on unconverted columns of unsupported types (#2681)
`lgb.prepare()` and `lgb.prepare2()` should be simplified (#2683)
`lgb.prepare_rules()` and `lgb.prepare_rules2()` should be simplified (#2684)
`lgb.prepare()` and `lgb.prepare_rules()` (#3075)

New features

`find_package` and `target_link_libraries` (#4067, #3925)
`min_child_sample` (#5236)

New algorithms:

Objective and metric functions:

Python package:

`logging.Logger` (#4783)

Dask:

`num_threads` (#3714)
`init_model` (#4063)
`LGBMModel` (#3845)
`train()` function (#3846)
`cv()` function (#3847)
`DaskDataset` (#3944)
`pred_contrib` results for multiclass classification with sparse matrices (#4438)
`DaskLGBMClassifier.predict()` and `LGBMClassifier.predict()` (#3881)
`raw_score` in `predict()` (#3793)
`init_score` (#3807)
`pred_leaf` in `predict()` (#3792)
`predict()` (#3713)

R package:

`lgb.cv()` (#3924)
`cb.reset.parameters()` (#2665)
`lgb.Dataset` in `Predictor$predict()` (#2666)
`pkgdown >2.0` (#4859)
`lgb.cv()` (#4911)
`readRDS()` and `saveRDS()` (#4296)

New language wrappers:

Input enhancements:

(`ChunkedArray` in C API) (#3995, https://github.com/microsoft/LightGBM/pull/3997#issuecomment-791969953)
(`to_numpy()` method as it currently is) (#2003)