microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License
16.52k stars 3.82k forks source link

[RFC] Add FreeForm2 Into LightGBM #4742

Closed shiyu1994 closed 2 years ago

shiyu1994 commented 2 years ago

Summary

FreeForm2 is a module that transforms original feature values according to the arithmetic and logic expressions specified by users. The transformed features will be used as inputs to the model for training. We plan to integrate FreeForm2 into LightGBM.

Motivation

Quote the description from @chjinche in #4733: FreeForm2 is a flexible type of feature transforms, created by Microsoft Core Ranking and used widely over Microsoft production model training. As the name indicates, FreeForm2 empowers users to compose a free combination of features as they like. It is expressed by formulas to be applied in the model inputs. The surface syntax is s-expression, with parentheses in a LISP-like fashion to delimit. FreeForm2 has implicit type systems and evaluate a single, nested expression that returns a floating-point number. Please see more info about FreeForm in doc PR #4735.

Integration of FreeForm2 allows LightGBM to be more widely used in our products.

Description

We plan to divide the integration into 3 parts:

  1. Add the source code of FreeForm2 into LightGBM code base (PR #4733). The source code was originally a part of our product code, and we've got the approval to open source it. Though this is a large PR, it will have no interference with our current code base.
  2. Adding interface in LightGBM allows use of FreeForm2 end-to-end with the model training. This PR adds two parameters for data construction from file, transform_file and header_file. transform_file describes the transform expressions, and header_file is a separate file for the original feature names. This PR also adds some tests and enables checks for transform in both Github action workflow and Azure Pipeline. (#4734)
  3. Update the documents for how to use transform in LightGBM (#4735)
shiyu1994 commented 2 years ago

@StrikerRUS @jameslamb @guolinke @btrotta Let's have discussion about adding the FreeForm2 feature here. Feel free to express any concerns and suggestions. Thanks!

jameslamb commented 2 years ago

Thanks for this additional information. This seems like a very interesting and powerful project!

But I am concerned about the proposal to add it directly to LightGBM.

Could you please explain why adding this code to the LightGBM repo is preferable to releasing the project under its own repo?

I don't agree with the statement that this change "will have no interference with our current code base".

shiyu1994 commented 2 years ago

Hi @jameslamb, after some discussion with @chjinche, we agree that we can make the large PR #4733 a separate repo, and use it in LightGBM as an external library, just like compute, eigen and fmt. Do you think that's acceptable?

shiyu1994 commented 2 years ago

By the way, I updated your post above to change the wrongly referred PR number #4374 to #4734.

chjinche commented 2 years ago

hi @jameslamb Thanks for your insights! And thanks @shiyu1994 for proposing the discussion. I'd like to share more comments from my perspective. Feel free to share your views, and hope we could solve the issue asap.

StrikerRUS commented 2 years ago

Thanks for inviting to this discussion!

I'm absolutely agree with each point @jameslamb has already written above. I just want to highlight some moments about what I'm personally concerned.

As it was fairly noted, the main purpose of adding FreeForm2 to the LightGBM repo is to help developers use it inside different Microsoft departments.

The purpose of this integration is to enable LightGBM to be used in some of our products.

Integration of FreeForm2 allows LightGBM to be more widely used in our products.

We want to provide a built-in transform way to lightGBM, because it is a very common practice in tree based model training in Microsoft products, and benefits us a lot, now we are glad to use lightGBM as model in products and need to fill this gap of data transformation.

I agree that this project is quite interesting and it's great that Microsoft decided to open-source it. For sure, this project can gain some popularity among ML community. But it's debatable statement that it should be a part of LightGBM itself as a general-purpose tree-based gradient boosting framework. Adding FreeForm2 can shift development direction to an autoML field.

I strongly believe that a separate project with LightGBM as a dependency is the best option. Thanks to git and GitHub it can be done by various methods. Going this way maintainers attention won't be diffused by issues and PRs to FreeForm2.

I'm a kind of OK with a lightweight PR adding some core interface enhancements without which FreeForm2 can't work. But it'll be much better if they done inside FreeForm2 own repository as they are not used anywhere else. But I'm strongly against adding new CI jobs into LightGBM repo. We already have a lot, really a lot, of CI jobs to cover as much possible combinations of different environments as we can. Unfortunately, the number of CI jobs are not unlimited and we made a lot of efforts to overcome this without significantly losing the coverage. One another frustrating thing is that CI environments are not stable and sometimes (to be honest, quite often) we spend several days to simply repair something in a CI environment. During that time PRs are blocked for the merge and even are not being reviewed because of the limit of human resources here. Summing up all the above, any problems with FreeForm2 CI will hurt the development speed of LightGBM. Sorry, but I really can't get it why this all should be a part of this repo.

jameslamb commented 2 years ago

I updated your post above to change the wrongly referred PR number

ha thank you! sorry about that.

could we name the repo as lightgbm-transform? Cause the transform interfaces are designed specially for lightgbm usage, not general.

Oh wow! Until this comment I did not know that this "FreeForm2" project is intended to be LightGBM-specific.

If that's true then yes, I think it makes sense to include lightgbm- in the name.

we can make the large PR #4733 a separate repo, and use it in LightGBM as an external library, just like compute, eigen and fmt

Thank you very much. I definitely think that is preferable. That makes it clearer that that other repo is the place where issues and pull requests related to this large library should go.

the compliation is controlled by a new option USE_TRANSFORM with default value OFF. No impact to original codes if not turn on

I don't agree with this reasoning. As maintainers of this project, we're responsible for supporting all configurations of LightGBM. And users should expect that if we provide an alternative configuration, it has the full support of the projects' maintainers unless it is explicitly documented as not being well-supported.

USE_GPU and USE_CUDA are also OFF by default in this project, and we still receive many bug reports, feature requests, and pull requests related to those types of builds. Those requests do not take up any less maintainer time and energy just because those options are OFF by default.

Could you please kindly review PR Integrate transform implementation with lightGBM, add separate header file support. #4734 in advance?

I took a quick look, and see some immediate concerns:

  1. It looks like FreeForm2 is only supported using gcc on unix-like systems (linux + mac). That makes it far less portable than LightGBM itself, which is supported on linux, mac, and windows using MSVC, gcc, clang, and others. When users ask for support to be extended to other systems where they're already using LightGBM...will you work on such requests? Or is the vision for this feature that it will be gcc-only, unix-like-only until someone from the community volunteers to try to add support for other OSes or compilers?

  2. How do you expect users of the Python package to take advantage of this? Will you propose in a future PR that we build and publish wheels to PyPI that include the transform code? As of #4734, I think the only way someone would be able to use this feature in the Python package is to clone the repo and build from source.

  3. Similar question for R....the main package manager for R, CRAN, has very strict requirements for code that requires compilation, including a requirement that it work on all of the platforms CRAN considers important (several linux flavors, Windows, macOS, Solaris...and with both gcc and clang). Do you anticipate some people from Microsoft adding support for this in the R package in the future?

No need to answer each of those specific questions here (to avoid this thread getting too large), but I just raise them as some tangible examples of the maintenance costs that would be introduced by moving forward with this proposal.

shiyu1994 commented 2 years ago

@StrikerRUS @jameslamb, thanks for your suggestions. I think these PRs are only initial steps towards the full support of FreeForm2. And we can spend efforts in the future to make the support of FreeForm2 more portable. Actually, we have some ideas to create a separate library for data handling in LightGBM, which can contain current preprocessing logics in LightGBM (such as discretization), and the target encoding in PR #3234, and FreeForm2. This allows LightGBM to focus more on the GBDT algorithm. And the data library can handle feature engineering related issues.

chjinche commented 2 years ago

@StrikerRUS @jameslamb, thanks for your comments. With some discussions, we figured out a way that may solved concerns you raised :wink:. Could we add a reflection mechanism for base class Parser in lightGBM? In this way, users could use lightgbm as dependency and customize the parser they like to do data preprocessing in a built-in way. And no extra ci and limited compiler/os problems are introduced. We could prepare examples and tutorials in lightgbm-transform repo, user could go there and learn how to play with parser.

StrikerRUS commented 2 years ago

@chjinche Thank you very much for your response!

In this way, users could use lightgbm as dependency and customize the parser they like to do data preprocessing in a built-in way.

I believe this is the best way to go! And I think it's OK to add some API changes that are required by lightgbm-transform. We are doing almost the same with the SWIG wrapper: API is added in this repo but all tests are done in consumer's repository and we sometimes receive bug fixes (e.g. #2958).

user could go there and learn how to play with parser.

Please don't forget to add some links in LightGBM docs to lightgbm-transform with the aim to help users discover it (e.g. like here #4584).

chjinche commented 2 years ago

Thanks @StrikerRUS for the plan approval! We will prepare a new pr soon as we discussed.

jameslamb commented 2 years ago

I don't understand what "a reflection mechanism for base class Parser in LightGBM" means, but as long as LightGBM isn't responsible for compiling FreeForm2 and pull requests / tests / bug reports for FreeForm2 live in a separate repo, I'm supportive of that proposal. Thanks for finding an alternative integration plan.

chjinche commented 2 years ago

hi @shiyu1994 @jameslamb @StrikerRUS hope you are doing well! 😄 I've sent out the new integration PR #4782, could you please kindly review it? And looks most ci jobs passed except three R ones, but I did not change R package nor c api, could you please help take a look or share some simple debug way? Thanks a lot.

chjinche commented 2 years ago

Thanks @shiyu1994 for review the PR! @jameslamb @StrikerRUS May I know if you have time to take a look? 😄 all ci jobs passed now.

jameslamb commented 2 years ago

May I know if you have time to take a look?

Sorry for the delay, I've been traveling this week. I just reviewed and I'm comfortable with those changes, just looks like there is a bit of extra work to do to ensure the library continues to meet the strict portability requirements for CRAN. (https://github.com/microsoft/LightGBM/pull/4782#pullrequestreview-804412121)

StrikerRUS commented 2 years ago

Hey @chjinche ! Can the branch feat/enable_transform_to_lgbm be removed now from the LightGBM repo?

StrikerRUS commented 2 years ago

Gently ping @chjinche

chjinche commented 2 years ago

Hey @chjinche ! Can the branch feat/enable_transform_to_lgbm be removed now from the LightGBM repo?

Sorry for the late response, deleted the unused branch.

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.