parrt / dtreeviz

A python library for decision tree visualization and model interpretation.
MIT License

support XGBoost and lightgbm? #83

Closed · parrt closed this issue 3 years ago

parrt commented 4 years ago

It would be useful to support decision trees from gradient boosting machines. Should be a simple matter of creating a shadow tree from one of the trees in the boosted ensemble. Interested in this one @tlapusan ?

tlapusan commented 4 years ago

@parrt for sure :). I've thought about this too and will look into it soon. Here is how I imagine it:

I know that there is already some implementation for Node, ShadowDecTree... maybe we already have the right setup... I will take a deeper look.

parrt commented 4 years ago

Yep, that was my intention for the shadow trees... they are generic, normal-looking trees that are not worried about efficiency, whereas the sklearn stuff is a bunch of parallel arrays to avoid creating objects per node. It should be a simple matter of walking the trees from the gradient boosting libraries and creating shadow trees. Then all of our stuff will just work.
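
Roughly, the idea in toy form (class and function names here are made up for illustration; this is not the actual dtreeviz shadow-tree API):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy sketch of the shadow-tree idea: turn sklearn's parallel arrays
# into plain node objects that are easy to walk.
class ToyShadowNode:
    def __init__(self, feature, threshold, left=None, right=None):
        self.feature = feature        # feature index used for the split (-2 for leaves in sklearn)
        self.threshold = threshold    # split threshold (unused for leaves)
        self.left = left
        self.right = right

def build_toy_shadow(tree_model: DecisionTreeClassifier) -> ToyShadowNode:
    t = tree_model.tree_              # sklearn keeps the tree as parallel arrays
    def walk(i: int) -> ToyShadowNode:
        if t.children_left[i] == -1:  # -1 marks "no child", i.e. a leaf
            return ToyShadowNode(t.feature[i], t.threshold[i])
        return ToyShadowNode(t.feature[i], t.threshold[i],
                             walk(t.children_left[i]), walk(t.children_right[i]))
    return walk(0)
```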

tlapusan commented 4 years ago

hi @parrt

First visualization for an XGBoost model:

[Screenshot: first XGBoost tree visualization, 2020-05-02]

The code needed several changes, but in the end I think it will be more maintainable and easier to extend with new models. The XGBoost decision tree API is quite weak in functionality; it exposes only a little tree metadata, and I had to compute some of it myself. But this can also be a good thing for the dtreeviz library... I have found only very poor visualizations for XGBoost on the internet.
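
For context, the native per-node metadata XGBoost exposes is basically what its tree dump contains; anything beyond that has to be recomputed from the data. A quick look, using a throwaway model purely for illustration:

```python
import numpy as np
import xgboost as xgb

# Throwaway model, only to show which per-node metadata XGBoost exposes natively.
X = np.random.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)
bst = xgb.train({"max_depth": 3, "objective": "binary:logistic"},
                xgb.DMatrix(X, label=y), num_boost_round=5)

print(bst.trees_to_dataframe().query("Tree == 0"))  # Tree, Node, Feature, Split, Gain, Cover, ...
print(bst.get_dump(with_stats=True)[0])             # same metadata as a plain-text dump of tree 0
```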

Would you like to have a video call in the next day or two to discuss the code changes?

parrt commented 4 years ago

Heh, cool. Sure. How about Monday? What time zone are you in? I'm in California, UTC/GMT -8:00.

tlapusan commented 4 years ago

Monday sounds good for me. I'm in Romania, GMT+3. I usually use this website to convert between time zones: https://www.worldtimebuddy.com/. It says that California is -10 for me... Any time between 20:00 and 23:00 would be OK for me.

parrt commented 4 years ago

Weird, that should be an 11-hour difference. Oh well, that website says it's 10 hours later for you. Cool. OK, how about I email you a Zoom URL and we sync up on Monday at 12:00 noon for me and 22:00 for you?

tlapusan commented 4 years ago

Sure, see you then


tlapusan commented 4 years ago

hi @parrt

I've updated the code architecture based on our discussions, so now ShadowDecTree is an abstract class with sklearn and XGBoost subclasses. Also, the visualization methods now take the shadow tree object as their main parameter.
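
Roughly, the shape of the abstraction looks like this (a simplified sketch; class and method names are illustrative, not the exact dtreeviz API):

```python
from abc import ABC, abstractmethod

# Simplified sketch of the new structure; names are illustrative.
class BaseShadowTree(ABC):
    @abstractmethod
    def node_feature(self, node_id: int) -> int:
        """Index of the feature this node splits on."""

    @abstractmethod
    def node_threshold(self, node_id: int) -> float:
        """Split threshold used by this node."""

class SklearnShadowTree(BaseShadowTree):
    """Would read sklearn's tree_.feature / tree_.threshold parallel arrays."""

class XGBShadowTree(BaseShadowTree):
    """Would parse the booster's tree dump / trees_to_dataframe() output."""
```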

We have a few more visualizations for XGBoost models. I still have some work to do (implement more methods in the XGBoost subclass, add docs, additional checks for the other visualizations, etc.), but I hope that in the next few weeks we will have a "production" version for XGBoost.

[Screenshot: additional XGBoost visualizations, 2020-05-13]

parrt commented 4 years ago

Wow! Looking really good.

As far as the user interface goes, I wonder if it would be helpful to let users either (1) pass in an sklearn/xgboost tree directly or (2) pass in a shadow tree. The latter is useful when you have lots of things to ask of the same shadow tree, since it avoids re-creating it all the time. On the other hand, I'm not sure it will matter much, and people might just find it easier not to know about shadow trees; that way we avoid leaking the abstraction to the user.

The code inside can use isinstance() to see which one they passed in. If they pass in a shadow tree, we are good to go; if it's a regular tree, we can automatically create a shadow tree for them.
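
Something along these lines (still a rough sketch; `BaseShadowTree`, `build_shadow_tree`, and `render_tree` are hypothetical stand-ins, not the real dtreeviz internals):

```python
# Rough sketch of the dispatch idea; the helper names are hypothetical.
def dtreeviz(model_or_shadow, x_data=None, y_data=None, **kwargs):
    if isinstance(model_or_shadow, BaseShadowTree):
        shadow = model_or_shadow                      # user already built a shadow tree
    else:
        shadow = build_shadow_tree(model_or_shadow,   # raw sklearn/xgboost model: wrap it for them
                                   x_data, y_data, **kwargs)
    return render_tree(shadow)                        # all plotting code sees only the shadow tree
```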

how does that sound?

tlapusan commented 4 years ago

hi @parrt,

Personally, I would prefer the (2) shadow tree method, but I think we can live with both of them, e.g.:

```python
# (1) sklearn/xgboost tree
def dtreeviz(raw_model, x_dataset, y_dataset, features, target, *args):
    shadow_tree = XGBDtree(raw_model, x_dataset, ...)
    dtreeviz(shadow_tree)

# (2) shadow tree
def dtreeviz(shadow_tree):
    ...  # all the code for visualization
```

parrt commented 4 years ago

Hiya. But why not hide all that from the user? They pass in either a shadow tree or a real model and we deal with it internally.

tlapusan commented 4 years ago

Hi, you're right, we can handle it internally and avoid doubling the number of methods.

Mikemraz commented 4 years ago

Great work @tlapusan! Is your implementation going to support LightGBM? If so, when would you release it?

tlapusan commented 4 years ago

Hi @Mikemraz, right now we are working on adding XGBoost to the library. We have made good progress, but there is still work to do. After that, we can switch to LightGBM. Did you try the scikit-learn version for decision trees, or are you interested only in LightGBM?

Mikemraz commented 4 years ago

Thanks for your response @tlapusan. We have tried the scikit-learn version of decision trees and your framework works perfectly, except that it does not support displaying Chinese characters. In our case, we are only interested in visualizing LightGBM. Please let me know once you have completed the implementation for LightGBM.

gnavvy commented 4 years ago

> Hi @Mikemraz, right now we are working on adding XGBoost to the library. We have made good progress, but there is still work to do. After that, we can switch to LightGBM. Did you try the scikit-learn version for decision trees, or are you interested only in LightGBM?

Hi @tlapusan, could you share whether there are any updates on the XGBoost support? We are evaluating whether to build our own version or whether there's a WIP version we can build on top of. Thanks!

parrt commented 4 years ago

@gnavvy xgboost is almost ready. @tlapusan and I are doing code review tomorrow.

tlapusan commented 4 years ago

Hi @gnavvy, I'm glad you're interested in XGBoost. Yes, we are almost ready to release the XGBoost version, I expect in a few days. It would be really helpful to have feedback from your side after we make the release.

parrt commented 4 years ago

@tlapusan is the branch https://github.com/tlapusan/dtreeviz/tree/support_other_tree_models_%2383 ? @gnavvy wants to peek.

tlapusan commented 4 years ago

@parrt yes, that should be the right branch

tlapusan commented 3 years ago

hi @gnavvy, we've just merged the XGBoost support into master. Would you or your team have time to give it a try? Feedback and suggestions are very important for us right now. You can take a look at dtreeviz_xgboost_visualisations.ipynb for more details.

gnavvy commented 3 years ago

> hi @gnavvy, we've just merged the XGBoost support into master. Would you or your team have time to give it a try? Feedback and suggestions are very important for us right now. You can take a look at dtreeviz_xgboost_visualisations.ipynb for more details.

Tested it locally and it worked out of the box, thanks for the great work @tlapusan! And +100 for abstracting the ShadowDecTree class out. Could you publish a new version on PyPI so we can install it and test with real data?

gnavvy commented 3 years ago

Here is my brain dump after glancing through the code:

The ShadowDecTree expects tree_model, x_data, and y_data, among other args, as input. This works when the features, target, and trained model are locally available, but it might not scale when large-scale data is trained in a distributed fashion (e.g. XGBoost4J-Spark in our case). For tree_model, we can dump and serialize the trained model to a local node for DTreeViz. It can be difficult for x_data and y_data, though, due to their sheer sizes.
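
For the tree_model half, the dump-and-load path is straightforward, roughly like this (the file name is illustrative, assuming the Spark job exported its native booster to storage the visualization node can read):

```python
import xgboost as xgb

# File name is illustrative; it assumes the Spark job exported its native booster.
booster = xgb.Booster()
booster.load_model("xgb_model_exported_from_spark.bin")

# From here the booster looks like any locally trained model and can be wrapped in a
# shadow tree, provided a small local sample of x_data / y_data is also available.
```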

One possible extension is to support overview first and details on-demand, e.g.

  1. allow users to bypass X_train and y_train for class_split_viz and regr_split_viz, and only visualize a meta node in that case.
  2. allow users to provide pre-computed feature ranges and only visualize the decision thresholds in those nodes.

With such flexibility, we can provide an overview of how the tree splits, which should be sufficient for guiding downstream task selection, even for large/deep trees with thousands of nodes. When users are interested in a particular node or prediction path, we can then serialize and fetch the corresponding data, which reduces I/O.

parrt commented 3 years ago

@gnavvy how common is the distributed Spark case?

Sounds complicated to allow data that doesn't fit in memory. Maybe that should be the role of the xgboost spark model; i.e., when asked for a piece of data, it finds it remotely, pulls it over, and returns it to dtreeviz.

gnavvy commented 3 years ago

> @gnavvy how common is the distributed Spark case?

Pretty much all of our production use cases that use a tree model are on xgboost + Spark, trained & inferenced in parallel.

> Maybe that should be the role of the xgboost spark model; i.e., when asked for a piece of data, it finds it remotely, pulls it over, and returns it to dtreeviz.

Yes, I agree that the on-demand data fetching logic is out of the scope of dtreeviz.

I was wondering if we can make X_train and y_train optional at the class_split_viz and regr_split_viz function level. If the whole data array is present, we draw the scatterplot / pie chart; if only the data range is provided, we draw the bounding box with the decision threshold from the model; if the data info is completely omitted, we show a text label describing how the node splits, taken from the model. This may complicate the function APIs a bit, but it seems feasible?
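
To make that concrete, the fallback could look roughly like this (function and helper names are hypothetical, not the current dtreeviz signatures):

```python
# Hypothetical sketch of the proposed fallback levels; not the current dtreeviz API.
def node_split_viz(node, x_column=None, feature_range=None):
    if x_column is not None:
        draw_detail_plot(node, x_column)     # full data available: scatterplot / pie chart
    elif feature_range is not None:
        lo, hi = feature_range
        draw_range_box(node, lo, hi)         # only a pre-computed range: bounding box + split threshold
    else:
        draw_split_label(node)               # no data at all: text label describing the split
```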

tlapusan commented 3 years ago

hi @gnavvy, thanks for your feedback. The dtreeviz(args) visualization method wasn't designed for distributed ML models or very large models, so it can run into plotting challenges there. Your suggestions make sense to me; give us a few days to think about it, and maybe we can find a solution and improve it ;)

Does XGBoost4J-Spark have the same API as default XGBoost? I'm asking because I don't know whether we can reuse the ShadowXGBDTree implementation or whether we should create a new one customized for XGBoost4J-Spark. Does XGBoost4J-Spark support Python?

A Spark dataframe/dataset doesn't have the indexing concept of a pandas dataframe, so code like X_train[:, node.feature()] is not possible. From what I know, Spark ML tree models don't provide an API to return a sample's node from a distributed dataset; it would be computationally expensive.

parrt commented 3 years ago

@tlapusan @gnavvy Maybe we should push this version out as 1.0 and see what people do with it?

tlapusan commented 3 years ago

Agree! 👍

parrt commented 3 years ago

Ok, adding to my list for tomorrow AM when my brain works :)

gnavvy commented 3 years ago

> @tlapusan @gnavvy Maybe we should push this version out as 1.0 and see what people do with it?

Yes please, I cannot wait to give it a try.

parrt commented 3 years ago

And....we're live! @tlapusan @gnavvy

parrt commented 3 years ago

@tlapusan do you want to post in the xgboost thread about this to see if they respond? If not, I'll contact them directly.

tlapusan commented 3 years ago

I think we can try both :) I will post in that thread

thvasilo commented 3 years ago

Hello @parrt and @tlapusan, I skimmed your explanation of the visualizations as I'm trying to take a look at the new XGBoost visualization. Thanks for contributing that!

It's a new topic for me so my question is: Is the visualization aimed at single decision trees? How are (potentially large) tree ensembles handled?

tlapusan commented 3 years ago

Hi @thvasilo,

Thanks for your comment. Right now all the visualizations are based only on individual trees. You need to specify a 'tree_index' parameter to point to a tree from the ensemble. You can plot multiple visualizations for different trees if you need to...

You can look at this notebook for all of the supported XGBoost visualizations: https://github.com/parrt/dtreeviz/blob/master/notebooks/dtreeviz_xgboost_visualisations.ipynb
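
For example, plotting one tree of an ensemble goes roughly like this (a small synthetic sketch; check the notebook for the exact, up-to-date argument names):

```python
import numpy as np
import xgboost as xgb
from dtreeviz.trees import dtreeviz

# Tiny synthetic example; argument names follow the notebook, which remains the reference.
X = np.random.rand(200, 3)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
features = ["f0", "f1", "f2"]

bst = xgb.train({"max_depth": 3, "objective": "binary:logistic"},
                xgb.DMatrix(X, label=y, feature_names=features),
                num_boost_round=10)

viz = dtreeviz(bst, x_data=X, y_data=y,
               feature_names=features, target_name="target",
               class_names=["neg", "pos"],
               tree_index=2)          # visualize the 3rd tree of the ensemble
viz.save("xgb_tree2.svg")
```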

tlapusan commented 3 years ago

hi @gnavvy, we will soon release a new version with Spark support: https://github.com/parrt/dtreeviz/issues/94

If XGBoost4J-Spark has the same API as the Spark DecisionTree (the MLlib version), it should work as well. Regarding the large size of a Spark dataframe for building a ShadowDecTree... a workaround would be to take a small, representative sample and convert it into a pandas dataframe. (I have an idea for removing the need for the dataset in some visualizations; I'm still thinking about and working on it.)
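
That sampling workaround could look roughly like this (a sketch: `spark_df`, `feature_names`, and the "label" column are assumed/illustrative):

```python
# Take a small, representative fraction of the distributed data and bring it to pandas.
sample_pdf = spark_df.sample(fraction=0.01, seed=42).toPandas()

x_data = sample_pdf[feature_names]   # feature columns used for the dtreeviz plots
y_data = sample_pdf["label"]         # target column name is illustrative
```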

Spark notebook with visualizations : https://github.com/parrt/dtreeviz/blob/master/notebooks/dtreeviz_spark_visualisations.ipynb

Let us know if this Spark integration is useful for you :)

gnavvy commented 3 years ago

Thanks for the heads up and the example notebook. We'll give it a try soon. (CC @mmui, @kenns29)

parrt commented 3 years ago

Great! Thanks, @gnavvy. If you can confirm there are no problems, we can do the full release. We just want somebody outside of our team to test it.