Closed parrt closed 3 years ago
@parrt for sure :). I've thought about this also. I will look over it soon. But here's how I imagine it:
I know that there is already some implementation for Node, ShadowDecTree... maybe we already have the right setup... I will take a deeper look.
Yep, that was my intention for the shadow trees... they are a generic, normal-looking tree that is not worried about efficiency, whereas the sklearn stuff is a bunch of parallel arrays to avoid creating objects per node. It should be a simple matter of walking the trees from the gradient boosting libraries and creating shadow trees. Then all of our stuff will just work.
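As a rough sketch of that idea (all names here are hypothetical, not dtreeviz's actual classes), converting sklearn-style parallel arrays into plain linked node objects could look like:

```python
# Hypothetical sketch: turning sklearn-style parallel arrays
# (children_left, children_right, feature, threshold) into plain
# linked node objects, the way a shadow tree would.
class ShadowNode:
    def __init__(self, node_id, feature, threshold, left=None, right=None):
        self.node_id = node_id
        self.feature = feature      # -1 for leaves (sklearn convention)
        self.threshold = threshold
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None

def build_shadow(children_left, children_right, feature, threshold, node_id=0):
    # -1 in the children arrays marks "no child"
    left_id = children_left[node_id]
    right_id = children_right[node_id]
    left = (build_shadow(children_left, children_right, feature, threshold, left_id)
            if left_id != -1 else None)
    right = (build_shadow(children_left, children_right, feature, threshold, right_id)
             if right_id != -1 else None)
    return ShadowNode(node_id, feature[node_id], threshold[node_id], left, right)

# Tiny tree: root (node 0) splits on feature 2 at 0.5 into leaves 1 and 2
root = build_shadow([1, -1, -1], [2, -1, -1], [2, -1, -1], [0.5, 0.0, 0.0])
```

Once the model is walked into objects like these, visualization code can traverse them uniformly, regardless of which library produced the tree.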
hi @parrt
First visualization for the xgb model:
The code needed several changes, but in the end I think it will be more maintainable and easier to add new models. The XGBoost decision tree API is quite weak in functionality; it offers only a little tree metadata and I needed to calculate some of it myself. BUT, this can also be a good thing for the dtreeviz library... on the internet I have found very poor visualizations for xgboost.
Would you like to have in the next day a video call, to discuss more about code changes ?
Heh, cool. Sure. How about Monday? What time zone are you in? I'm in California UTC/GMT -8:00 hours
Monday sounds good for me. I'm in Romania, GMT+3. I usually use this website to calculate different timezones: https://www.worldtimebuddy.com/. It says that California is -10... For me, any time between 20:00 and 23:00 would be ok.
Weird, that should be an 11-hour difference. Oh well, that website says it's 10 hours later for you. Cool. Ok, how about I email you a zoom URL and we sync up on Monday at 12:00 noon for me and 22:00 for you?
Sure, see you then
hi @parrt
I've updated the code architecture based on our discussions, so now ShadowDecTree is an abstract class with Sklearn and XGBoost subclasses. Also, the visualization methods now take the shadow tree object as their main parameter.
We have some more visualizations for the XGBoost model. I still have some more work to do (like implementing more methods for the XGBoost subclass, adding docs, additional checks for other visualizations, etc.), but I hope in the next few weeks we will have a "production" version for XGBoost.
Wow! Looking really good.
As far as the user interface, I wonder if it would be helpful to let users either (1) pass in a sklearn/xgboost tree directly or (2) pass in a shadow tree. The latter case is useful when you have lots of things to ask the same shadow tree, which avoids having to re-create it all the time. On the other hand, I'm not sure it's going to matter much, and people might just find it easier to not know about shadow trees. This way we prevent leaking that abstraction to the user.
The code inside can test `isinstance()` to see which one they passed in. If they pass in a shadow tree we are good to go. If a regular tree, we can automatically create a shadow for them.
How does that sound?
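A minimal sketch of that dispatch (class and function bodies are hypothetical placeholders, not the actual dtreeviz code):

```python
# Hypothetical sketch of the isinstance() dispatch described above:
# accept either a shadow tree or a raw model, and wrap the raw model
# automatically so users never need to know about shadow trees.
class ShadowDecTree:
    def __init__(self, raw_model):
        self.raw_model = raw_model

def dtreeviz(tree, *args, **kwargs):
    if isinstance(tree, ShadowDecTree):
        return tree                  # caller already built one; reuse it
    return ShadowDecTree(tree)       # raw sklearn/xgboost model: wrap it

raw_model = object()                 # stands in for a trained model
viz = dtreeviz(raw_model)            # wrapped automatically
```

Power users who call many visualization functions on the same tree can build the shadow tree once and pass it in each time, avoiding repeated construction.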
hi @parrt,
(1) sklearn/xgboost tree
(2) shadow tree
Personally, I would prefer the (2) method, but I think we can live with both of them, e.g.:
```
dtreeviz(shadow_tree):
    all the code for visualization....

dtreeviz(raw_model, x_dataset, y_dataset, features, target, *args):
    shadow_tree = XGBDtree(raw_model, x_dataset, ....)
    dtreeviz(shadow_tree)
```
Hiya. But why not hide all that from the user? They pass in either shadow or real model and we deal with it internally.
Hi, you're right, we can deal with it internally and avoid doubling the number of methods.
Great work! @tlapusan, is your implementation going to support lightGBM? If so, when would you release it?
Hi @Mikemraz, right now we are working to add xgboost to the library. We have good progress, but we still have to work on it. After that, we can switch to lightGBM. Did you try the scikit-learn version for decision trees, or are you interested only in lightGBM?
Thanks for your response @tlapusan. We have tried the scikit-learn version of decision trees and your framework works perfectly, except it does not support displaying Chinese characters. In our case, we are only interested in visualizing lightGBM. Please let me know once you have completed the implementation for lightGBM.
Hi @tlapusan, could you share if there're any updates on the xgboost support? We are evaluating whether to build our own version or if there's a WIP version that we can build on top. Thanks!
@gnavvy xgboost is almost ready. @tlapusan and I are doing code review tomorrow.
Hi @gnavvy, I'm glad that you are interested in xgboost. Yes, we are almost ready to release the xgboost version, I assume in a few days. It would be really helpful to have feedback from your side after we make the release.
@tlapusan is the branch https://github.com/tlapusan/dtreeviz/tree/support_other_tree_models_%2383 ? @gnavvy wants to peek.
@parrt yes, that should be the right branch
hi @gnavvy, we've just merged the xgboost support into master. Would you or your team have time to give it a try? Feedback/suggestions are very important for us right now. You can take a look into dtreeviz_xgboost_visualisations.ipynb for more details.
Tested it locally and it worked out of the box, thanks for the great work @tlapusan! And +100 for abstracting out the ShadowDecTree class. Could you publish a new version on PyPI so we can install and test with real data?
Here is my brain dump after glancing through the code:
The `ShadowDecTree` expects `tree_model`, `x_data`, and `y_data` among other args as input. This works if the features, target, and the trained model are locally available, but it might not scale when large-scale data is trained in a distributed fashion (e.g. XGBoost4J-Spark in our case). For `tree_model`, we can dump and serialize the trained model to a local node for dtreeviz. It can be difficult for `x_data` and `y_data` though, due to their sheer sizes.
One possible extension is to support overview first and details on demand, e.g. make `X_train` and `y_train` optional for `class_split_viz` and `regr_split_viz`, and only visualize a meta node in that case. With such flexibility, we can provide an overview of how the tree splits, which should be sufficient for guiding the selection of the downstream tasks, even for large/deep trees of thousands of nodes. When users are interested in a particular node of the prediction path, we can then serialize and fetch the corresponding data to reduce the I/O.
@gnavvy how common is the distributed spark case?
Sounds complicated to allow data that doesn't fit in memory. Maybe that should be the role of the xgboost spark model; i.e., when asked for a piece of data, it finds it remotely, pulls it over, and returns it to dtreeviz.
> @gnavvy how common is the distributed spark case?
Pretty much all of our production use cases that use a tree model are on xgboost + Spark, trained & inferenced in parallel.
> Maybe that should be the role of the xgboost spark model; i.e., when asked for a piece of data, it finds it remotely, pulls it over, and returns it to dtreeviz.
Yes, I agree that the on-demand data fetching logic is out of the scope of dtreeviz.
I was wondering if we can make `X_train` and `y_train` optional at the `class_split_viz` and `regr_split_viz` function level. If the whole data array is present, we draw the scatterplot / pie chart; if only the data range is provided, we draw the bounding box with the decision threshold from the model; if the data info is completely omitted, we show a text label of how the node splits from the model. This may complicate the function APIs a bit but seems feasible?
hi @gnavvy, thanks for your feedback. The dtreeviz(args) visualization method wasn't implemented with distributed ML models or very large models in mind, so plotting them can pose some challenges. Your suggestions make sense to me; give us a few days to think about it, maybe we can find a solution and improve it ;)
Does XGBoost4J-Spark have the same API as default XGBoost? I'm asking because I don't know whether we can use the ShadowXGBDTree implementation or we should create a new one, customized for xgboost4j-spark. Does xgboost4j-spark support Python?
Spark dataframes/datasets don't have the indexing concept of a pandas dataframe, so code like `X_train[:, node.feature()]` is not possible. From my knowledge, Spark ML tree models don't provide an API to return a sample's node from a distributed dataset; it would be computationally expensive.
@tlapusan @gnavvy Maybe we should push this version out as 1.0 and see what people do with it?
Agree! 👍
Ok, adding to my list for tomorrow AM when my brain works :)
@tlapusan @gnavvy Maybe we should push this version out as 1.0 and see what people do with it?
Please, cannot wait to give it a try.
And....we're live! @tlapusan @gnavvy
@tlapusan do you want to post in the xgboost thread about this to see if they respond? If not, I'll contact directly.
I think we can try both :) I will post in that thread
Hello @parrt and @tlapusan I skimmed your explanation of the visualizations as I'm trying to take a look at the new XGBoost visualization, thanks for contributing that!
It's a new topic for me so my question is: Is the visualization aimed at single decision trees? How are (potentially large) tree ensembles handled?
Hi @thvasilo,
Thanks for your comment. Right now all the visualizations are based only on individual trees. You need to specify a 'tree_index' parameter to point to a tree from a tree ensemble. You could plot multiple visualizations for different trees if you need...
You can investigate this notebook for all xgboost's supported visualizations : https://github.com/parrt/dtreeviz/blob/master/notebooks/dtreeviz_xgboost_visualisations.ipynb
hi @gnavvy, we will soon release a new version with Spark support: https://github.com/parrt/dtreeviz/issues/94
If XGBoost4J-Spark has the same API as the Spark DecisionTree (mllib version), it should also work. Regarding the big size of a Spark dataframe for building a ShadowDecTree... a workaround would be to take a small, representative sample and convert it into a pandas dataframe. (I have an idea how to remove the need for the dataset for some visualizations; working and thinking on it.)
Spark notebook with visualizations : https://github.com/parrt/dtreeviz/blob/master/notebooks/dtreeviz_spark_visualisations.ipynb
Let us know if this Spark integration is useful for you :)
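A stdlib-only sketch of that sampling workaround (names hypothetical; with Spark this would typically be a `df.sample(...).toPandas()` call instead):

```python
import random

# Hypothetical sketch of the workaround above: draw a small,
# representative per-class sample from a large dataset before
# building a shadow tree, so the data fits in local memory.
def stratified_sample(rows, labels, per_class, seed=42):
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    sample = []
    for label, group in sorted(by_class.items()):
        k = min(per_class, len(group))       # classes smaller than per_class keep all rows
        sample.extend(rng.sample(group, k))
    return sample

# 1000 rows, two balanced classes -> 50 rows kept per class
big_x = [[i] for i in range(1000)]
big_y = [i % 2 for i in range(1000)]
small = stratified_sample(big_x, big_y, per_class=50)
```

Sampling per class rather than uniformly keeps rare classes represented in the visualization, at the cost of distorting the class proportions shown in node pie charts.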
Thanks for the heads up and the example notebook. We'll give it a try soon. (CC @mmui, @kenns29)
Great! Thanks, @gnavvy. If you can confirm there are no problems, we can do the full release. we just want to have somebody outside of our team test it.
It would be useful to support decision trees from gradient boosting machines. Should be a simple matter of creating a shadow tree from one of the trees in the boosted ensemble. Interested in this one @tlapusan ?