parrt / dtreeviz

A python library for decision tree visualization and model interpretation.
MIT License

Pie charts show data that the classifier was trained on and not new data #269

Closed HannahAlexander closed 1 year ago

HannahAlexander commented 1 year ago

We fit a model using a DecisionTreeClassifier on upsampled data. We then wanted to visualise this model using the original data. In the plot all the numbered labels are correct but the ratios in the final pie charts are for the upsampled data.
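
For reference, a minimal, self-contained sketch of the setup described above (the toy dataset, variable names, and upsampling via sklearn's `resample` are illustrative assumptions, not taken from the original report):

```python
import numpy as np
import dtreeviz
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Toy imbalanced dataset standing in for the real one.
X_orig, y_orig = make_classification(n_samples=500, n_features=4,
                                     weights=[0.9, 0.1], random_state=0)

# Upsample the minority class so both classes are balanced for training.
minority = y_orig == 1
X_min_up, y_min_up = resample(X_orig[minority], y_orig[minority],
                              n_samples=(~minority).sum(), random_state=0)
X_up = np.vstack([X_orig[~minority], X_min_up])
y_up = np.concatenate([y_orig[~minority], y_min_up])

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_up, y_up)

# Passing the *original* data to dtreeviz: at the time of this report the
# pie-chart ratios still reflected the upsampled training data.
viz = dtreeviz.model(clf, X_train=X_orig, y_train=y_orig,
                     feature_names=[f"f{i}" for i in range(4)],
                     target_name="target", class_names=["neg", "pos"])
viz.view()  # in a notebook, view() displays the tree inline
```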

tlapusan commented 1 year ago

It would be better to use the same dataset for both training and the dtreeviz visualisations. Internally, dtreeviz uses both the tree metadata and the dataset passed as a parameter.

btw, what ML library are you using?

HannahAlexander commented 1 year ago

Sklearn. Okay, that makes sense. I have a highly skewed dataset, so I'm training my model on upsampled data, but I then want to visualise the decision tree using the fitted classifier on the original data. Is there a way to do this?

tlapusan commented 1 year ago

Based on the current dtreeviz implementation, I think no :( But it's good that you raised this issue, we could take it into consideration as a next possible feature.

HannahAlexander commented 1 year ago

That's great, thank you!

thomsentner commented 1 year ago

I'm running into a similar problem. I trained a decision tree on my train set, but would like to visualize its performance on the test set. When I use the test set as input for dtreeviz.model, the plots are incorrect: they show a weird mix of the data I gave it and the data the model was trained on.

As a train/test split is a very common procedure, is there no workaround for this?
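
For concreteness, a sketch of the usage being described, assuming the dtreeviz 2.x `dtreeviz.model()` API and a toy dataset (the variable names are illustrative):

```python
import dtreeviz
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Fit on the train split only.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Pass the *test* split to dtreeviz, hoping to see test-set distributions
# in the node plots (the behaviour this issue is about).
viz = dtreeviz.model(clf, X_train=X_test, y_train=y_test,
                     feature_names=iris.feature_names,
                     target_name="species",
                     class_names=list(iris.target_names))
viz.view()  # in a notebook, view() displays the tree inline
```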

tlapusan commented 1 year ago

hi @thomsentner, I put here some limitations : https://github.com/parrt/dtreeviz/issues/269#issuecomment-1442238799

btw, what library are you using? I will try to take a look in the next days; for sklearn we could have a chance to make it work for new data, I guess.

thomsentner commented 1 year ago

Thanks. I'm using sklearn as well.

tlapusan commented 1 year ago

Looking into the code, the change is supposed to be small... but things get a little complicated when the class_weight parameter is used at model training. I have to spend more time to better understand the overall picture.
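
As I read the complication (my interpretation, not a statement from the maintainers): when `class_weight` is set, sklearn stores weighted per-class totals in `tree_.value`, so node ratios recomputed from a new dataset would have to decide whether to re-apply those weights. A small illustration on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

plain = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
weighted = DecisionTreeClassifier(max_depth=2, class_weight="balanced",
                                  random_state=0).fit(X, y)

# Root node: per-class values as stored by sklearn.
print(plain.tree_.value[0])
# Root node for the same data, but weighted by class_weight, so the
# class ratios differ from the raw class distribution.
print(weighted.tree_.value[0])
```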

thomsentner commented 1 year ago

I created a PR with some very quick and dirty changes that already seem to solve the issue I faced personally. Hopefully this helps in resolving this issue.

Original behavior is preserved:

[screenshot]

X_test is displayed correctly:

[screenshot]

tlapusan commented 1 year ago

@thomsentner thanks for the PR, I just observed it while creating my PR also for this issue :d https://github.com/parrt/dtreeviz/pull/282

I will take a look also on yours ;)

parrt commented 1 year ago

I'm not sure how I feel about this. The point of this library is to visualize how a decision tree carves up feature space and makes decisions based upon the training data. The only role for testing data is to see how a specific test case would run down the tree, right? How would you show a decision tree for data that was not part of the construction of that tree? To me that means you simply train a new tree on the testing data and show that. Sorry if I am misunderstanding.

thomsentner commented 1 year ago

@parrt for me it would be to visualize the validation dataset and, as such, the real-world performance I can expect from the tree, not so much to run any specific test case. Looking at training samples will give me a very biased view of what will be happening.

tlapusan commented 1 year ago

@HannahAlexander your feedback would also help :)

tlapusan commented 1 year ago

@parrt the plan is to use the tree structure/metadata learned from training set and make the plots based on another dataset, like validation.

As we know, an important step in any ML project is to make a good train/validation (and even test) split, which should reflect the production data. Interpreting the tree based only on the train dataset doesn't mean the model will perform the same in production.

For example, let's say we have 92% accuracy on train and 80% on validation (or even 99%). The question is why? Interpreting the tree structure (learned from train) and making visualisations based on validation data should help to get the answers.
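
A toy illustration of that kind of gap (the numbers come from an overfit tree on synthetic data, not from a real project); the headline accuracies alone don't say *where* in the tree the behaviour differs, which is what a validation-based visualisation would show:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# A fully grown tree overfits the train split, producing a train/validation gap.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:     ", accuracy_score(y_train, clf.predict(X_train)))
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```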

tlapusan commented 1 year ago

Here are some visualisations which could help to understand the purpose. The first viz is based on train data, the second on validation, and the third one on the same validation data but with randomly changed target values (it's an exaggeration :) but it serves our purpose of showing what happens when a train/validation split is not correct).

[screenshot]

tlapusan commented 1 year ago

Another useful thing would be to compare the train and validation visualisations in parallel and check for differences. Ideally the node value distributions/ranges should be the same, but sometimes they are not.

[screenshot]

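For anyone wanting to reproduce this kind of side-by-side comparison, a sketch assuming the behaviour added in PR #282 (the data passed to `dtreeviz.model` may differ from the data the tree was trained on); the dataset and names are illustrative:

```python
import dtreeviz
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_val, y_train, y_val = train_test_split(iris.data, iris.target,
                                                  random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

common = dict(feature_names=iris.feature_names, target_name="species",
              class_names=list(iris.target_names))
viz_train = dtreeviz.model(clf, X_train=X_train, y_train=y_train, **common)
viz_val = dtreeviz.model(clf, X_train=X_val, y_train=y_val, **common)

viz_train.view()  # node distributions as seen on the training data
viz_val.view()    # node distributions recomputed on the validation data
```
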
tlapusan commented 1 year ago

@parrt any thoughts on this ? :)

parrt commented 1 year ago

Sorry for the delay. I'm 100% focused on some machine learning stuff at work haha.

Ok, I think I understand the purpose now. You want a mechanism to visualize how the tree structure interprets the validation set as a whole, instead of running a single test instance down the tree as we do now. In other words, the tree structure does not change, but the distributions in the decision nodes and the leaf nodes do, according to the information in the validation set. Do I have that correct?

tlapusan commented 1 year ago

Indeed, it's for the entire validation set, and the tree structure (decision split nodes learned during training) doesn't change. In other words, it's how the tree sees/interprets/predicts a new dataset (which is different from training).

It would be a pretty powerful feature for the library, I think. No other library allows this, from what I know :)
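
Mechanically, this is the same idea as routing new samples through the fixed tree with sklearn itself; a sketch (on an illustrative dataset) using `apply()` to recompute per-leaf class counts for a validation set:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The structure learned on X_train is fixed; apply() just routes new samples
# down it and reports the leaf each one lands in.
leaf_ids = clf.apply(X_val)
for leaf in np.unique(leaf_ids):
    classes, counts = np.unique(y_val[leaf_ids == leaf], return_counts=True)
    print(f"leaf {leaf}: {dict(zip(classes, counts))}")
```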

parrt commented 1 year ago

I'm a bit nervous about the feature, even though I see the utility in understanding a large validation set. Would this require a lot of changes or add complexity to the code base?

tlapusan commented 1 year ago

It should be a minimal change in the code. For sklearn it's this PR https://github.com/parrt/dtreeviz/pull/282/commits/648c24be56413aff3242534583e4057ea260e652, which as we can see is just a few lines of code. For the other libraries it should be the same, but I have to double-check.

Still, I propose to do this change for sklearn first and to see the community feedback.

parrt commented 1 year ago

> Still, I propose to do this change for sklearn first and to see the community feedback

Agreed, let's give it a try and let people report back.

thomsentner commented 1 year ago

@tlapusan tried working with your PR already and it seems to work great! Just running into one problem still: if a non-leaf node contains only one class in the test set, the plot of that node will fail. dtreeviz tries to plot both classes in all non-leaf nodes, but cannot find any data for the opposite class in this case. For me it occurs for the node just above these leaves:

[screenshot]
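
A quick way to spot the failing situation from sklearn alone (my own sketch, not part of dtreeviz): list internal nodes that the test set reaches with only one class present.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

node_indicator = clf.decision_path(X_test)        # sparse (n_samples, n_nodes)
is_leaf = clf.tree_.children_left == -1

for node in range(clf.tree_.node_count):
    if is_leaf[node]:
        continue
    reaches = node_indicator[:, node].toarray().ravel().astype(bool)
    classes_here = np.unique(y_test[reaches])
    if len(classes_here) < 2:
        print(f"internal node {node} sees only {classes_here} in the test set")
```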

tlapusan commented 1 year ago

@thomsentner nice that you had time to check it ! thanks

The issue which you mentioned happened to me also, and we just merged a PR for it into master a few days ago: https://github.com/parrt/dtreeviz/pull/284

@parrt we have https://github.com/parrt/dtreeviz/pull/282 for this issue. If we can merge it, @thomsentner should have the mentioned issue solved.

parrt commented 1 year ago

@tlapusan merged! thanks :)

parrt commented 1 year ago

Resolved by #282

mepland commented 1 year ago

Sorry I've been busy getting up to speed in my new position, but I'm very glad this was implemented - a great addition to dtreeviz!

parrt commented 1 year ago

no prob! I'm super busy too!