hski-github / spotfire-sankey-diagram

Spotfire mods visualization for Sankey diagram
MIT License

Sankey diagram to visualize Decision Tree results #29

Open hski-github opened 2 years ago

hski-github commented 2 years ago

Hi there, great mod! I'm wondering if it could be used to visualize the decision tree outputs from Spotfire's classification/regression tree tool? The tool's outputs would probably need to be modified but the idea does not seem to be crazy: https://www.greenbook.org/mr/market-research-methodology/sankey-diagrams-a-better-way-to-visualize-decision-trees/ Cheers, Mark

See comment on TIBCO Community page https://community.tibco.com/wiki/sankey-diagram-mod-tibco-spotfirer

Mark-iGit commented 2 years ago

I've played with the mod a little, but I cannot figure out a data format that would let me use it for decision tree visualizations. Is this me being stupid, or is it related to the other comment on the TIBCO community ("Would it be possible to have the input table set up just like the setup which is needed for example the NetworkD3 R package? So one column with the category, value, from and to. With this setup it is possible to have different values for the same category, to be used in for examples material flows where you have losses. ")?

hski-github commented 2 years ago

Could you provide some example data, what you get out of the decision tree?

Mark-iGit commented 2 years ago

Since I couldn't figure out a suitable data format I'll attach an image instead. The data set classifies products into one of three classes (above spec, in spec or below spec) based on processing parameters such as Temperature, Pressure or Etch_Rate. Each of the nodes has an ID and a count of parts falling into that node (N): [image]

No need to try and capture all of it, but I think the important bit is that a given variable, e.g. Etch_Rate, can show up as a splitting factor with various values at various levels of the tree. I've tried to provide data in tabular form for some of the nodes but, again, couldn't figure out a suitable format:

ID | L0_Temperature | L0_Pressure | L0_Etch_Rate | L1_Temperature | L1_Pressure | L1_Etch_Rate | L2_Temperature | L2_Pressure | L2_Etch_Rate | L3_Temperature | L3_Pressure | L3_Etch_Rate | Count
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
1 | all | all | all |  |  |  |  |  |  |  |  |  | 251
2 |  |  |  | <=68.8 |  |  |  |  |  |  |  |  | 152
3 |  |  |  | >68.8 |  |  |  |  |  |  |  |  | 93
4 |  |  |  | <=68.8 |  |  |  |  | <=20 |  |  |  | 55
5 |  |  |  | <=68.8 |  |  |  |  | >20 |  |  |  | 87
16 |  |  |  | >68.8 |  |  |  |  | <=22 |  |  |  | 7
17 |  |  |  | >68.8 |  |  |  |  | >22 |  |  |  | 81
6 |  |  |  | <=68.8 |  |  |  |  | >20 |  |  | <=20.8 | 42
7 |  |  |  | <=68.8 |  |  |  |  | >20 |  |  | >20.8 | 45
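One possible way to get from this table to the from/to/value layout quoted earlier in the thread is to treat each row as the ordered list of split conditions leading to that node, and to look up the parent as the node whose conditions are the same list minus the last entry. The following is only a sketch under that assumption: the `Node` tuple, the `tree_to_links` helper and the condition strings are made up for illustration and are not something the mod or Spotfire provides.

```python
from collections import namedtuple

# Hypothetical representation: one entry per table row, with the split
# conditions written out in order from the root down to the node.
Node = namedtuple("Node", ["id", "path", "count"])

NODES = [
    Node(1, (), 251),
    Node(2, ("Temperature<=68.8",), 152),
    Node(3, ("Temperature>68.8",), 93),
    Node(4, ("Temperature<=68.8", "Etch_Rate<=20"), 55),
    Node(5, ("Temperature<=68.8", "Etch_Rate>20"), 87),
    Node(16, ("Temperature>68.8", "Etch_Rate<=22"), 7),
    Node(17, ("Temperature>68.8", "Etch_Rate>22"), 81),
    Node(6, ("Temperature<=68.8", "Etch_Rate>20", "Etch_Rate<=20.8"), 42),
    Node(7, ("Temperature<=68.8", "Etch_Rate>20", "Etch_Rate>20.8"), 45),
]


def tree_to_links(nodes):
    """Return (from_id, to_id, condition, value) tuples, one per parent/child pair."""
    by_path = {node.path: node for node in nodes}
    links = []
    for node in nodes:
        if not node.path:
            continue  # the root has no incoming link
        parent = by_path.get(node.path[:-1])
        if parent is None:
            continue  # parent row is missing from the table, so skip this node
        # The link carries the child's count and the condition of the last split.
        links.append((parent.id, node.id, node.path[-1], node.count))
    return links


links = tree_to_links(NODES)
for link in links:
    print(link)
```

Each resulting tuple is one row of a from/to/value table, with the split condition available as a label or category. The same variable can appear at several levels (Etch_Rate here) because links are keyed by node ID rather than by variable name.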

hski-github commented 2 years ago

Can you maybe describe what raw data you used and the steps to calculate the decision tree classification?

Mark-iGit commented 2 years ago

It is a synthetic data set in which I simulated the outcome of a critical dimension (CD) measurement on many parts based on the Temperature, Pressure and Etch_Rate of the manufacturing process. The decision tree was then asked to find a model that explains the classification of the CD measurement result (too large = above spec, just right = in spec, or too small = below spec). It does so by splitting the measured parts on Temperature, Pressure or Etch_Rate, always selecting a split that improves the prediction (it was a Random Forest, and I have visualized the result from one of the trees). The question is not so much about the data or the model as about whether, in principle, the data can be turned into a format that shows how many parts are split off by the first split (Temperature >68.8 or <=68.8), then how many of those with >68.8 are split off again by Etch_Rate<22, and so on. Similar to what is described e.g. here: https://www.greenbook.org/mr/market-research-methodology/sankey-diagrams-a-better-way-to-visualize-decision-trees/ Does this make sense?
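To make the idea concrete, here is a rough sketch of how the per-split counts could be derived directly from the raw parts rather than from the tree output. Everything in it is an assumption for illustration: the nested-dict `TREE` is a simplified two-level tree that reuses only the thresholds 68.8, 20 and 22 from this thread, the `PARTS` records are invented, and `count_flows` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical two-level tree. Internal nodes are dicts, leaves are plain names.
TREE = {
    "name": "all parts",
    "feature": "Temperature",
    "threshold": 68.8,
    "low": {
        "name": "Temperature<=68.8",
        "feature": "Etch_Rate",
        "threshold": 20.0,
        "low": "Temperature<=68.8, Etch_Rate<=20",
        "high": "Temperature<=68.8, Etch_Rate>20",
    },
    "high": {
        "name": "Temperature>68.8",
        "feature": "Etch_Rate",
        "threshold": 22.0,
        "low": "Temperature>68.8, Etch_Rate<=22",
        "high": "Temperature>68.8, Etch_Rate>22",
    },
}

# Invented example records, not taken from the data set discussed here.
PARTS = [
    {"Temperature": 65.0, "Etch_Rate": 19.5},
    {"Temperature": 66.2, "Etch_Rate": 21.0},
    {"Temperature": 70.1, "Etch_Rate": 23.4},
    {"Temperature": 72.5, "Etch_Rate": 21.5},
    {"Temperature": 69.0, "Etch_Rate": 24.0},
]


def count_flows(tree, parts):
    """Send every part down the tree and count how many cross each parent/child edge."""
    flows = Counter()
    for part in parts:
        node = tree
        while isinstance(node, dict):
            # Values at or below the threshold go to the "low" branch.
            branch = "low" if part[node["feature"]] <= node["threshold"] else "high"
            child = node[branch]
            child_name = child["name"] if isinstance(child, dict) else child
            flows[(node["name"], child_name)] += 1
            node = child
    return flows


flows = count_flows(TREE, PARTS)
for (source, target), value in flows.items():
    print(source, "->", target, value)
```

Each counted edge is one Sankey link: the width of the flow from "all parts" into "Temperature>68.8" is the number of parts sent that way by the first split, and so on down the tree.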