BART: Categorical example

PabloGGaray commented 1 month ago

Closes https://github.com/pymc-devs/pymc-bart/issues/100

📚 Documentation preview 📚: https://pymc-examples--663.org.readthedocs.build/en/663/

review-notebook-app[bot] commented 1 month ago

View / edit / reply to this conversation on ReviewNB

aloctavodia commented on 2024-05-23T13:55:22Z ----------------------------------------------------------------

Use az.style.use("arviz-darkgrid")

Remove plt.rcParams["figure.dpi"] = 300

aloctavodia commented on 2024-05-23T13:55:23Z ----------------------------------------------------------------

The partial dependence plot (PDP), together with the Variable Importance plot, confirms that Tail is the covariable with the smallest effect on the predicted variable. In the Variable Importance plot, Tail is the last covariable to be added and does not improve the result; in the PDP, Tail has the flattest response.
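
To make "flattest response" concrete, here is a minimal, dependency-free sketch of how a partial dependence curve is computed. It is not pymc-bart's implementation; `partial_dependence`, `toy_predict`, and the three toy rows are hypothetical names and made-up values used only for illustration.

```python
from statistics import fmean


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        # Overwrite one covariable in every row; keep the others as observed.
        modified = [{**row, feature: value} for row in rows]
        curve.append(fmean(predict(row) for row in modified))
    return curve


def toy_predict(row):
    # Hypothetical predictor: strong effect of Hallux, no effect of Tail.
    return 2.0 * row["Hallux"] + 0.0 * row["Tail"]


def spread(curve):
    # Range of the curve; a value near zero means a flat response.
    return max(curve) - min(curve)


# Made-up rows, not the Hawks data.
rows = [
    {"Hallux": 1.0, "Tail": 10.0},
    {"Hallux": 2.0, "Tail": 20.0},
    {"Hallux": 3.0, "Tail": 30.0},
]

hallux_curve = partial_dependence(toy_predict, rows, "Hallux", [1.0, 2.0, 3.0])
tail_curve = partial_dependence(toy_predict, rows, "Tail", [10.0, 20.0, 30.0])

print(spread(hallux_curve), spread(tail_curve))
```

A covariable whose curve has a spread close to zero, like Tail in this toy setup, barely moves the average prediction, which is what a flat PDP panel indicates.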

aloctavodia commented on 2024-05-23T13:55:23Z ----------------------------------------------------------------

Add to the next section and compare with the PPC plot, or remove it.

aloctavodia commented on 2024-05-23T13:55:24Z ----------------------------------------------------------------

So far we have a very good result concerning the classification of the species based on the 5 covariables. However, if we want to select a subset of covariables to perform future classifications, it is not very clear which of them to select. Perhaps the one safe conclusion is that Tail could be eliminated. At the beginning, when we plotted the distribution of each covariable, we said that the most important variables for the classification could be Wing, Weight, and Culmen; nevertheless, after running the model we saw that Hallux, Culmen, and Wing proved to be the most important ones.

Unfortunately, the partial dependence plots show a very wide dispersion, making the results look suspicious. One way to reduce this variability is to fit 3 independent trees; below we will see how to do this and get a more accurate result.

aloctavodia commented on 2024-05-23T13:55:25Z ----------------------------------------------------------------

Fitting independent trees

The option to fit independent trees with pymc-bart is set with the separate_trees parameter: pmb.BART(..., separate_trees=True, ...). As we will see, for this example, this option does not make a big difference in the predictions, but it helps reduce the variability in the PPC and gives a small improvement in the in-sample comparison. When this option is used with bigger datasets, keep in mind that the model fits more slowly, so the better result comes at the expense of computational cost. The following code runs the same model and analysis as before, but fitting 3 independent trees. Compare the time to run this model with the previous one.

PabloGGaray commented on 2024-05-23T16:00:54Z ----------------------------------------------------------------

Is the "3" in "but fitting 3 independent trees." ok?

aloctavodia commented on 2024-05-23T16:06:16Z ----------------------------------------------------------------

Well, it is 3 independent "sum of trees". Better to remove the "3"
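
For readers following this thread, a minimal sketch of what the `separate_trees=True` option looks like in a categorical BART model. Only `pmb.BART(..., separate_trees=True, ...)` comes from the discussion above; the function name `build_hawks_model`, its arguments, the variable names, `m=50`, and the softmax/Categorical structure are assumptions made for illustration and may differ from the notebook under review.

```python
def build_hawks_model(X, y_codes, species, separate_trees=True):
    """Build a categorical BART model (illustrative sketch, not the notebook's code).

    X        : 2D array of covariables, one row per observation (assumed).
    y_codes  : integer species code for each observation (assumed).
    species  : list of species labels, one per category (assumed).
    """
    # Imported inside the function so the sketch can be defined and read
    # without pymc / pymc-bart installed; calling it does require them.
    import numpy as np
    import pymc as pm
    import pymc_bart as pmb

    coords = {"n_obs": np.arange(len(X)), "species": species}
    with pm.Model(coords=coords) as model:
        # One latent function per species. With separate_trees=True, each
        # species gets its own independent sum of trees.
        mu = pmb.BART(
            "mu",
            X,
            y_codes,
            m=50,  # assumed number of trees per sum of trees
            dims=["species", "n_obs"],
            separate_trees=separate_trees,
        )
        # Softmax over the species axis turns the latent values into probabilities.
        theta = pm.Deterministic("theta", pm.math.softmax(mu, axis=0))
        pm.Categorical("y", p=theta.T, observed=y_codes)
    return model
```

Building the model once with `separate_trees=False` and once with `separate_trees=True`, and timing `pm.sample` inside each model context, is one way to make the speed comparison suggested above.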

aloctavodia commented on 2024-05-23T13:55:26Z ----------------------------------------------------------------

Now we are going to reproduce the same analyses as before.

fonnesbeck commented on 2024-05-23T16:12:08Z ----------------------------------------------------------------

Line #7.        Hawks = pd.read_csv(pm.get_data("marketing.csv"))[

Is this a copy/paste error? I assume we don't want marketing.csv.

PabloGGaray commented on 2024-05-23T17:10:33Z ----------------------------------------------------------------

Yes, it's a copy/paste error, thanks.

fonnesbeck commented on 2024-05-23T16:12:10Z ----------------------------------------------------------------

Second sentence needs some cleanup/rewording. Maybe something like:

Still, none of the variables have a marked separation among the species distributions such that they can cleanly separate them.

fonnesbeck commented on 2024-05-23T16:12:11Z ----------------------------------------------------------------

First sentence needs some rewording. Perhaps something like:

It may be that some of the input variables are not informative for classifying by species, so in the interest of parsimony and in reducing the computational cost of model estimation, it is useful to quantify the importance of each variable in the dataset.

pymc-devs / pymc-examples

BART: Categorical example #663