Closed PabloGGaray closed 1 month ago
View / edit / reply to this conversation on ReviewNB
aloctavodia commented on 2024-05-23T13:55:22Z ----------------------------------------------------------------
Use az.style.use("arviz-darkgrid")
Remove plt.rcParams["figure.dpi"] = 300
aloctavodia commented on 2024-05-23T13:55:23Z ----------------------------------------------------------------
The pdp plot, together with the Variable Importance plot, confirms that Tail is the covariable with the smallest effect on the predicted variable. In the Variable Importance plot, Tail is the last covariable to be added and does not improve the result; in the pdp plot, Tail has the flattest response.
aloctavodia commented on 2024-05-23T13:55:23Z ----------------------------------------------------------------
Add to the next section and compare with the PPC plot or remove it
aloctavodia commented on 2024-05-23T13:55:24Z ----------------------------------------------------------------
So far we have a very good result concerning the classification of the species based on the 5 covariables. However, if we want to select a subset of covariables to perform future classifications, it is not very clear which of them to select. One thing that seems certain is that Tail could be eliminated. At the beginning, when we plotted the distribution of each covariable, we said that the most important variables for the classification could be Wing, Weight and Culmen; nevertheless, after running the model we saw that Hallux, Culmen and Wing proved to be the most important ones.
Unfortunately, the partial dependence plots show a very wide dispersion, making the results look suspicious. One way to reduce this variability is to fit 3 independent trees; below we will see how to do this and get a more accurate result.
aloctavodia commented on 2024-05-23T13:55:25Z ----------------------------------------------------------------
Fitting independent trees
The option to fit independent trees with pymc-bart is set with the parameter pmb.BART(..., separate_trees=True, ...). As we will see, for this example, using this option doesn't make a big difference in the predictions, but it helps us reduce the variability in the ppc and gives a small improvement in the in-sample comparison. If this option is used with bigger datasets, keep in mind that the model fits more slowly, so you can obtain a better result at the expense of computational cost. The following code runs the same model and analysis as before, but fitting 3 independent trees. Compare the time to run this model with the previous one.
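A minimal sketch of what such a cell could look like, assuming the model is a softmax over a BART variable with one row per species and a Categorical likelihood. The helper names build_model, fit and timed, and the argument names X, y and n_species, are illustrative assumptions and are not taken from the notebook; only pmb.BART(..., separate_trees=True, ...) comes from the text above.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def build_model(X, y, n_species, separate_trees=False):
    """Hypothetical multi-class BART model (assumed structure, see lead-in).

    X: (n_obs, n_features) array of covariables.
    y: integer species codes in 0..n_species-1.
    """
    # Imported here so the timing helper can be used without these packages.
    import pymc as pm
    import pymc_bart as pmb

    coords = {"n_obs": list(range(len(y))), "species": list(range(n_species))}
    with pm.Model(coords=coords) as model:
        # separate_trees=True fits an independent sum of trees per species.
        mu = pmb.BART(
            "mu", X, y, m=50, separate_trees=separate_trees,
            dims=["species", "n_obs"],
        )
        theta = pm.Deterministic("theta", pm.math.softmax(mu, axis=0))
        pm.Categorical("y", p=theta.T, observed=y)
    return model


def fit(X, y, n_species, separate_trees):
    """Sample from the model; returns the inference data."""
    import pymc as pm

    with build_model(X, y, n_species, separate_trees):
        return pm.sample()


# Usage (needs pymc and pymc-bart installed):
# idata_shared, t_shared = timed(fit, X, y, 3, separate_trees=False)
# idata_sep, t_sep = timed(fit, X, y, 3, separate_trees=True)
# print(f"shared: {t_shared:.1f}s, separate: {t_sep:.1f}s")
```

Timing both fits with the same helper keeps the comparison fair, since the only thing that changes between the two calls is the separate_trees flag.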
PabloGGaray commented on 2024-05-23T16:00:54Z ----------------------------------------------------------------
Is the "3" in "but fitting 3 independent trees." OK?
aloctavodia commented on 2024-05-23T16:06:16Z ----------------------------------------------------------------
Well, it is 3 independent "sum of trees". Better to remove the "3"
aloctavodia commented on 2024-05-23T13:55:26Z ----------------------------------------------------------------
Now we are going to reproduce the same analyses as before.
fonnesbeck commented on 2024-05-23T16:12:08Z ----------------------------------------------------------------
Line #7. Hawks = pd.read_csv(pm.get_data("marketing.csv"))[
Is this a copy/paste error? I assume we don't want marketing.csv.
Yes, it's a copy/paste error, thanks.
fonnesbeck commented on 2024-05-23T16:12:10Z ----------------------------------------------------------------
Second sentence needs some cleanup/rewording. Maybe something like:
Still, none of the variables have a marked separation among the species distributions such that they can cleanly separate them.
fonnesbeck commented on 2024-05-23T16:12:11Z ----------------------------------------------------------------
First sentence needs some rewording. Perhaps something like:
It may be that some of the input variables are not informative for classifying by species, so in the interest of parsimony and in reducing the computational cost of model estimation, it is useful to quantify the importance of each variable in the dataset.
Closes https://github.com/pymc-devs/pymc-bart/issues/100
📚 Documentation preview 📚: https://pymc-examples--663.org.readthedocs.build/en/663/