gimseng / 99-ML-Learning-Projects

A list of 99 machine learning projects for anyone interested in learning by coding and building projects

Added new chapter - 006 #70

Closed (anandravishankar12 closed this pull request 3 years ago)

anandravishankar12 commented 3 years ago

Reference Issues/PRs

What does this implement/fix? Explain your changes.

I have added a new chapter that covers ensemble techniques such as Bagging, Boosting, and Random Forest, among others. As this is my first contribution to any OSS project, please let me know what changes need to be made. Meanwhile, I will add stacking to the set of notebooks.
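
For readers following along, here is a minimal sketch of the kind of ensemble comparison the chapter describes, using the scikit-learn breast-cancer data as a stand-in (the notebook's actual data, models, and settings may differ):

```python
# Hedged sketch: compare a few ensemble methods with cross-validation.
# The dataset and hyperparameters here are illustrative stand-ins only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "bagging": BaggingClassifier(n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```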

Any other comments?

gimseng commented 3 years ago

@anandravishankar12 Thanks for the contribution! I briefly looked through the code and will review it in detail later in the week. Just a few comments:

  1. Please rename the folder to 006. We have 005 in the pipeline but still haven't finished polishing it yet.

  2. Can you please combine all the .ipynb files into one? That way, you can also discuss in that single notebook which method performs better and what the pros and cons of the various methods are.

  3. How did you fine-tune the hyperparameters of each model? Did you do cross-validation for model selection?

  4. Finally, maybe add more words/documentation to the .ipynb file. It helps both you and the readers understand what each line or section is doing.

  5. (optional) For future contributions, maybe create a branch off your fork and give it a meaningful name, like ensemble. It's more of a best practice.

Overall, it looks great, and thanks for the contribution!

gimseng commented 3 years ago

link to #52

anandravishankar12 commented 3 years ago

@gimseng Thanks for providing insight into the requirements. I have made some changes:

  1. Folder renamed to 006.

  2. All the methods are now combined into a single .ipynb. I will include the pros and cons in the next commit along with any additional changes. (Sorry, I was in a bit of a rush.)

  3. Hyper-parameter tuning was done by searching over a set of candidate configurations. Cross-validation happened automatically in the last step, with different subsets of the dataset used to train the different base learners (see the sketch after this list).

  4. Added some documentation. I will add line-by-line documentation in the next commit.
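
A rough sketch of the configuration-set search described in point 3, using GridSearchCV (the grid, estimator, and data below are illustrative assumptions, not the notebook's exact setup):

```python
# Hedged sketch: cross-validated search over a small set of candidate
# hyperparameter configurations; the grid and estimator are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```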

gimseng commented 3 years ago

@anandravishankar12 Thanks! I'll wait for your new updates before running the code.

Meanwhile, could you double-check your readme.md in the exercise folder? The task section seems to be copy-pasted from another project. I think it'd be helpful to say a few words about the data and the problem statement (is it classification or regression, and is the goal to predict survivability, etc.?). Thanks!

anandravishankar12 commented 3 years ago

@gimseng Thanks for pointing that out :P I was using the previous chapters as a reference. I have modified it and added a readme in /data. The problem statement is now specified in /exercise/readme.md.

I have added comments in the areas where I thought they were needed. Since in most cases only the classifier changes, I didn't think repeating the comments was necessary, so I have left those out for the time being. Kindly review it and let me know.

Cheers

gimseng commented 3 years ago

@anandravishankar12 Thanks for the updates. I think you have addressed my previous suggestions, and it's great that there are more explanations/docs in the notebook. There are still a few things that I think would be helpful for readers:

  1. Could you move all the library imports to the beginning of the notebook, perhaps before preprocessing? It's good practice and avoids repetition.

  2. The train-test split and predictor drop seem to happen more than once. Could you consolidate them into a single section, e.g. a preprocessing section before the first model (i.e., the bagging section)? That way we don't need to repeat them (see the sketch after this list).

  3. Could you double-check that the bagging (random forest) section runs without error for you? I just ran through everything in Colab, and it failed at the brf.fit(X_train, y_train) step of this section; all other sections ran fine. The error was ValueError: object of too small depth for desired array.

  4. In bagging (decision tree), it threw an error about safe_indexing; if you get the same thing, perhaps try to suppress the warning or work around it. I also had trouble running the stacking section: when you declared StackingCVRegressor, it threw a TypeError complaining about random_state. See if you get the same errors on Colab. Maybe you are using different dependencies/versions of the libraries; if so, please specify them.
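
To illustrate points 1 and 2, here is a rough sketch of a single imports-plus-preprocessing cell (the file path and column names are hypothetical placeholders, not the project's actual ones):

```python
# Hedged sketch of one consolidated "imports + preprocessing" section so the
# train-test split and predictor drop happen only once.
# The CSV path and column names below are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/breast_cancer.csv")      # hypothetical path
y = df["diagnosis"]                             # hypothetical target column
X = df.drop(columns=["diagnosis", "id"])        # drop target and id once

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Later model sections (bagging, boosting, stacking, ...) reuse these splits.
```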

Overall, I think you have done a great job collecting very useful ensemble techniques in one project. Thanks for including more docs on both the theory and the coding side. Once the suggestions above have been addressed, it looks good to me to merge.

gimseng commented 3 years ago

@anandravishankar12 There is also something strange about the column names: for example texture_0ean, and various other column names end with 0ean. Do you know what's going on? Is it a bug in preprocessing or an artifact of the original data labels?

anandravishankar12 commented 3 years ago

Yeah, it wasn't a bug in preprocessing. I had a previous project related to this dataset in which the letter 'm' was recurring in some values, so I did a Find All + Replace, and every 'm' got replaced. I'll change the column names back in the next commit.
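
For reference, a rename along these lines would undo it (assuming the mangled suffix is literally 0ean and the DataFrame is called df; both are assumptions):

```python
import pandas as pd

# Hedged sketch: undo the accidental replacement of 'm' in column names.
# The columns here are illustrative; only the rename pattern matters.
df = pd.DataFrame(columns=["texture_0ean", "radius_0ean", "diagnosis"])
df.columns = df.columns.str.replace("0ean", "mean", regex=False)
print(list(df.columns))  # ['texture_mean', 'radius_mean', 'diagnosis']
```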

anandravishankar12 commented 3 years ago

@gimseng I have made the following changes:

  1. Created a separate section for importing libraries.
  2. I have used the provided dataset for everything except the AdaBoost classifier (for which I used the make_moons dataset). My reasoning was that I wanted to first introduce the reader to the techniques themselves and then show how to visualize the decision boundaries; however, this dataset's boundaries weren't clear, so I used a built-in dataset for that part. I have reverted to the original dataset for the rest of the classifiers.
  3. I encountered the same issues on Colab and have corrected them. As for the libraries and dependencies, the project currently runs on the latest versions.
  4. I've re-modeled the stacking classifier; it works flawlessly now.
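
For readers of this thread, a minimal stacking setup with scikit-learn's built-in StackingClassifier looks roughly like the sketch below; this is an assumption about the general shape, not necessarily the notebook's re-modeled version:

```python
# Hedged sketch of a stacking classifier; the notebook's actual base
# learners and meta-learner may differ.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # internal cross-validation used to build the meta-features
)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))
```
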
gimseng commented 3 years ago

@anandravishankar12 Thanks for the updates! I am able to run all the code, and the analysis is quite comprehensive; it is really useful for me personally.

One last thing before I merge: regarding the AdaBoost part using a different dataset, could you comment on this either at the beginning of the notebook or in the readme.md file, just so it's clear to learners who are using the notebook without following this PR? I was a bit confused by the switch to a different dataset. Another option is that in the AdaBoost section you could use the moons dataset for illustration but also include the breast-cancer dataset and comment on how its decision boundary is not clear. This would help learners understand the limitations of this method too. I'll leave the decision to you.
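
For anyone reading along, here is a rough sketch of the moons-based illustration being discussed (the notebook's actual plotting code may differ):

```python
# Hedged sketch: AdaBoost decision boundary on the make_moons dataset.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

# Evaluate the classifier on a grid to draw the decision boundary.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.title("AdaBoost decision boundary on make_moons")
plt.show()
```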

Overall, great job! I will merge after you have sorted out the point raised above. Thanks!

anandravishankar12 commented 3 years ago

@gimseng I have provided information regarding the plotting of decision boundaries towards the end of /006/exercise/readme.md. I decided to leave it to the reader to make the plots for this dataset; it is a bit more of a challenge compared to plotting a ready-made dataset.

Cheers

gimseng commented 3 years ago

OK @anandravishankar12, thanks! I'll merge your project. Thanks for contributing!