Open gimseng opened 3 years ago
Hey! I would like to work on this. I was thinking of a beginner-friendly iris plant classification project using KNN on the iris dataset. Let me know your thoughts!
Hi @Sayoni26, please go ahead and implement the code. Thanks for contributing! Please read the contributing guidelines and look at previous projects in this repo to understand the format and organization. Looking forward to your PR.
Hey @gimseng, can I try this code? I'm new to machine learning, but I can definitely do decision trees.
Since comments and replies take a lot of time, I'm making a PR. Please check it and approve it. I am a first-timer here.
Hello! I would love to make a contribution. Since I'm also still learning, I'd love to help fellow learners to understand KNN using simple explanations. Thanks!
Hello @gimseng,
I'm enthusiastic about contributing to this task and assisting learners in comprehending the strategies for handling imbalanced datasets effectively. I am interested in creating an informative guide that covers various techniques to address class imbalance in datasets, spanning from simple approaches like resampling to more advanced methods like ensemble techniques and using specialized algorithms.
My plan is to develop a comprehensive tutorial that encompasses the following key aspects:
Introduction to Handling Imbalanced Datasets: Providing an overview of why dealing with class imbalance is crucial in machine learning and the potential challenges it poses.
Resampling Techniques: Explaining the concept of resampling, including both oversampling (e.g., SMOTE) and undersampling approaches, and when and how to use them. I'll provide practical code examples to demonstrate how to implement these techniques using popular libraries.
Cost-Sensitive Learning: Discussing the concept of cost-sensitive learning and how it can be used to assign different misclassification costs to different classes. I'll include code examples to illustrate its implementation.
Ensemble Techniques: Introducing ensemble methods as a way to improve classification performance on imbalanced datasets. I'll explain how techniques like Balanced Random Forest and EasyEnsemble work and provide code examples.
Using Specialized Algorithms: Highlighting algorithms specifically designed to handle imbalanced data, such as the Adaptive Synthetic Sampling (ADASYN) algorithm. I'll walk through how to use these algorithms and showcase their impact.
Comparative Analysis: Comparing the effects of different techniques on an imbalanced dataset, including their impact on model performance, precision, recall, and F1-score. Visualizations will be included to help learners understand these differences.
Discussion: Engaging in a discussion about the scenarios in which each technique is most suitable, considering the nature of the dataset, the algorithm, and the problem at hand.
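As a rough illustration of two of the items above (random oversampling and cost-sensitive learning), here is a minimal sketch using only scikit-learn. The synthetic dataset from `make_classification`, the logistic regression model, and the helper name `minority_recall` are assumptions made for this example; none of them come from the proposal, and SMOTE or ADASYN would need a separate library such as imbalanced-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data (an assumption): roughly 95% class 0, 5% class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Random oversampling: draw minority rows with replacement until both
# classes have the same count. Only the training split is resampled.
is_minority = y_train == 1
n_majority = int((~is_minority).sum())
X_min_up, y_min_up = resample(
    X_train[is_minority],
    y_train[is_minority],
    replace=True,
    n_samples=n_majority,
    random_state=0,
)
X_over = np.vstack([X_train[~is_minority], X_min_up])
y_over = np.concatenate([y_train[~is_minority], y_min_up])


def minority_recall(model, X_fit, y_fit):
    """Hypothetical helper: fit the model, return recall on class 1 of the test split."""
    model.fit(X_fit, y_fit)
    return recall_score(y_test, model.predict(X_test))


results = {
    # No correction for the imbalance.
    "baseline": minority_recall(LogisticRegression(max_iter=1000), X_train, y_train),
    # Same model, trained on the oversampled data.
    "oversampled": minority_recall(LogisticRegression(max_iter=1000), X_over, y_over),
    # Cost-sensitive learning: errors are weighted inversely to class frequency.
    "class_weight": minority_recall(
        LogisticRegression(max_iter=1000, class_weight="balanced"), X_train, y_train
    ),
}

for name, recall in results.items():
    print(f"{name:>12}: minority recall = {recall:.2f}")
```

Recall on the minority class is only one view of the trade-off; precision and F1-score, as listed under the comparative analysis, would need to be reported alongside it.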
For datasets, I'm considering:
Credit Card Fraud Detection Dataset: A widely-used imbalanced dataset, suitable for illustrating the application of various techniques.
Diabetes Classification Dataset: To showcase the handling of class imbalance in a medical context.
Online Retail Dataset: For demonstrating the impact of imbalanced datasets on a real-world e-commerce scenario.
I'm open to feedback and suggestions regarding this plan. My aim is to create a user-friendly and informative resource that equips learners with the knowledge and tools to tackle class imbalance effectively in their machine learning projects.
Best regards, Anurav Modak
Learning Goals
Learn the kNN algorithm for supervised classification. Preferably use the kNN implementation from scikit-learn.
Prerequisites
Some basic knowledge of kNN is assumed. If scikit-learn is used, basic familiarity with installing the scikit-learn library is also assumed.
Data source/summary:
I'm agnostic about which dataset to use, so anything suggested from a textbook exercise/blog is good.
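For reference, a minimal sketch of what such a project could look like, assuming the iris dataset suggested in the comments and scikit-learn's `KNeighborsClassifier`. The choice of `n_neighbors=5`, the feature scaling step, and the 75/25 split are illustrative assumptions, not requirements of this issue.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the iris dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# kNN is distance-based, so features are standardized before classification.
# n_neighbors=5 is an arbitrary starting point (an assumption), not a tuned value.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

A natural next step for learners would be to vary `n_neighbors` and compare the resulting test accuracy.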