henry-254 / CapstoneProjects

Capstone Projects Review 1 #1

Open okothchristopher opened 1 year ago

okothchristopher commented 1 year ago
  1. Rename the repository to Machine Learning/Data Science Capstone Projects.
  2. Create a subfolder for each capstone project. Each subfolder should contain:
    • The data,
    • The problem statement,
    • The solution, organised by tech stack, since you may use more than just a notebook for the solution,
    • A write-up, i.e. a final project report. The main aim of all of the above is reproducibility.
  3. Name each capstone appropriately once you have created the subfolders.
okothchristopher commented 1 year ago

I have sent you an email on some of the use cases to help you spruce up your final report. These centre on
the uses of being able to predict the probability of having a bank account based on demographic data. To summarise:

  1. Targeting financial services to the right people, for the banks, DFIs and the Government.
  2. Designing policies and programs to promote financial inclusion
  3. Monitoring the impact of financial inclusion initiatives
okothchristopher commented 1 year ago

You need to have sections in your notebook, e.g.:

  1. Data Exploration
  2. Data Prep
  3. Modelling
  4. Model Selection
  5. Model Improvement
  6. etc.
okothchristopher commented 1 year ago

Data exploration

  1. In data exploration, refrain from checking things like the number of unique values for continuous variables. Case in point: the number of unique values in age_of_respondent.
  2. The code block data.isnull().sum() is redundant, given that you have already run data.info() earlier in the notebook.
  3. When commenting on the insights gathered, avoid a third-person tone. E.g., instead of remarking "This can help understand the family dynamics within the dataset.", a better way to approach it would be "This helps us understand ...".
  4. Gender sensitivity is also key. E.g., avoid phrasing such as "was actually the head of the household himself", which assumes the head of the household is male.
  5. When exploring the distribution of categorical groups, always sort before plotting, so that the reader can clearly see where the majority lies.
  6. The section Visualization should be renamed to Bivariate EDA or something along those lines. Refer to the SVM project by Cyril.
  7. Input line 116: be consistent in how you communicate your outcomes. They should be separate sentences within the notebook, not commented-out code blocks.
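
Points 1, 2 and 5 above can be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code; the DataFrame contents and the column names (age_of_respondent, education_level) are assumptions made for the example.

```python
import pandas as pd

# Toy data; the values and column names are hypothetical stand-ins
# for the columns used in the notebook.
data = pd.DataFrame({
    "age_of_respondent": [24, 31, 31, 45, 52, 67, 19, 38],
    "education_level": ["Primary", "Secondary", "Primary", "Tertiary",
                        "Primary", "None", "Secondary", "Primary"],
})

# Continuous variable: summary statistics are more informative
# than a count of unique values.
age_summary = data["age_of_respondent"].describe()
print(age_summary)

# data.info() already lists the non-null count per column, so a separate
# data.isnull().sum() cell mostly repeats that information.
data.info()

# Categorical variable: value_counts() sorts by frequency (descending)
# by default, so plotting its result puts the majority group first.
education_counts = data["education_level"].value_counts()
print(education_counts)
# education_counts.plot(kind="bar") would then draw the bars largest-first.
```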

Data Preparation for Modelling/ Data Preprocessing

  1. Why LabelEncoder alone, and not a combination of LabelEncoder and OneHotEncoder for the appropriate scenarios? Look at the articles below for more:

  2. What is the rationale behind the correlation plot, and what insights have you gathered from it?

  3. Class imbalance treatment: you mention that "In this case i decided to do oversampling, as it gives more on model performance as opposed to downsampling." This is not a good enough reason to opt for oversampling. Please re-check the class imbalance techniques, each of which has its own advantages and disadvantages, then choose the most appropriate one, e.g. SMOTE.
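
As one possible reading of points 1 and 3, the sketch below mixes an ordinal encoding (for a feature with a natural order) with one-hot encoding (for a nominal feature), and balances only the training split. The data, column names and category order are hypothetical. It uses scikit-learn's OrdinalEncoder for the feature column, since LabelEncoder is intended for the target, and plain random oversampling via sklearn.utils.resample rather than SMOTE.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.utils import resample

# Toy data; all column names and categories are hypothetical.
df = pd.DataFrame({
    "education_level": ["Primary", "Secondary", "Tertiary", "Primary"] * 10,
    "job_type": ["Farming", "Self employed", "Formal", "Farming"] * 10,
    "bank_account": [0, 0, 1, 0] * 10,
})
X = df.drop(columns="bank_account")
y = df["bank_account"]

# Split first, so the test set keeps the original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Random oversampling of the minority class, on the training split only.
train = X_train.assign(bank_account=y_train)
majority = train[train["bank_account"] == 0]
minority = train[train["bank_account"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_up])

# Ordered categories get integer codes; unordered ones get one-hot columns.
encoder = ColumnTransformer(
    [
        ("ordinal",
         OrdinalEncoder(categories=[["Primary", "Secondary", "Tertiary"]]),
         ["education_level"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["job_type"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_train_enc = encoder.fit_transform(balanced.drop(columns="bank_account"))
y_train_bal = balanced["bank_account"]
X_test_enc = encoder.transform(X_test)  # fitted on training data only

print(y_train_bal.value_counts())
print(X_train_enc.shape, X_test_enc.shape)
```

SMOTE, from the separate imbalanced-learn package, would likewise be applied to the training split only, and after encoding, because it interpolates between numeric feature vectors.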

Modelling

  1. You cannot simply evaluate multiple models with different characteristics at random. The choice of candidate models needs to be informed by your EDA. Otherwise, trying a bunch of models suggests that you do not understand the use cases of each one. E.g., for data that is not linearly separable, which models would be best suited?
  2. You also cannot jump straight to hyperparameter tuning without a proper model selection process preceding it. This borders on the sacrilegious.
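
A model selection step of the kind described above might look like the following sketch. It assumes a synthetic, deliberately non-linearly separable dataset (scikit-learn's make_moons) in place of the project data, and compares one linear and one non-linear candidate with default settings under the same cross-validation folds, before any tuning. The candidate list and the F1 metric are illustrative choices, not requirements from this review.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data that no straight line separates well.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# Candidates chosen on the basis of what the (hypothetical) EDA showed:
# one linear baseline and one non-linear model.
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Same folds for every candidate, so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: mean F1 = {score:.3f}")

# Only the selected model would then go on to hyperparameter tuning.
best_name = max(scores, key=scores.get)
print("selected:", best_name)
```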

Model Evaluation

  1. It seems your data is not linearly separable, since even your best model, logistic regression, does not perform well.
  2. Try other, non-linear models and select their parameters appropriately.
  3. Given that this is a Zindi project, please predict the classes for the test data and post your score for my assessment.
