kiranrawat / Detecting-Fake-News-On-Social-Media

Flask web application that aims to predict fake news over social media using NLP and Machine Learning.

Round 1 comments, suggestions and questions #1

Closed rfazeli closed 3 years ago

rfazeli commented 3 years ago

Great work so far! The explanations in your feature_selection.ipynb and Modeling.ipynb notebooks make them very easy to follow and show your knowledge of ML theory.

Here is my feedback:

General Suggestions

  1. Usually you don't add your data to git. Instead, you can add a link (e.g. Google Drive) for downloading the data, or a small script that downloads it from its source. In this case we can skip that, since the data is relatively small.
  2. Put your .csv files in a data/ folder and your notebooks in a notebooks/ folder. Later on, when you refactor your notebooks into Python scripts, you can put the scripts in a scripts/ or src/ folder.
  3. Add more explanation to your Data_Preparation.ipynb notebook. Use markdown cells to add a heading for each section and explain what is happening in that section, what your thinking was for doing certain things, and how you interpret the results. For example, you could explain why you created the different distribution plots, what they imply, and how they impact your decisions down the road.
  4. It'd also be good to explain why you're using a different set of features for the Naive Bayes pipeline compared to the other classifiers you explore in the Modeling.ipynb notebook.
  5. Add explanations for SVM and Random Forest similar to what you have for Naive Bayes and Logistic Regression in the Modeling.ipynb notebook.
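Suggestion 2 could be applied with a small one-off script along these lines (the folder names match the suggestion; the helper itself and any file names are hypothetical):

```python
from pathlib import Path

def organize(repo_root: str) -> None:
    """Sketch of the suggested layout: .csv files under data/,
    notebooks under notebooks/, refactored scripts under src/."""
    root = Path(repo_root)
    targets = {"*.csv": "data", "*.ipynb": "notebooks", "*.py": "src"}
    for pattern, folder in targets.items():
        dest = root / folder
        dest.mkdir(exist_ok=True)
        for f in root.glob(pattern):  # only top-level files
            f.rename(dest / f.name)
```

In practice you'd use git mv (or re-add the files after moving) so history is preserved.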

Specific Comments

  1. I think you need to print() this line in order to see the result. https://github.com/kiranrawat/Detecting-Fake-News-On-Social-Media/blob/b0e9aee3cbdc2845a2f0626c060caf16ffcba118/Data_Preparation.py#L131
  2. It's better to read in the .csv files directly as opposed to importing Data_Preparation just to access the dataframes https://github.com/kiranrawat/Detecting-Fake-News-On-Social-Media/blob/b0e9aee3cbdc2845a2f0626c060caf16ffcba118/feature_selection.py#L7
  3. I think you mean "fake or not" instead of "spam or not" https://github.com/kiranrawat/Detecting-Fake-News-On-Social-Media/blob/b0e9aee3cbdc2845a2f0626c060caf16ffcba118/feature_selection.py#L50
  4. You mention stemming but you don't apply it anywhere https://github.com/kiranrawat/Detecting-Fake-News-On-Social-Media/blob/b0e9aee3cbdc2845a2f0626c060caf16ffcba118/feature_selection.py#L107
  5. Your explanations in this section and the extreme scenarios you consider are perfect. Just make sure you format it a bit nicer so that it's easy to read. https://github.com/kiranrawat/Detecting-Fake-News-On-Social-Media/blob/b0e9aee3cbdc2845a2f0626c060caf16ffcba118/Modeling.py#L64
  6. Again, it's probably less confusing to import the CountVectorizer and TfidfTransformer classes directly from sklearn and recreate these instances, as opposed to importing them from the feature_selection.ipynb notebook https://github.com/kiranrawat/Detecting-Fake-News-On-Social-Media/blob/b0e9aee3cbdc2845a2f0626c060caf16ffcba118/Modeling.py#L95

Questions

  1. It seems like the .csv files in the main directory are clean/processed versions of the .tsv files in liar_dataset/. Is that correct? If so, you can put the original .tsv files in data/raw/ and the processed files in data/processed/.
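That split could look like the following (run inside a scratch demo/ directory here so the sketch is self-contained; the file names are made up, and in the real repo you'd use git mv so history is preserved):

```shell
# Suggested raw/processed layout, demonstrated on dummy files.
mkdir -p demo/data/raw demo/data/processed demo/liar_dataset
touch demo/liar_dataset/train.tsv demo/train.csv

# Originals go to data/raw/, cleaned versions to data/processed/.
mv demo/liar_dataset/*.tsv demo/data/raw/
mv demo/*.csv demo/data/processed/
```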
kiranrawat commented 3 years ago

Hi Reza, I have resolved the issue.