NLP analysis of Reddit data

hackforla / lucky-parking

Visualization of parking data to assist in understanding of the effects of parking policies on a neighborhood by neighborhood basis in the City of Los Angeles

https://www.hackforla.org/projects/lucky-parking.html


NLP analysis of Reddit data #176

Open gregpawin opened 3 years ago

gregpawin commented 3 years ago

Dependencies

None

Overview

To better understand the needs of people living in Los Angeles, we can gather more user data by analyzing discussion boards such as Reddit. The reddit-analysis branch contains tools for downloading Reddit data about parking issues in Los Angeles.

Action Items

[x] Download the reddit-analysis branch, or download and set up PRAW directly.
[ ] Try out different search terms to get the most relevant information about people's issues with parking in Los Angeles.
[x] Document exploratory data analysis using Jupyter notebooks:
- [ ] Starter code: https://github.com/hackforla/lucky-parking/blob/reddit-analysis/notebooks/1.0-gp-initial_eda.ipynb
[x] Create some code to clean the data
[ ] Do some classic NLP analysis, e.g. TF-IDF
[ ] (Optional) Use more modern NLP toolsets, e.g. spaCy
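
For the TF-IDF action item, here is a minimal sketch of what the scoring does, written with only the standard library so the arithmetic is visible. It is not the code on the reddit-analysis branch. The `tokenize` and `tf_idf` helpers and the sample post titles are made up for illustration, and in practice a library such as scikit-learn's `TfidfVectorizer` would likely be the more convenient choice.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Count, for each term, how many documents contain it at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical post titles, not real Reddit data.
posts = [
    "Street parking near my apartment is impossible",
    "Got a parking ticket during street sweeping",
    "Best tacos near downtown",
]
scores = tf_idf(posts)
```

Terms that appear in every document get a score of zero under this unsmoothed formula, which is the property that pushes generic words down and document-specific words up.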

Resources/Instructions

Reddit-analysis branch Starter notebook Python Reddit API Wrapper How to use PRAW NLP Cleaning Wikipedia Using NLTK

KarinaLopez19 commented 3 years ago

Created two Reddit-scraping functions and began gathering a collection of subreddits/keywords of interest for TF-IDF analysis. Requested subreddit/keyword suggestions from the group, which can be added to this Google Sheet. Will be taking a short break and returning on 07/07.

ExperimentsInHonesty commented 2 years ago

@KarinaLopez19 @zhao-li-github This issue has not had an update since 2021-08-03. If you are no longer working on this issue, please let us know. We would appreciate any closing comments on why work on this issue stopped, or any other notes that never got added to the issue. If you are still working on the issue, please provide an update using these guidelines:

Progress: "What is the current status of your project? What have you completed and what is left to do?"
Blockers: "Difficulties or errors encountered."
Availability: "How much time will you have this week to work on this issue?"
ETA: "When do you expect this issue to be completed?"
Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."

ExperimentsInHonesty commented 2 years ago

@gregpawin Please reformat the Overview on this issue to conform to our new template for Lucky Parking:

### Dependencies
ANY ISSUE NUMBERS THAT ARE BLOCKERS OR OTHER REASONS WHY THIS WOULD LIVE IN THE ICEBOX

### Overview
WE NEED TO DO X FOR Y REASON

### Action Items
A STEP BY STEP LIST OF ALL THE TASK ITEMS THAT YOU CAN THINK OF NOW EXAMPLES INCLUDE: Research, reporting, etc.

### Resources/Instructions
REPLACE THIS TEXT -If there is a website which has documentation that helps with this issue provide the link(s) here.

akhaleghi commented 2 years ago

@gregpawin Is this issue still in progress? I'm looking for projects for the Data Science team, and this looks like a good one to assign if it's available.

PratibhaNagesh commented 1 year ago

→ Used stemming to reduce the words in the bag-of-words to their stems → used this list of stemmed words to visualise the top 20 words.
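
A rough sketch of the stem-then-count step described above, for anyone picking this up. This is not the notebook code: `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer, and `STOPWORDS` is a tiny made-up list. Passing `stem=nltk.stem.PorterStemmer().stem` (assuming NLTK is installed) would be the more realistic option.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; NLTK's or spaCy's list would be more complete.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "my", "i", "was"}


def crude_stem(word):
    """Very rough suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_stems(texts, n=20, stem=crude_stem):
    """Count stemmed, non-stopword tokens and return the n most common."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                counts[stem(word)] += 1
    return counts.most_common(n)


# Hypothetical titles, not real Reddit data.
sample = ["Parking tickets and parked cars", "The ticket for parking"]
top = top_stems(sample, n=2)
```

The `(stem, count)` pairs returned by `top_stems` can be fed straight into a bar chart for the top 20 words.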

→ As suggested, I changed the time filter to 'all' to get older Reddit data: parking_subreddit = subreddit.search('parking', time_filter='all'). The output (top 20 words) did not change much; the words mostly point to a shooting incident rather than parking issues.

→ Used different search criteria to scrape Reddit data and then visualised the top 20 words: parking_subreddit = subreddit.search('vandwellers', time_filter='all'). The results seem relevant to the search criteria.
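
To make trying several search terms easier, the searches above could be wrapped in a small helper like the sketch below. It assumes PRAW is installed and that API credentials are available in the environment; the environment variable names, the user agent string, the default subreddit name and both function names are hypothetical. As far as I know, `Subreddit.search` already defaults to `time_filter='all'`, which would be consistent with the output not changing when the filter was set explicitly.

```python
import os


def make_subreddit(name="LosAngeles"):
    """Build a read-only PRAW Subreddit (needs `pip install praw` and API credentials).

    The subreddit name and environment variable names are assumptions.
    """
    import praw  # imported lazily so collect_titles can be used without PRAW

    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="lucky-parking-eda (hypothetical user agent)",
    )
    return reddit.subreddit(name)


def collect_titles(subreddit, queries, time_filter="all", limit=100):
    """Run each query and return {submission id: title}, de-duplicated across queries."""
    titles = {}
    for query in queries:
        for submission in subreddit.search(query, time_filter=time_filter, limit=limit):
            titles[submission.id] = submission.title
    return titles
```

Usage would look like `collect_titles(make_subreddit(), ["parking", "parking ticket", "street sweeping"])`. Keying on the submission id keeps posts that match more than one query from being counted twice in the word frequencies.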

→ Going through the spaCy tutorial and redoing the bag-of-words and TF-IDF analysis using spaCy.

PratibhaNagesh commented 1 year ago

Progress: Removed the park names and related titles from the list of words. Working on n-grams.
Blockers: Getting parking-related words to rank as the top words for further analysis.
Availability: 6 hrs.
ETA: This week.

PratibhaNagesh commented 1 year ago

Progress: Generated n-grams (unigrams, bigrams, and trigrams) before removing the stopwords.
Blockers: Working on understanding LDA topic modelling.
Availability: 6 hrs.
ETA: This week.
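
For reference, the n-gram step mentioned in these updates can be sketched with the standard library alone. This is an illustration rather than the notebook code; the `ngrams` and `ngram_counts` names are made up, tokenization is a plain whitespace split, and stopwords are deliberately left in, matching the update above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all length-n windows over a token list as tuples."""
    # zip stops at the shortest shifted copy, so short inputs yield no n-grams.
    return list(zip(*(tokens[i:] for i in range(n))))


def ngram_counts(texts, n):
    """Count n-grams across texts using lowercase whitespace tokens."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts


# Hypothetical titles, not real Reddit data.
bigrams = ngram_counts(["No parking", "No parking zone"], 2)
```

Calling `ngram_counts(titles, 1)`, `ngram_counts(titles, 2)` and `ngram_counts(titles, 3)` gives the unigram, bigram and trigram tables, and `.most_common(20)` on each gives the top 20.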