Open gregpawin opened 3 years ago
Created two Reddit-scraping functions and began gathering a collection of subreddits and keywords of interest for TF-IDF analysis. Requested subreddit and keyword suggestions from the group, which can be added to this Google Sheet. Will be taking a short break and returning on 07/07.
@KarinaLopez19 @zhao-li-github This issue has not had an update since 2021-08-03. If you are no longer working on this issue, please let us know. We would appreciate any closing comments on why work on this issue stopped, along with any notes that never got added to the issue. If you are still working on the issue, please provide an update using these guidelines.
@gregpawin Please reformat the Overview on this issue to conform to our new template for Lucky Parking:
### Dependencies
ANY ISSUE NUMBERS THAT ARE BLOCKERS OR OTHER REASONS WHY THIS WOULD LIVE IN THE ICEBOX
### Overview
WE NEED TO DO X FOR Y REASON
### Action Items
A STEP BY STEP LIST OF ALL THE TASK ITEMS THAT YOU CAN THINK OF NOW EXAMPLES INCLUDE: Research, reporting, etc.
### Resources/Instructions
REPLACE THIS TEXT -If there is a website which has documentation that helps with this issue provide the link(s) here.
@gregpawin Is this issue still in progress? I'm looking for projects for the Data Science team, and this looks like a good one to assign if it's available.
→ Used stemming to reduce each word in the bag-of-words to its stem → used the resulting list of words to visualise the top 20 words.
→ `parking_subreddit = subreddit.search('parking', time_filter='all')` As suggested, I tried changing the time filter to 'all' to get older Reddit data. The output (top 20 words) did not change much: the words mostly point to a shooting incident rather than parking issues.
→ Used a different search query to scrape Reddit data and then visualise the top 20 words: `parking_subreddit = subreddit.search('vandwellers', time_filter='all')`. The results seem relevant to the search query.
→ Going through the spaCy tutorial and redoing bag-of-words and TF-IDF using spaCy.
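For anyone picking this up, the search, stemming and top-20 steps above could be sketched roughly as below. This is only an illustration, not the code in the branch: `collect_titles` and `top_words` are hypothetical helper names, the subreddit name in the commented usage is an assumption, and the stemmer is passed in as a callable (for example NLTK's `PorterStemmer().stem`) so the counting logic does not depend on any particular NLP library.

```python
import re
from collections import Counter


def collect_titles(subreddit, query, time_filter="all", limit=100):
    """Return titles of submissions matching `query`.

    `subreddit` is expected to behave like a PRAW Subreddit, i.e. expose
    search(query, time_filter=..., limit=...) yielding objects with `.title`.
    """
    results = subreddit.search(query, time_filter=time_filter, limit=limit)
    return [submission.title for submission in results]


def top_words(texts, stem=lambda word: word, stopwords=frozenset(), n=20):
    """Count stemmed tokens across `texts` and return the `n` most common.

    `stem` defaults to the identity function; pass a real stemmer to
    merge word forms such as "parking" and "parked".
    """
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in stopwords:
                continue  # skip stopwords before stemming
            counts[stem(token)] += 1
    return counts.most_common(n)


# Hypothetical usage (needs Reddit API credentials, PRAW and NLTK installed):
#
# import praw
# from nltk.stem import PorterStemmer
#
# reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
# subreddit = reddit.subreddit("LosAngeles")  # assumed subreddit name
# titles = collect_titles(subreddit, "parking", time_filter="all")
# print(top_words(titles, stem=PorterStemmer().stem))
```

Keeping the stemmer and stopword list as parameters should make it easy to swap in spaCy lemmas or a custom list of park names later without touching the counting code.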
Progress: Removed the park names and related titles from the list of words. Working on n-grams. Blockers: Getting parking-related words to appear as the top words for further analysis. Availability: 6 hrs. ETA: This week.
Progress: Generated n-grams (unigrams, bigrams and trigrams) before the stopwords were removed. Blockers: Working on understanding LDA topic modelling. Availability: 6 hrs. ETA: This week.
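As a point of reference, n-gram counting without stopword removal can be sketched with the standard library alone. This is a minimal sketch and not the notebook's actual code: `ngrams` and `top_ngrams` are hypothetical names, and the tokenizer is a simple lowercase regex rather than NLTK or spaCy.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    # zip over n shifted views of the token list; stops at the shortest view
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(texts, n, k=20):
    """Count n-grams across `texts` (stopwords kept) and return the top `k`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)
```

Calling `top_ngrams(titles, 1)`, `top_ngrams(titles, 2)` and `top_ngrams(titles, 3)` would give the unigram, bigram and trigram counts respectively.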
### Dependencies
None
### Overview
To help understand the needs of people living in Los Angeles, one way to gather more user data is to analyze discussion boards such as Reddit. The Reddit-analysis branch contains some tools to help download Reddit data concerning parking issues in Los Angeles.
### Action Items
### Resources/Instructions
Reddit-analysis branch Starter notebook Python Reddit API Wrapper How to use PRAW NLP Cleaning Wikipedia Using NLTK