Open cfelke opened 6 years ago
Hi Catharina
I like that you are trying to do a TF-IDF project. Even though I am supposed to talk about your project here, I want to share Sarah's feedback on a text analysis project I was working on, since it might help you:
«[…]To handle word counts people often do bubble charts. And working with language is really rough because 1 word can have so many meanings. Just having the words doesn't tell me what type of press release it is.[…]»
I think you should focus on a particular question. For example: are there regions whose releases are very similar or very different? This is where you can use the strength of the model. Otherwise you would be better off reading the press releases yourself, as you can make more sense of them than the script can.
I am really looking forward to seeing the results of your analysis, and how your TF-IDF worked out and sailed into the harbour ;) !
Hey wait, that's what I said! ;)
Perhaps you can look for one word and then see if a set of other words regularly accompanies it. Or you can start by gathering major events that the EPA will have to deal with, and then see how many press releases come out around those events and whether the sentiment shifts.
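The "one word and its companions" idea can be sketched with nothing but the standard library. This is only an illustration: the `companions` helper, the tokenizer and the three toy releases are made up for the example and are not taken from the project's code or data.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def companions(releases, target, top_n=5):
    """Count the words that share a release with `target`.

    Each word is counted once per release, so a count is the number of
    releases in which that word appears alongside the target word.
    """
    counts = Counter()
    for text in releases:
        words = set(tokenize(text))
        if target in words:
            counts.update(words - {target})
    return counts.most_common(top_n)

# Hypothetical toy data standing in for the scraped press releases.
releases = [
    "EPA announces superfund cleanup grant for river site",
    "Superfund cleanup begins at former smelter",
    "EPA awards grant for school bus replacement",
]
print(companions(releases, "superfund"))
```

On real data a stop word list would probably be needed, otherwise words like "for" and "the" will accompany almost everything.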
I'm glad we are all feeling the TF-IDF love.
I'm in the same boat with one of my projects: figuring out how to visualize this is tricky. Perhaps there is a way to show spikes in similarity/difference between two press releases? This way you would not need to worry about pulling out individual words and counting them (laziness in all things). You could map time against a numeric change from the previous statement. Perhaps you will discover interesting spikes due to a change in leadership?
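A rough sketch of "numeric change from the previous statement", assuming each release is a `(date, text)` pair with an ISO-formatted date. To keep it short it compares raw word counts with cosine similarity; TF-IDF weights could be swapped in for the counts. The function names and the toy releases are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Map each lowercase word in a text to its number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count mappings (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def change_from_previous(dated_releases):
    """Return (date, 1 - cosine similarity to the previous release) pairs."""
    ordered = sorted(dated_releases)  # ISO dates sort chronologically as strings
    bags = [bag_of_words(text) for _, text in ordered]
    return [
        (ordered[i][0], 1.0 - cosine(bags[i - 1], bags[i]))
        for i in range(1, len(ordered))
    ]

# Hypothetical toy data standing in for the scraped press releases.
dated_releases = [
    ("2017-01-10", "EPA announces clean water grant"),
    ("2017-01-17", "EPA announces clean air grant"),
    ("2017-02-01", "Administrator visits coal mine"),
]
for date, change in change_from_previous(dated_releases):
    print(date, round(change, 2))
```

Plotting the second value against the date would give the time series described above, with values near 1 marking releases that share almost no vocabulary with the one before.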
Pitch
Summary
I scraped all the press releases from the Environmental Protection Agency (EPA) since 2017 to analyze their content by applying TF-IDF.
Details
I have the date, the name of the EPA agency (= region) that published the press release, and the actual text of each press release. I want to know which topics the EPA prefers to talk about (clusters) and what it chooses to emphasize (which words are used most).
Later, I could compare the various EPA agencies to one another regarding the content of their press releases. Which topics are most important in each region? Are there any differences? Are there any specific dates when many press statements related to one topic/issue were published?
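One possible way to approach the regional comparison, sketched under assumptions: all releases of a region are merged into a single document, and each word is scored with plain `tf * log(N / df)`, where `N` is the number of regions and `df` the number of regions using the word. This is not the code in the repository; the function name, the `(region, text)` row layout and the toy rows are invented for the example, and library implementations often use smoothed variants of the formula, so their scores will differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms_by_region(rows, top_n=3):
    """Rank each region's words by TF-IDF, one merged document per region."""
    region_counts = {}
    for region, text in rows:
        region_counts.setdefault(region, Counter()).update(tokenize(text))

    # Document frequency: in how many regions does each word occur?
    n_regions = len(region_counts)
    doc_freq = Counter()
    for counts in region_counts.values():
        doc_freq.update(counts.keys())

    result = {}
    for region, counts in region_counts.items():
        total = sum(counts.values())
        scores = {
            word: (count / total) * math.log(n_regions / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        result[region] = ranked[:top_n]
    return result

# Hypothetical toy rows standing in for the scraped dataframe.
rows = [
    ("Region 1", "EPA funds harbor cleanup in Boston"),
    ("Region 1", "EPA harbor dredging update"),
    ("Region 9", "EPA wildfire smoke guidance"),
    ("Region 9", "EPA responds to wildfire debris"),
]
for region, terms in top_terms_by_region(rows).items():
    print(region, [word for word, _ in terms])
```

Words that every region uses, such as "EPA" here, get a score of zero, which is what makes the remaining terms useful for spotting differences between regions.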
Possible headline(s):
Data set(s): I scraped the press releases from the EPA's official site
Code repository: https://github.com/cfelke/data-studio/tree/master/code/04_EPA_press_releases
Possible problems/fears/questions: TF-IDF may cause some headaches, yep.
Work so far
I scraped the data and saved it in a dataframe. This project isn't so much about design as about text analysis, so I'm not sure to what extent I can visualize my findings in the end. BuzzFeed did some text analysis; I drew my inspiration from them, though conceptually rather than visually.
Checklist
This checklist must be completed before you submit your draft.
[x] I have already spent time with my data set, opening it, exploring it, etc
[x] I have created a "DIARY.md" file to save links and list all of the terrible, no good problems I come across
[x] My issue links to my data set(s)
[x] My issue links to my code repository
[x] My issue explains what I'd like to explore in the data set
[x] My issue includes images - either inspiration or what I've done so far