Open jlstro opened 6 years ago
This is interesting work! I think it would be very interesting to compare over time, say, a month now and a month 10 or 20 years ago. Has anything changed?
I'd also take the Times' analysis into account and compare gender bias between two of the main media outlets in the US and Germany. Looking forward to seeing the results.
Bad news: Still no cool screenshots or drafts of graphs. I have some visualization ideas, but spent almost all of the energy on the scraping/data organizing part so far. Good news: I have 5 years of full text data from the German news website now. It is a lot of text and will allow for a lot of mining :-)
Yes. Thanks to Adrian's suggestion, I will now make a thorough analysis of the German text corpus I got and compare it to the results of the New York Times.
The tokenizing is a problem. Also the attribution of male/female words. Lot of manual work.
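To make the tokenizing and attribution problem concrete, here is a minimal sketch of one way to tag sentences, assuming hand-made word lists. It is not the code used in this project: the names `MALE_WORDS`, `FEMALE_WORDS`, `split_sentences`, `tokenize` and `classify` are hypothetical, and the word lists are tiny placeholders. Pronouns are left out on purpose because German "sie" can mean "she", "they" or formal "you", which is exactly the kind of case that needs manual decisions.

```python
import re

# Hypothetical, deliberately tiny word lists (assumption, not the project's lists).
# Ambiguous pronouns such as "sie" or "ihr" are left out on purpose.
MALE_WORDS = {"mann", "männer", "herr", "vater", "sohn", "bruder"}
FEMALE_WORDS = {"frau", "frauen", "dame", "mutter", "tochter", "schwester"}


def split_sentences(text):
    """Naive split after ., ! or ? followed by whitespace.

    Abbreviations such as "z. B." will be split incorrectly.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and return its word tokens (umlauts included)."""
    return re.findall(r"\w+", sentence.lower())


def classify(sentence):
    """Return 'male', 'female', 'both' or 'none' based on the word lists."""
    tokens = set(tokenize(sentence))
    has_male = bool(tokens & MALE_WORDS)
    has_female = bool(tokens & FEMALE_WORDS)
    if has_male and has_female:
        return "both"
    if has_male:
        return "male"
    if has_female:
        return "female"
    return "none"
```

Sentences tagged "both" or "none" would probably be dropped before counting, so that each remaining sentence counts for exactly one group.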
what if you divided the data by the gender of the writer? you could then see if men and women write differently about men and women. you wouldn't have to compare one org to another, you could compare it to itself. you could also compare it to the nyt if you've started that. but divide that by gender as well.
TEXT IS WORK IN PROGRESS
The project lives here
These are the charts I came up with:
Headline:
Published website version:
Code repository: It's here and will still be cleaned and updated over the weekend
Final data set(s): too large to upload
I ended up with almost 8 million sentences, each about 5 words long on average, I guess. My computer haunted me with memory errors in pandas when I tried to do fancier stuff such as tf/idf, so I ended up with a relatively simple count/percentage.
Not really happy with it. Maybe I'll put some more work into it at a later stage. It needs more cleaning, and I'd also like to tackle the change-over-time analysis if I had more time.
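A simple count/percentage like the one described above can be computed in a single streaming pass, which avoids holding millions of sentences in a DataFrame. The following is only a sketch under assumptions, not the project's code: it assumes each sentence has already been tagged "male" or "female" and tokenized, and the function names `count_words` and `female_share` are hypothetical.

```python
from collections import Counter


def count_words(tagged_sentences):
    """Count word occurrences per group in one pass.

    `tagged_sentences` is any iterable of (label, tokens) pairs, so it can be
    a generator reading from disk. Labels other than 'male' and 'female'
    are ignored.
    """
    counts = {"male": Counter(), "female": Counter()}
    for label, tokens in tagged_sentences:
        if label in counts:
            counts[label].update(tokens)
    return counts


def female_share(counts, min_count=1):
    """For each word, the percentage of its occurrences in 'female' sentences.

    Words seen fewer than `min_count` times in total are skipped.
    """
    shares = {}
    for word in set(counts["male"]) | set(counts["female"]):
        m = counts["male"][word]
        f = counts["female"][word]
        if m + f >= min_count:
            shares[word] = 100.0 * f / (m + f)
    return shares
```

Note that this raw share is not normalized: if there are far more "male" than "female" sentences, most words will lean male for that reason alone, so dividing each count by the group's total word count first may be a fairer comparison.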
Pitch
Summary
The aim of this project is to use text analysis methods to test the texts of Germany's big online news sites for gender bias in the reported stories and news pieces.
I took inspiration from Neal Caren, who did something similar for the NYT. I will probably make use of the TF/IDF method though, but we'll see...
Details
Possible headline(s):
Using data to show the gender bias in XYZ's reporting
In German media, Women do X, Men do Y
Data set(s): My first step is to scrape texts, and I started with Spiegel Online, the largest online news outlet. In a best-case scenario, I'll have data from 3-4 outlets from a significant (identical) time period; one year would be great. The backup plan is to use texts from one site only and compare over time.
Code repository: will update later
Possible problems/fears/questions: The biggest problem is the scraping and cleaning of the data. Then I will at some point have to make a decision about which tokens I qualify as female/male.
Work so far
I did a tf/idf analysis for one month of Spiegel Online data as a test. The results are rather boring at this point (it's a ton of vectors), so no screenshots.
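For readers unfamiliar with tf/idf, here is a minimal pure-Python sketch of the idea. It is not the analysis run on the Spiegel Online data: the function name `tf_idf` is hypothetical, and it uses the plain textbook weighting (term frequency times `log(N / document frequency)`), whereas libraries often apply smoothing and normalization on top.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    `documents` is a list of token lists. A term that appears in every
    document gets a score of 0, because log(N / N) == 0.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

Treating all "male" sentences as one document and all "female" sentences as another would be one possible way to apply this to the gender question, though with only two documents the idf term carries very little information.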
Checklist
This checklist must be completed before you submit your draft.