jlstro commented 6 years ago

Pitch

Summary

The aim of this project is to use text analysis methods to test the texts of Germany's big online news sites for gender bias in the reported stories and news pieces.

I took inspiration from Neal Caren, who did something similar for the NYT, though I will probably use the TF/IDF method. We'll see...

Details

Possible headline(s):

Using data to show the gender bias in XYZ's reporting

In German media, Women do X, Men do Y

Data set(s): My first step is to scrape texts, and I started with Spiegel Online, the largest online news outlet. In a best-case scenario, I'll have data from 3-4 outlets covering the same significant time period; one year would be great. The backup plan is to use texts from one site only and compare over time.

Code repository: will update later

Possible problems/fears/questions: The biggest problem is scraping and cleaning the data. At some point I will also have to decide which tokens qualify as female or male.

Work so far

I did a tf/idf analysis for one month of Spiegel Online data as a test. The results are rather boring at this point (it's a ton of vectors), so no screenshots.
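For readers unfamiliar with tf/idf, here is a minimal sketch of the idea in plain Python. It is not the code used in this project: the two sample sentences, the `tokenize` regex, and the unsmoothed `log(N / df)` weighting are all assumptions made for illustration. Library implementations typically add smoothing and normalization.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters, including German umlauts and ß.
    return re.findall(r"[a-zäöüß]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency times inverse document frequency (no smoothing).
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical sample sentences, not taken from the scraped corpus.
docs = [
    "Der Minister sagte, er werde zurücktreten.",
    "Die Ministerin sagte, sie werde bleiben.",
]
scores = tf_idf(docs)
```

Words that occur in every document (here "sagte" and "werde") score zero, while words unique to one document stand out. That is also why the raw output is "a ton of vectors": every document gets one score per term.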

Checklist

This checklist must be completed before you submit your draft.

[x] I have already spent time with my data set, opening it, exploring it, etc
[ ] I have created a "DIARY.md" file to save links and list all of the terrible, no good problems I come across
[ ] My issue links to my data set(s)
[ ] My issue links to my code repository
[x] My issue explains what I'd like to explore in the data set
[ ] My issue includes images - either inspiration or what I've done so far

Palarisk commented 6 years ago

This is interesting work! I think it would be very interesting to compare over time, say, a month now and a month 10 or 20 years ago. Has anything changed?

adrianblanco commented 6 years ago

I'd also take a look at The Times' analysis and compare gender bias between two of the main media outlets in the US and Germany. Looking forward to seeing the results.

jlstro commented 6 years ago

Update

Your project content: images/words/etc

Bad news: Still no cool screenshots or drafts of graphs. I have some visualization ideas, but spent almost all of the energy on the scraping/data organizing part so far.

Good news: I have 5 years of full text data from the German news website now. It is a lot of text and will allow for a lot of mining :-)

Any changes in direction or topic?

Yes. Thanks to Adrian's suggestion, I will now make a thorough analysis of the German text corpus I got and compare it to the results of the New York Times.

Problems/Questions

Tokenizing is a problem, and so is the attribution of words as male or female. It is a lot of manual work.
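To make the attribution problem concrete, here is a rough sketch of labeling sentences by the gendered words they contain. The word lists are hypothetical and deliberately tiny, and the four labels are an assumption, not the scheme used in this project. Note that "sie" is left out on purpose: in German it can mean "she" or "they" (and capitalized "Sie" is the formal "you"), which is one reason the lists need manual curation.

```python
import re

# Hypothetical starter lists; a real project would need far longer, curated ones.
MALE_TOKENS = {"er", "mann", "männer", "herr", "vater", "sohn"}
FEMALE_TOKENS = {"frau", "frauen", "mutter", "tochter"}


def tokenize(sentence):
    # Lowercase and keep runs of letters, including German umlauts and ß.
    return re.findall(r"[a-zäöüß]+", sentence.lower())


def classify(sentence):
    """Label a sentence as 'male', 'female', 'both', or 'neutral'."""
    tokens = set(tokenize(sentence))
    has_male = bool(tokens & MALE_TOKENS)
    has_female = bool(tokens & FEMALE_TOKENS)
    if has_male and has_female:
        return "both"
    if has_male:
        return "male"
    if has_female:
        return "female"
    return "neutral"


# Hypothetical example sentences.
labels = [
    classify("Der Mann kauft ein Auto."),
    classify("Die Frau leitet das Unternehmen."),
    classify("Der Vater und die Tochter reisen."),
    classify("Es regnet in Hamburg."),
]
```

Matching on whole tokens matters here: a substring search for "er" would also hit "der" and mislabel most sentences.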

Checklist

[ ] I have included my visuals
[x] I have filled out the sections above
[ ] I have been updating my DIARY.md with details about my process
[x] I have uploaded/updated any Jupyter Notebooks or other datasets into my code repository

sarahslo commented 6 years ago

what if you divided the data by the gender of the writer? you could then see if men and women write differently about men and women. you wouldn't have to compare one org to another, you could compare it to itself. you could also compare it to the nyt if you've started that. but divide that by gender as well.

jlstro commented 6 years ago

Final

Project visuals/text

TEXT IS WORK IN PROGRESS

The project lives here

These are the charts I came up with: [three chart images]

Details

Headline:

Published website version:

Code repository: It's here and will still be cleaned and updated over the weekend

Final data set(s): too large to upload

What did you find to be the most difficult part of this project?

I ended up with almost 8 million sentences, each about 5 words long on average, I guess. My computer haunted me with memory errors in pandas when I tried to do fancier stuff such as tf/idf, so I ended up with a relatively simple count/percentage.
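One way such a count/percentage could be computed without holding all sentences in memory is to stream them through a `Counter`. This is only a sketch under assumptions: the `(label, tokens)` input format, the function names, and the sample data are hypothetical and not taken from the project's repository.

```python
from collections import Counter


def count_words_by_gender(labeled_sentences):
    """Consume (label, tokens) pairs one at a time and tally words per label."""
    counts = {"male": Counter(), "female": Counter()}
    for label, tokens in labeled_sentences:
        if label in counts:  # skip sentences labeled anything else
            counts[label].update(tokens)
    return counts


def female_share(counts, word):
    """Percentage of a word's occurrences that fall in female-labeled sentences."""
    female = counts["female"][word]
    male = counts["male"][word]
    total = female + male
    return 100.0 * female / total if total else None


def sample_sentences():
    # Stand-in generator; in practice this could read the corpus line by line.
    yield "male", ["er", "gewinnt", "das", "spiel"]
    yield "female", ["sie", "gewinnt", "die", "wahl"]
    yield "male", ["er", "verliert", "das", "spiel"]
    yield "neutral", ["es", "regnet"]


counts = count_words_by_gender(sample_sentences())
share = female_share(counts, "gewinnt")
```

Because the generator yields one sentence at a time, memory use grows with the vocabulary rather than with the number of sentences.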

Are you satisfied with what you produced? Is there anything you would like to change or improve?

Not really happy with it. Maybe I'll put some more work into it at a later stage. It needs more cleaning, and I'd also like to tackle the change-over-time analysis if I had more time.

Checklist

[x] I have included my visuals
[x] I have posted my project to my project website
[ ] I have been updating my DIARY.md with details about my process
[ ] I have uploaded/updated any Jupyter Notebooks or other datasets into my code repository

jsoma / data-studio-projects

[Projekt] Gender bias in news reporting #225
