I've spent the last week reading up on Weka and how to format its .arff training set files. I've started compiling my training set, which so far has 50 articles relating to women and 50 articles not relating to women, gathered from news sources including The Guardian, The New York Times, USA Today, and The Washington Post (https://github.com/gw-sd-2016/NewsTextAnalysis/commit/f80a71c6b643b63e94363998844a7280d3a43bea). To extract the plain text from the HTML, I've been using a Python script I wrote that uses the open source article scraping library newspaper.
@cctoombs @twood02
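For reference, here is a minimal sketch of how the scrape-to-.arff step could look. It is not the script from the commit above. The function names, the relation name `news_articles`, and the class labels `women` / `not_women` are all placeholders I made up for illustration. It assumes one string attribute for the article text plus one nominal class attribute, and backslash-escaped single quotes in the data rows.

```python
def extract_text(url):
    """Download one article and return its plain text (needs network access)."""
    # Imported lazily so the ARFF helpers below work without the library installed.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return article.text


def quote_arff(text):
    """Collapse whitespace and single-quote a string for an ARFF data row."""
    flat = " ".join(text.split())  # newlines and tabs become single spaces
    escaped = flat.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_arff(examples, relation="news_articles", labels=("women", "not_women")):
    """Build ARFF text from (article_text, label) pairs.

    The relation name and labels are hypothetical placeholders.
    """
    lines = [
        "@relation " + relation,
        "",
        "@attribute text string",
        "@attribute class {" + ",".join(labels) + "}",
        "",
        "@data",
    ]
    for text, label in examples:
        if label not in labels:
            raise ValueError("unknown label: " + label)
        lines.append(quote_arff(text) + "," + label)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        ("First article body...", "women"),
        ("Second article body...", "not_women"),
    ]
    print(build_arff(sample))
```

With the text stored as a single string attribute, Weka's StringToWordVector filter is probably the step that would turn it into word features before training a classifier.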