medianeuroscience / icore

This project introduces the interface for Communication Research (iCoRe) to access, explore, and analyze the Global Database of Events, Language and Tone (GDELT; Leetaru & Schrodt, 2013). GDELT provides a vast, open source, and constantly updated repository of online news and event metadata collected from tens of thousands of news outlets around the world. Despite GDELT’s promise for advancing communication science, its massive scale and complex data structures have hindered communication scholars' efforts to access and analyze it. We thus developed iCoRe, an easy-to-use web interface that (a) provides fast access to the data available in GDELT, (b) shapes and processes GDELT for theory-driven applications within communication research, and (c) enables replicability through transparent query and analysis protocols.
https://icore.medianeuroscience.org

Whitelist documentation and missing news sites #31

Open profkenm opened 4 years ago

profkenm commented 4 years ago

Hello and thanks for Icore!

I have a couple of comments about the whitelisted GKG sites on Icore.

First, the whitelist documentation is mentioned on the Icore site but I can't find the actual documentation anywhere. This would be necessary for researchers to scrutinize how, and preferably which, GKG sites were blacklisted.

Second (maybe related, maybe not), after a few general searches, I noticed that some news sites I expected to find in the U.S. (such as ABC News and the Wall Street Journal) are not present, but the fake news site infowars.com is there, and so is the controversial (maybe fake?) site zerohedge.com, as are controversial sites such as breitbart and bizpacreview.com.

For the purposes of research, all of them should be included. (Indeed, as you know, some of the first published political communication research that uses GDELT deals with fake news.) Might the omitted mainstream news sites have been accidentally blacklisted by Icore or, rather, did GDELT fail to scrape them? How would we know?

musainayatmalik commented 4 years ago

Hi Ken,

Thanks for using icore and for your feedback!

We provide a short description of our source selection logic at http://icore.mnl.ucsb.edu/whitelist. These sites are also uniquely associated with a political bias score (controversial as that idea is itself and biased as this particular metric may be) derived from Media Bias Monitor. There is also a whitelist table at the bottom that you can type a source name into to check whether icore whitelists it.

Good point on ABC News. I checked April 2020 and it does seem that 'abcnews.go.com' is not being included. This is worth fixing. Thanks for pointing it out!

On that note, we do plan on extending our source whitelist beyond what we have currently. This updated, larger source list will be implemented in our next round of updates. Keeping in mind the scale of global news data, what metric would you suggest as a user and researcher to determine whether a news source is significant enough to be included in our database? Are there any relevant databases you know of that we could tap into for this purpose?

Thanks again for your feedback!

profkenm commented 4 years ago

Thank you, Musa. Please allow me to follow up on your comments. I am delighted to hear that you plan to revisit the whitelist algorithm. The summary of the whitelist procedure and validation says that it weeds out 'insignificant' sites: those that are too small or not engaged in 'news reporting.' The description says the algorithm has been validated, and that is good to hear, but unless I misunderstood you, it sounds like abcnews.go.com and wsj.com (and who knows what else) were accidentally blacklisted by the whitelist algorithm?(!!)

If so, it would be helpful, and for some research probably essential, that researchers have access to the validation and procedure, and even the blacklist.

You asked what I would suggest regarding fine-tuning the whitelist procedure. This is a challenging question. Is it necessary to blacklist any sites on the GKG? It is a question that has both substantive (what is news?) and practical (thousands of irrelevant hits?) implications. Does "small" include local news? If so, local news might be of interest to researchers. Does it omit fake news? Hopefully not. Two of the very few mass comm studies that use GDELT data (Vargo, Guo, and Amazeen 2018 and Guo and Vargo 2020) study fake news content.

Unfortunately, I am not aware of any databases per se that could help with this, nor of any research (though perhaps there is some) on how to find the news needle in a big data haystack (like the GKG). Addressing that question with GDELT would make for an interesting mass comm methods paper/article. That is, how do we weed out non-news sites from news sites while still casting a wide net? At some point soon, if not now, that research will be necessary.

Thank you again for your hard work on this. And thank you for Icore.

-- Ken Mulligan, Associate Professor Director of Undergraduate Studies Department of Political Science Southern Illinois University 3144 Faner Hall Carbondale, IL 62901

fhopp commented 4 years ago

Hey @profkenm ,

great points, thanks for highlighting that. A couple of notes on the whitelist and why we choose to whitelist.

First, GDELT monitors tens of thousands of sources, many of which do not have any 'news' content (e.g., cars.com). Ingesting all of these sites would result in the same 'big-data' problem that icore tries to mitigate.

The idea is to include a broad selection of news sources per country. As a metric for inclusion, we mainly focus on online reach. The fact that certain sites are not part of icore likely means that they are not included in GDELT, but we will spot check the cases you outlined just to be sure. The absence of WSJ is likely due to WSJ's own paywall.
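
A spot check of this kind could be sketched as a simple set comparison. This is only a minimal illustration, not icore's actual code: it assumes you already have a set of domains observed in GDELT and a set of whitelisted domains, and the function names and the `.example` domains below are hypothetical placeholders.

```python
def normalize(domain: str) -> str:
    """Lowercase a domain, trim whitespace, and drop a leading 'www.'."""
    d = domain.strip().lower()
    return d[4:] if d.startswith("www.") else d


def classify_source(domain, gdelt_domains, whitelist):
    """Say why a site is (or is not) available, given two domain collections.

    gdelt_domains and whitelist are hypothetical inputs: any iterables of
    domain strings the researcher has assembled.
    """
    d = normalize(domain)
    gdelt = {normalize(x) for x in gdelt_domains}
    white = {normalize(x) for x in whitelist}
    if d in white:
        return "whitelisted"
    if d in gdelt:
        return "in GDELT, not whitelisted"
    return "not in GDELT"


# Toy data for illustration only; these are not real GDELT or icore contents.
gdelt_seen = {"news-a.example", "news-b.example", "cars.example"}
whitelisted = {"news-b.example"}

for site in ["news-a.example", "www.News-B.example", "news-c.example"]:
    print(site, "->", classify_source(site, gdelt_seen, whitelisted))
```

The three outcomes separate the two explanations raised above: a site that GDELT never scraped versus a site that GDELT has but the whitelist leaves out.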

Furthermore, icore is built to address specific research questions that are geared either towards particular countries, regions, or news sites. Indeed, obtaining 'representative' news coverage of a particular country is challenging, but including all sources of a country is not only infeasible from a computational point of view, but likely also not needed (check traditional news research that usually just contrasts ~10 sources at a time). In fact, IMHO, the ~800 sources that icore currently includes provide more than ample opportunity to study news from various perspectives (see our case studies in CCR). If I understand your concerns correctly, you would like to have a 'representative' whitelist for particular countries? This is a great, but labor-intensive idea; perhaps to facilitate this process, we can make the whitelist available as CSV along with metadata (e.g., bias). Would that be helpful to your research? If you told us a bit more about the specific questions you seek to address with icore we could probably facilitate knowledge/features that help you address these questions.
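
If the whitelist were published as a CSV with metadata, working with it could look roughly like the sketch below. The column names (`domain`, `country`, `bias`), the sample rows, and the helper functions are all assumptions made for illustration; the actual file layout has not been decided.

```python
import csv
import io

# Hypothetical sample of what a whitelist CSV might contain.
SAMPLE_CSV = """domain,country,bias
news-a.example,US,-0.4
news-b.example,US,0.7
news-c.example,DE,0.1
"""


def load_whitelist(csv_file):
    """Read whitelist rows into a dict keyed by lowercase domain."""
    reader = csv.DictReader(csv_file)
    return {row["domain"].strip().lower(): row for row in reader}


def sources_for_country(whitelist, country):
    """Return the sorted whitelisted domains for one country code."""
    return sorted(d for d, row in whitelist.items() if row["country"] == country)


whitelist = load_whitelist(io.StringIO(SAMPLE_CSV))

# Check whether a source is whitelisted and inspect its metadata.
print("news-a.example" in whitelist)
print(sources_for_country(whitelist, "US"))
print(float(whitelist["news-b.example"]["bias"]))
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle, and researchers could document exactly which sources entered their sample.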

Again, thank you so much for your helpful feedback and comments.

profkenm commented 4 years ago

Thank you for your detailed and helpful explanation, Frederic. Icore's approach to whitelisting does seem reasonable given its goals. And very useful for most research on most news media. On second thought, I think the gist of my concerns earlier was misguided, and perhaps useful mainly for researchers who want either a preselected (country based) representative sample of news sources, or the population of all news sources, by country, (including fake news), which would not fit Icore's mission of making the GDELT data fire hose useful for non-programmers.
