osome-iu / hoaxy-backend

Backend component for Hoaxy, a tool to visualize the spread of claims and fact checking
http://hoaxy.iuni.iu.edu/
GNU General Public License v3.0

(Dynamic) updates of source list #5

Closed glciampaglia closed 3 years ago

glciampaglia commented 6 years ago

Background. So far Hoaxy has been collecting tweets that include links to a pre-defined list of source domains. To do so, each domain is passed as a keyword to the POST statuses/filter endpoint of the Twitter streaming API.

Problem. This list can only be updated manually, and it does not account for the fact that some domains may stop generating traffic (e.g., go offline), or that one may want to prioritize domains that a) appear in multiple lists and b) generate more traffic.

Solution. To overcome these limitations, a new cron job will be added that estimates the volume of tweets for each domain, using a call to the search API. We will also add a table that keeps track of multiple lists of websites, in order to get an idea of how much consensus there is about individual source domains. Finally, another cron job will select only the sources that meet a minimum consensus threshold, rank them by estimated traffic, and update the tweet collection filter accordingly.
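The selection step described above could look roughly like the following sketch. All names, thresholds, and the cap value are illustrative assumptions, not the actual implementation:

```python
from collections import Counter

def select_sources(domain_lists, est_daily_tweets,
                   min_consensus=2, daily_cap=3_000_000):
    """Pick domains appearing in at least `min_consensus` lists,
    ranked by estimated daily traffic, keeping the cumulative
    estimate under the streaming-cap budget."""
    # Consensus: how many lists each domain appears in.
    counts = Counter(d for lst in domain_lists for d in set(lst))
    candidates = [d for d, c in counts.items() if c >= min_consensus]
    # Rank by estimated traffic, highest first.
    candidates.sort(key=lambda d: est_daily_tweets.get(d, 0), reverse=True)
    selected, total = [], 0
    for d in candidates:
        t = est_daily_tweets.get(d, 0)
        if total + t > daily_cap:
            break
        selected.append(d)
        total += t
    return selected
```

For example, with three lists and a tight budget, only the highest-traffic consensus domain that fits under the cap is kept.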

Tasks. The following tasks are needed:

filmenczer commented 6 years ago

@shaochengcheng any updates?

filmenczer commented 6 years ago

Also consider list from Veracity.ai?

glciampaglia commented 6 years ago

We discussed the issue again.

We agreed to use a threshold of two lists as the consensus.

We also discussed the problem with the cap limit. Currently we collect between 100K and 300K tweets every day (see query below). If we increase the number of sources, we need to be careful not to hit the 1% cap of the streaming API, which is currently about 3M tweets a day. So, on top of ranking the sources with minimum consensus, we need to estimate the traffic produced by the websites. To do so, we will update the script that Chengcheng has been running for statistical purposes so that it estimates the traffic rate of each source; Chengcheng will also start saving the timestamps of the tweets.

hoaxy=> select count(id), date(created_at) from tweet group by date(created_at) order by date(created_at) desc limit 20;
 count  |    date    
--------+------------
  60929 | 2018-07-05
 121934 | 2018-07-04
 138735 | 2018-07-03
 159723 | 2018-07-02
 110107 | 2018-07-01
 113922 | 2018-06-30
 148491 | 2018-06-29
 156383 | 2018-06-28
 187560 | 2018-06-27
 164116 | 2018-06-26
 175547 | 2018-06-25
 164754 | 2018-06-24
 164277 | 2018-06-23
 271310 | 2018-06-22
 296879 | 2018-06-21
 210968 | 2018-06-20
 215753 | 2018-06-19
 162942 | 2018-06-18
 112579 | 2018-06-17
 114218 | 2018-06-16
(20 rows)
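Once per-tweet timestamps are saved, estimating a per-source traffic rate is straightforward. A minimal sketch, assuming timestamps are grouped by domain in memory (the function name and window size are hypothetical):

```python
from datetime import datetime

def daily_rate(timestamps_by_domain, window_days=7):
    """Estimate tweets/day per domain from saved tweet timestamps,
    counting only those within the most recent `window_days` window."""
    rates = {}
    for domain, stamps in timestamps_by_domain.items():
        if not stamps:
            rates[domain] = 0.0
            continue
        latest = max(stamps)
        cutoff = latest.timestamp() - window_days * 86400
        recent = [t for t in stamps if t.timestamp() >= cutoff]
        rates[domain] = len(recent) / window_days
    return rates
```

The resulting rates could feed directly into the ranking step, keeping the cumulative estimate safely under the 1% streaming cap.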

ZacMilano commented 5 years ago

We will have to revisit the policy in light of the fact that we do not currently have a well-maintained list of claim sources. @glciampaglia will look for a possible source of sources from a Knight Prototype group (veracity.ai).

yang3kc commented 5 years ago

Possible source of misinformation sites:

glciampaglia commented 5 years ago

I was given access to veracity.ai and can share API keys if needed.

On Thu, Dec 6, 2018, 12:37 Kaicheng (Kevin) Yang <notifications@github.com> wrote:

Possible source of misinformation sites:

  • claim review database built by Zoher.
  • from first draft, https://www.newstracker.org (need to access through slack)
  • reproduce the snowballing method from first draft using our own Twitter data
  • contact veracity.ai or ask Knight Foundation to mediate


filmenczer commented 5 years ago

@benabus will follow up with Cameron (cameron_hickey@...edu) about getting access to the list of sources on newstracker.org

Also we can use the crowdtangle API for engagement (need access via First Draft)

filmenczer commented 5 years ago

See dump of one month of shared URLs from NewsTracker; can we use this kind of data to get a super-list of domains?
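Turning a raw dump of shared URLs into a candidate domain super-list mostly amounts to extracting and counting hostnames. A minimal sketch (the function name is hypothetical):

```python
from collections import Counter
from urllib.parse import urlparse

def domain_counts(urls):
    """Aggregate a dump of shared URLs into a list of
    (domain, share_count) pairs, most shared first."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # normalize www. prefix
        if host:
            counts[host] += 1
    return counts.most_common()
```

The resulting counts would still need to be pruned against the credibility criteria before entering the tracked-source list.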

filmenczer commented 5 years ago

Cameron will provide URL for access to list of domains from NewsTracker. @chathuriw will be point of contact.

filmenczer commented 5 years ago

Updates:

Based on this, I think we should proceed to build the pipeline assuming that we will have a long list that can be updated periodically, and then must be pruned based on our original criteria: sites that are reliably labeled as low-credibility and that generate a lot of Twitter traffic.

filmenczer commented 5 years ago

Let us also look at the UnNews index from Poynter's IFCN: https://www.poynter.org/ifcn/unreliable-news-index/

Note: this index was later retracted.

filmenczer commented 5 years ago

Need more discussion of our options, such as using veracity.ai or implementing our own 'crawler' similar to newstracker...

One element to consider is that using a third-party source is preferable to maintaining our own list.

We may also want to change the software so that some source types (fact-checking) are read-only. For example, users could only select fact-checking sources vetted by Poynter's IFCN.

filmenczer commented 5 years ago

Fil spoke with German Marshall Fund; they may be interested in supporting a research project on this.

filmenczer commented 4 years ago

In the long term we could explore various approaches, including:

In the short term, we will make the "Twitter" button the default on the front end.

filmenczer commented 4 years ago

Update:

In the short term, @yangkcatiu will update the list manually on a one-time basis from recent literature (see #54)

In the longer term, @btrantruong will look at automating the process by scanning the decahose with botometer-lite to detect suspicious sources

filmenczer commented 4 years ago

We should explore the possibility of licensing the list from https://www.newsguardtech.com/ and automatically update our list based on theirs.

And/or https://iffy.news/index/ (free, only includes low-rated sources)

filmenczer commented 3 years ago

@yangkcatiu will explore the possible use of iffy.news index.

filmenczer commented 3 years ago

We decided to use the new iffy+ index that will be maintained by Barrett Golding at iffy.news. It will be available on a Google Sheet that we can access via JSON or CSV. It will be updated every few months. How the list is constructed will be clearly documented on the iffy.news website.

We will develop a monthly cron job that gets the iffy+ list and updates our Hoaxy list.

We will update the FAQ to document the source of the list and give credit to Iffy.news.
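The monthly job fetching the iffy+ sheet could be as simple as downloading the CSV export and pulling out the domain column. A sketch under stated assumptions: the sheet ID, export URL, and column name below are placeholders, not the real ones:

```python
import csv
import io
import urllib.request

# Hypothetical CSV export URL for the shared Google Sheet.
IFFY_CSV_URL = "https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv"

def parse_iffy_domains(csv_text, column="Domain"):
    """Extract and normalize the domain column from the iffy+ CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip().lower() for row in reader if row.get(column)]

def fetch_iffy_domains(url=IFFY_CSV_URL, column="Domain"):
    """Download the iffy+ sheet as CSV and return its domains."""
    with urllib.request.urlopen(url) as resp:
        return parse_iffy_domains(resp.read().decode("utf-8"), column)
```

The cron job would then diff the fetched domains against the current Hoaxy source list and apply the update.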

chathuriw commented 3 years ago

@yangkcatiu Does iffy.news only contain claim sites? Or does it contain fact-checking sites as well?

filmenczer commented 3 years ago

Only low-credibility ('claim') sources. We can keep the current fact-checking sources. (Maybe we should expand it?)

yang3kc commented 3 years ago

Considering that there are not many fact-checking sources, maybe we can expand the list too.

How was it done last time?

glciampaglia commented 3 years ago

Some good lists:

  • https://ifcncodeofprinciples.poynter.org/signatories
  • https://reporterslab.org/fact-checking/

These include non-US, non-English sources too, so they need to be filtered.

yang3kc commented 3 years ago

A tentative Iffy+ list is available on Google sheet. I just shared it with @chathuriw . Let me know if anyone else needs access.

filmenczer commented 3 years ago

The hoaxy site --load-domain... command now allows updating the source list(s) (see #55). Each instance of Hoaxy may use this to update the sources that are tracked. Note that after running this command, the Twitter stream needs to be restarted.

For the IU instance of Hoaxy, we will do this periodically based on a third-party list, such as iffy+.