osome-iu / hoaxy-backend

Backend component for Hoaxy, a tool to visualize the spread of claims and fact checking
http://hoaxy.iuni.iu.edu/
GNU General Public License v3.0

(Dynamic) updates of source list #5

Closed glciampaglia closed 3 years ago

glciampaglia commented 6 years ago

Background. So far Hoaxy has been collecting tweets that include links to a pre-defined list of source domains. To do so, each domain is passed as a keyword to the POST statuses/filter endpoint of the Twitter streaming API.

Problem. This list can only be updated manually, and it does not account for the fact that some domains may stop generating traffic (e.g., go offline), or that one may want to prioritize domains that a) appear in multiple lists and b) generate more traffic.

Solution. To overcome these limitations, a new cron job will be added that estimates the volume of tweets for each domain, using a call to the search API. We will also add a table that keeps track of multiple lists of websites, in order to get an idea of how much consensus there is about individual source domains. Finally, another cron job will select only the sources that meet a minimum consensus threshold, rank them by estimated traffic, and update the tweet collection filter accordingly.
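The selection step described above could look roughly like the following sketch. All names, thresholds, and the cap value are illustrative assumptions, not the actual implementation:

```python
from collections import Counter

def select_sources(domain_lists, est_daily_tweets,
                   min_consensus=2, daily_cap=3_000_000):
    """Pick domains appearing in at least `min_consensus` lists,
    ranked by estimated daily traffic, keeping the cumulative
    estimate under the streaming-cap budget."""
    # Consensus: how many lists each domain appears in.
    counts = Counter(d for lst in domain_lists for d in set(lst))
    candidates = [d for d, c in counts.items() if c >= min_consensus]
    # Rank by estimated traffic, highest first.
    candidates.sort(key=lambda d: est_daily_tweets.get(d, 0), reverse=True)
    selected, total = [], 0
    for d in candidates:
        t = est_daily_tweets.get(d, 0)
        if total + t > daily_cap:
            break
        selected.append(d)
        total += t
    return selected
```

For example, with three lists and a tight budget, only the highest-traffic consensus domain that fits under the cap is kept.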

Tasks. The following tasks are needed:

filmenczer commented 6 years ago

@shaochengcheng any updates?

filmenczer commented 6 years ago

Also consider list from Veracity.ai?

glciampaglia commented 6 years ago

We discussed the issue again.

We agreed to use a threshold of two lists as the consensus.

We also discussed the problem with the cap limit. Currently we collect between 100K and 300K tweets every day (see query below). If we increase the number of sources, we need to be careful not to hit the 1% cap of the streaming API, which is currently about 3M tweets a day. So, on top of ranking the sources with minimum consensus, we need to estimate the traffic produced by the websites. To do so, we will update the script that Chengcheng has been running for statistical purposes so that it estimates the traffic rate of each source; Chengcheng will also start saving the timestamps of the tweets.

hoaxy=> select count(id), date(created_at) from tweet group by date(created_at) order by date(created_at) desc limit 20;
 count  |    date    
--------+------------
  60929 | 2018-07-05
 121934 | 2018-07-04
 138735 | 2018-07-03
 159723 | 2018-07-02
 110107 | 2018-07-01
 113922 | 2018-06-30
 148491 | 2018-06-29
 156383 | 2018-06-28
 187560 | 2018-06-27
 164116 | 2018-06-26
 175547 | 2018-06-25
 164754 | 2018-06-24
 164277 | 2018-06-23
 271310 | 2018-06-22
 296879 | 2018-06-21
 210968 | 2018-06-20
 215753 | 2018-06-19
 162942 | 2018-06-18
 112579 | 2018-06-17
 114218 | 2018-06-16
(20 rows)
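Once per-tweet timestamps are saved, estimating a per-source traffic rate is straightforward. A minimal sketch, assuming timestamps are grouped by domain in memory (the function name and window size are hypothetical):

```python
from datetime import datetime

def daily_rate(timestamps_by_domain, window_days=7):
    """Estimate tweets/day per domain from saved tweet timestamps,
    counting only those within the most recent `window_days` window."""
    rates = {}
    for domain, stamps in timestamps_by_domain.items():
        if not stamps:
            rates[domain] = 0.0
            continue
        latest = max(stamps)
        cutoff = latest.timestamp() - window_days * 86400
        recent = [t for t in stamps if t.timestamp() >= cutoff]
        rates[domain] = len(recent) / window_days
    return rates
```

The resulting rates could feed directly into the ranking step, keeping the cumulative estimate safely under the 1% streaming cap.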

ZacMilano commented 5 years ago

We will have to revisit the policy in light of the fact that we do not currently have a well-maintained list of claim sources. @glciampaglia will look for a possible source of sources from a Knight Prototype group (veracity.ai).

yang3kc commented 5 years ago

Possible source of misinformation sites:

glciampaglia commented 5 years ago

I was given access to veracity.ai and can share API keys if needed.

On Thu, Dec 6, 2018, 12:37 Kaicheng (Kevin) Yang <notifications@github.com> wrote:

Possible source of misinformation sites:

  • claim review database built by Zoher.
  • from first draft, https://www.newstracker.org (need to access through slack)
  • reproduce the snowballing method from first draft using our own Twitter data
  • contact veracity.ai or ask Knight Foundation to mediate


filmenczer commented 5 years ago

@benabus will follow up with Cameron (cameron_hickey@...edu) about getting access to the list of sources on newstracker.org

Also we can use the crowdtangle API for engagement (need access via First Draft)

filmenczer commented 5 years ago

See dump of one month of shared URLs from NewsTracker; can we use this kind of data to get a super-list of domains?
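Turning a raw dump of shared URLs into a candidate domain super-list mostly amounts to extracting and counting hostnames. A minimal sketch (the function name is hypothetical):

```python
from collections import Counter
from urllib.parse import urlparse

def domain_counts(urls):
    """Aggregate a dump of shared URLs into a list of
    (domain, share_count) pairs, most shared first."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # normalize www. prefix
        if host:
            counts[host] += 1
    return counts.most_common()
```

The resulting counts would still need to be pruned against the credibility criteria before entering the tracked-source list.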

filmenczer commented 5 years ago

Cameron will provide URL for access to list of domains from NewsTracker. @chathuriw will be point of contact.

filmenczer commented 5 years ago

Updates:

Based on this, I think we should proceed to build the pipeline assuming that we will have a long list that can be updated periodically, and then must be pruned based on our original criteria: sites that are reliably labeled as low-credibility and that generate a lot of Twitter traffic.

filmenczer commented 5 years ago

Let us also look at the UnNews index from Poynter's IFCN: https://www.poynter.org/ifcn/unreliable-news-index/

Note: this index was later retracted.

filmenczer commented 5 years ago

Need more discussion of our options, such as using veracity.ai or implementing our own 'crawler' similar to newstracker...

One element to consider is that using a third-party source is preferable to maintaining our own list.

We may also want to change the software so that some source types (fact-checking) are read-only. For example, users could only select fact-checking sources vetted by Poynter's IFCN.

filmenczer commented 5 years ago

Fil spoke with German Marshall Fund; they may be interested in supporting a research project on this.

filmenczer commented 4 years ago

In the long term we could explore various approaches, including:

In the short term, we will make the "Twitter" button the default on the front end.

filmenczer commented 4 years ago

Update:

In the short term, @yangkcatiu will update the list manually on a one-time basis from recent literature (see #54)

In the longer term, @btrantruong will look at automating the process by scanning the decahose with botometer-lite to detect suspicious sources

filmenczer commented 4 years ago

We should explore the possibility of licensing the list from https://www.newsguardtech.com/ and automatically update our list based on theirs.

And/or https://iffy.news/index/ (free, only includes low-rated sources)

filmenczer commented 3 years ago

@yangkcatiu will explore the possible use of iffy.news index.

filmenczer commented 3 years ago

We decided to use the new iffy+ index that will be maintained by Barrett Golding at iffy.news. It will be available on a Google Sheet that we can access via JSON or CSV. It will be updated every few months. How the list is constructed will be clearly documented on the iffy.news website.

We will develop a monthly cron job that gets the iffy+ list and updates our Hoaxy list.

We will update the FAQ to document the source of the list and give credit to Iffy.news.
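The monthly job fetching the iffy+ sheet could be as simple as downloading the CSV export and pulling out the domain column. A sketch under stated assumptions: the sheet ID, export URL, and column name below are placeholders, not the real ones:

```python
import csv
import io
import urllib.request

# Hypothetical CSV export URL for the shared Google Sheet.
IFFY_CSV_URL = "https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv"

def parse_iffy_domains(csv_text, column="Domain"):
    """Extract and normalize the domain column from the iffy+ CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip().lower() for row in reader if row.get(column)]

def fetch_iffy_domains(url=IFFY_CSV_URL, column="Domain"):
    """Download the iffy+ sheet as CSV and return its domains."""
    with urllib.request.urlopen(url) as resp:
        return parse_iffy_domains(resp.read().decode("utf-8"), column)
```

The cron job would then diff the fetched domains against the current Hoaxy source list and apply the update.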

chathuriw commented 3 years ago

@yangkcatiu Does iffy.news only contain claim sites? Or does it contain fact-checking sites as well?

filmenczer commented 3 years ago

Only low-credibility ('claim') sources. We can keep the current fact-checking sources. (Maybe we should expand it?)

yang3kc commented 3 years ago

Considering that there are not many fact-checking sources, maybe we can expand the list too.

How was it done last time?

glciampaglia commented 3 years ago

Some good lists:

  • https://ifcncodeofprinciples.poynter.org/signatories
  • https://reporterslab.org/fact-checking/

These include non-US, non-English sources too, so they need to be filtered.

yang3kc commented 3 years ago

A tentative Iffy+ list is available on Google sheet. I just shared it with @chathuriw . Let me know if anyone else needs access.

filmenczer commented 3 years ago

The hoaxy site --load-domain... command now allows updating the source list(s) (see #55). Each instance of Hoaxy may use this to update the sources that are tracked. Note that after running this command, the Twitter stream needs to be restarted.

For the IU instance of Hoaxy, we will do this periodically based on a third-party list, such as iffy+.