Closed glciampaglia closed 3 years ago
@shaochengcheng any updates?
Also consider list from Veracity.ai?
We discussed the issue again.
We agreed to use a threshold of two lists as the consensus.
We also discussed the problem with the cap limit. Currently we collect between 100K and 300K tweets every day (see query below). If we increase the number of sources, we need to be careful not to hit the 1% cap of the streaming API, which is currently about 3M tweets a day. So, on top of ranking the sources by minimum consensus, we need to estimate the traffic produced by each website. To do so, we will update the script that Chengcheng has been running for statistical purposes so that it estimates the traffic rate of each source; Chengcheng will start saving the timestamps of the tweets as well.
```
hoaxy=> select count(id), date(created_at) from tweet group by date(created_at) order by date(created_at) desc limit 20;
 count  |    date
--------+------------
  60929 | 2018-07-05
 121934 | 2018-07-04
 138735 | 2018-07-03
 159723 | 2018-07-02
 110107 | 2018-07-01
 113922 | 2018-06-30
 148491 | 2018-06-29
 156383 | 2018-06-28
 187560 | 2018-06-27
 164116 | 2018-06-26
 175547 | 2018-06-25
 164754 | 2018-06-24
 164277 | 2018-06-23
 271310 | 2018-06-22
 296879 | 2018-06-21
 210968 | 2018-06-20
 215753 | 2018-06-19
 162942 | 2018-06-18
 112579 | 2018-06-17
 114218 | 2018-06-16
(20 rows)
```
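To illustrate the traffic-estimation idea discussed above, here is a minimal sketch assuming we have saved (domain, timestamp) pairs for recent tweets. The function and data layout are hypothetical illustrations, not Hoaxy's actual code:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def estimate_daily_rates(records, window_days=7):
    """Estimate tweets/day per domain from (domain, timestamp) pairs,
    using only the most recent `window_days` days of data."""
    latest = max(ts for _, ts in records)
    cutoff = latest - timedelta(days=window_days)
    counts = defaultdict(int)
    for domain, ts in records:
        if ts > cutoff:
            counts[domain] += 1
    # Divide by the window length to get an average daily rate
    return {domain: n / window_days for domain, n in counts.items()}
```

Summing these per-domain rates would let us check whether a candidate source list stays under the ~3M tweets/day cap before updating the filter.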
We will have to revisit the policy in light of the fact that we do not currently have a well-maintained list of claim sources. @glciampaglia will look for a possible source of sources from a Knight Prototype group (veracity.ai).
I was given access to veracity.ai and can share API keys if needed.
On Thu, Dec 6, 2018, Kaicheng (Kevin) Yang wrote:
Possible source of misinformation sites:
- claim review database built by Zoher
- from First Draft: https://www.newstracker.org (need to access through Slack)
- reproduce the snowballing method from First Draft using our own Twitter data
- contact veracity.ai or ask the Knight Foundation to mediate
@benabus will follow up with Cameron (cameron_hickey@...edu) about getting access to the list of sources on newstracker.org
Also, we can use the CrowdTangle API for engagement data (access needed via First Draft)
See dump of one month of shared URLs from NewsTracker; can we use this kind of data to get a super-list of domains?
Cameron will provide URL for access to list of domains from NewsTracker. @chathuriw will be point of contact.
Updates:
We got the NewsTracker API link. The JSON endpoint returns the top 500 domains updated in the last month, ordered by when they were last updated. The data also includes the date each domain was added to NewsTracker, as well as the domain’s registration date when available.
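A minimal sketch of consuming that endpoint's response is below. The field names (`domain`, `date_added`, `registration_date`) are assumptions based on the description above, not the documented NewsTracker schema, so they should be checked against an actual payload:

```python
import json

def parse_newstracker(payload):
    """Parse the top-domains JSON into (domain, date_added,
    registration_date) tuples, preserving the endpoint's ordering.
    Field names here are guesses; adjust to the real payload."""
    return [(entry["domain"],
             entry.get("date_added"),          # date added to NewsTracker
             entry.get("registration_date"))   # WHOIS registration, if any
            for entry in json.loads(payload)]
```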
We heard back from Danny Rogers of Veracity.ai/Global Disinformation Index about their list of junk news domains. The current database is the initial seed list of domains from PolitiFact, Le Monde, Opensources.co, Storyful, NewsTracker, and one or two others. "Soon" (in the next few months) they plan to (1) provide more granular tags/descriptors and (2) add 10-20k new domains to be triaged through various classifiers they're building.
Based on this, I think we should proceed to build the pipeline assuming that we will have a long list that can be updated periodically, and then must be pruned based on our original criteria: sites that are reliably labeled as low-credibility and that generate a lot of Twitter traffic.
Let us also look at the UnNews index from Poynter's IFCN: https://www.poynter.org/ifcn/unreliable-news-index/
This was later retracted
Need more discussion of our options, such as using veracity.ai or implementing our own 'crawler' similar to newstracker...
One element to consider: using a third-party source is preferable.
We may also want to change the software so that some source types (fact-checking) are read-only; for example, users could only select fact-checking sources vetted by Poynter's IFCN.
Fil spoke with German Marshall Fund; they may be interested in supporting a research project on this.
In the long term, we could explore various approaches.
In the short term, we will make the "Twitter" button the default on the front end.
Update:
In the short term, @yangkcatiu will update the list manually on a one-time basis from recent literature (see #54)
In the longer term, @btrantruong will look at automating the process by scanning the Decahose with BotometerLite to detect suspicious sources
We should explore the possibility of licensing the list from https://www.newsguardtech.com/ and automatically update our list based on theirs.
And/or https://iffy.news/index/ (free, only includes low-rated sources)
@yangkcatiu will explore the possible use of iffy.news index.
We decided to use the new iffy+ index that will be maintained by Barrett Golding at iffy.news. It will be available on a Google Sheet that we can access via JSON or CSV. It will be updated every few months. How the list is constructed will be clearly documented on the iffy.news website.
We will develop a monthly cron job that gets the iffy+ list and updates our Hoaxy list.
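The update step of that cron job could look like the sketch below. The sheet URL placeholder and the `Domain` column name are assumptions about the Iffy+ sheet layout, not its documented format:

```python
import csv
import io

# Placeholder: fill in the real Iffy+ sheet ID once shared.
IFFY_CSV_URL = "https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv"

def parse_iffy_domains(csv_text, column="Domain"):
    """Extract a normalized set of domains from the CSV export.
    The column name is a guess at the sheet's layout."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip().lower() for row in reader if row.get(column)}

def diff_domains(current, incoming):
    """Return (to_add, to_remove) relative to the current Hoaxy list."""
    return incoming - current, current - incoming
```

After computing the diff, the job would apply it to the Hoaxy source table and restart the Twitter stream so the new filter takes effect.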
We will update the FAQ to document the source of the list and give credit to Iffy.news.
@yangkcatiu Does iffy.news only contain claim sites? Or does it contain fact-checking sites as well?
Only low-credibility ('claim') sources. We can keep the current fact-checking sources. (Maybe we should expand it?)
Considering that there are not many fact-checking sources, maybe we can expand the list too.
How was it done last time?
Some good lists:
- https://ifcncodeofprinciples.poynter.org/signatories
- https://reporterslab.org/fact-checking/
These include non-US, non-English sources too, so they need to be filtered.
A tentative Iffy+ list is available on Google sheet. I just shared it with @chathuriw . Let me know if anyone else needs access.
The hoaxy site --load-domain... command now allows updating the source list(s) (see #55). Each instance of Hoaxy may use this to update the sources that are tracked. Note that after running this command, the Twitter stream needs to be restarted.
For the IU instance of Hoaxy, we will do this periodically based on a third-party list, such as iffy+.
Background. So far Hoaxy has been collecting tweets that include links to a pre-defined list of source domains. To do so, each domain is included as a keyword for the POST statuses/filter endpoint of the Twitter streaming API.
Problem. This list can only be updated manually, and it does not take into account that not all domains may generate traffic (domains may go offline), and that one may want to prioritize domains that a) appear in multiple lists and b) generate more traffic.
Solution. To overcome these limitations, a new cron job will be added that estimates the number of tweets for each domain, using a call to the search API. We will also add a table that keeps track of multiple lists of websites, in order to get an idea of how much consensus there is about individual source domains. Finally, another cron job will select only the domains that meet a minimum consensus across lists, rank them by estimated traffic, and update the tweet collection filter accordingly.
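The selection step described above (minimum consensus, then rank by traffic, subject to the streaming cap) can be sketched as follows. This is an illustration of the logic, not the actual implementation; the greedy cap check is one possible policy:

```python
def select_sources(list_counts, traffic, min_consensus=2, daily_cap=3_000_000):
    """Pick domains appearing on at least `min_consensus` lists, rank
    them by estimated daily traffic, and greedily fill up to the cap.

    list_counts: {domain: number of lists it appears on}
    traffic:     {domain: estimated tweets/day}
    """
    eligible = [d for d, n in list_counts.items() if n >= min_consensus]
    eligible.sort(key=lambda d: traffic.get(d, 0), reverse=True)
    selected, total = [], 0
    for domain in eligible:
        rate = traffic.get(domain, 0)
        if total + rate > daily_cap:
            continue  # skip domains that would push us over the 1% cap
        selected.append(domain)
        total += rate
    return selected
```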
Tasks. The following tasks are needed: