tattle-made / DAU

MCA Tipline for Deepfakes
GNU General Public License v3.0

Ingest Fact Check Articles and their media #19

Closed dennyabrain closed 4 months ago

dennyabrain commented 5 months ago

We have to figure out the best way to add data related to fact check articles to feluda. The relevant data to add is:

  1. Fact check articles and associated metadata - publication name, author, date of publication, etc.
  2. Actual media items that must match for us to surface the fact check articles.

Ingestion can be a mix of automated and manual steps. One way of doing (1), suggested by the fact checkers, is to use their RSS feeds to get this data. They are already tagging articles relevant to this project as "deepfake". Two such feeds are https://www.boomlive.in/feeds/tags/deepfake/feeds.xml and https://factly.in/category/deepfake/feed/

For (2), we could have them add the media items to a shared Google Drive, following an agreed convention, and then pick them up from there. We will also have to ensure that the mapping between the articles and media items is maintained somehow.
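One possible shape for such a convention, purely as a sketch: the fact checkers keep a small manifest file next to the media, with one row per article and media file pair. The column names `article_url` and `media_file`, the sample rows, and the helper `load_media_mapping` are all hypothetical and not something agreed in this thread.

```python
import csv
import io
from collections import defaultdict

# Hypothetical manifest kept alongside the media in the shared drive.
# One row per (article, media file) pair; column names are assumptions.
SAMPLE_MANIFEST = """article_url,media_file
https://example.org/fact-check/1,fact-check-1/video.mp4
https://example.org/fact-check/1,fact-check-1/screenshot.png
https://example.org/fact-check/2,fact-check-2/audio.mp3
"""


def load_media_mapping(manifest_text):
    """Return a dict mapping each article URL to its list of media files."""
    mapping = defaultdict(list)
    for row in csv.DictReader(io.StringIO(manifest_text)):
        mapping[row["article_url"].strip()].append(row["media_file"].strip())
    return dict(mapping)


mapping = load_media_mapping(SAMPLE_MANIFEST)
for article_url, media_files in mapping.items():
    print(article_url, media_files)
```

Keying the mapping on the article URL means it could be joined with the `link` field from the RSS feed, assuming the fact checkers paste the same URL that the feed publishes.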

The scope of this issue is to come up with a solution that is sustainable in the long run but also convenient for the fact checkers.

duggalsu commented 5 months ago

The following feed parser PoC lists all the required fields for each item - https://github.com/tattle-made/data-experiments/blob/master/feed_parser/src/main.py

required_fields = {'title', 'author', 'published', 'link', 'content', 'summary', 'id'}

The summary field corresponds to the description element in the XML.
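To illustrate the field check, here is a minimal standard-library sketch that maps a raw RSS item to the field names above and reports which ones are missing. It is not the PoC's code. The sample feed and the helpers `item_to_fields` and `missing_fields` are hypothetical, and apart from `summary` coming from `description`, the tag mapping (`guid` to `id`, `pubDate` to `published`, `dc:creator` to `author`, `content:encoded` to `content`) is an assumption about how the feeds are laid out.

```python
import xml.etree.ElementTree as ET

REQUIRED_FIELDS = {"title", "author", "published", "link", "content", "summary", "id"}

# Namespaces for the Dublin Core and RSS content modules.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Assumed mapping from our field names to RSS item tags.
FIELD_TO_TAG = {
    "title": "title",
    "author": "dc:creator",
    "published": "pubDate",
    "link": "link",
    "content": "content:encoded",
    "summary": "description",
    "id": "guid",
}

# Made-up feed: the second item lacks an author and a full content body.
SAMPLE_FEED = """<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example Publication</title>
    <item>
      <title>Example fact check</title>
      <dc:creator>Example Author</dc:creator>
      <pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate>
      <link>https://example.org/fact-check/1</link>
      <description>Short summary of the article.</description>
      <content:encoded>Full article body.</content:encoded>
      <guid>https://example.org/?p=1</guid>
    </item>
    <item>
      <title>Item without author or content</title>
      <pubDate>Tue, 02 Jan 2024 10:00:00 +0000</pubDate>
      <link>https://example.org/fact-check/2</link>
      <description>Another summary.</description>
      <guid>https://example.org/?p=2</guid>
    </item>
  </channel>
</rss>
"""


def item_to_fields(item):
    """Collect the non-empty fields of one <item> element into a dict."""
    fields = {}
    for name, tag in FIELD_TO_TAG.items():
        text = item.findtext(tag, default=None, namespaces=NS)
        if text is not None and text.strip():
            fields[name] = text.strip()
    return fields


def missing_fields(fields):
    """Return the required field names that are absent from `fields`."""
    return REQUIRED_FIELDS - set(fields)


root = ET.fromstring(SAMPLE_FEED)
items = [item_to_fields(item) for item in root.iter("item")]
for fields in items:
    print(fields.get("link"), "missing:", sorted(missing_fields(fields)))
```

Items with missing fields could be skipped or flagged for manual review. Which of the two is appropriate is left open here.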

The feed header also contains lastBuildDate in the channel field, which helps decide whether we need to fetch the entire feed again. It may also contain two further fields, updatePeriod and updateFrequency, which help decide how frequently to poll the feed. We are using the title field within channel as the publication name.
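The sketch below shows how these channel fields might be read with the standard library. It assumes that `updatePeriod` and `updateFrequency` appear under the RSS syndication module namespace (commonly prefixed `sy:`). The sample feed and the helpers `read_channel_info` and `needs_refetch` are hypothetical, and the refetch rule (fetch again only when `lastBuildDate` has advanced) is one possible policy rather than what the PoC does.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Namespace of the RSS syndication module (updatePeriod, updateFrequency).
SY = {"sy": "http://purl.org/rss/1.0/modules/syndication/"}

# Made-up channel header for illustration.
SAMPLE_FEED = """<rss version="2.0"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
  <channel>
    <title>Example Publication</title>
    <lastBuildDate>Mon, 01 Jan 2024 10:00:00 +0000</lastBuildDate>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
  </channel>
</rss>
"""


def read_channel_info(feed_xml):
    """Extract the publication name and polling hints from the channel."""
    channel = ET.fromstring(feed_xml).find("channel")
    last_build = channel.findtext("lastBuildDate")
    frequency = channel.findtext("sy:updateFrequency", namespaces=SY)
    return {
        "publication": channel.findtext("title"),
        # RSS dates use the RFC 822 format, which email.utils can parse.
        "last_build": parsedate_to_datetime(last_build) if last_build else None,
        "update_period": channel.findtext("sy:updatePeriod", namespaces=SY),
        "update_frequency": int(frequency) if frequency else None,
    }


def needs_refetch(info, previous_last_build):
    """Refetch when there is no prior build date or the feed has been rebuilt."""
    if info["last_build"] is None or previous_last_build is None:
        return True
    return info["last_build"] > previous_last_build


info = read_channel_info(SAMPLE_FEED)
print(info["publication"], info["update_period"], info["update_frequency"])
print("refetch on first run:", needs_refetch(info, None))
```

Because both optional fields may be absent, the helper returns `None` for them, so a caller would need a default polling interval of its own.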

duggalsu commented 5 months ago

The article content does not have claims review information