tattle-made / kosh-v2

Write a bot to add media from the factcheck article dataset to kosh #7

Closed dennyabrain closed 2 years ago

dennyabrain commented 2 years ago

@tarunima can you paste one item from the fact check article database here so I can discuss what the shape of the data is?

dennyabrain commented 2 years ago

@tarunima One thing I'm not sure about is how we keep track of which posts in the factcheck database have been added to Kosh. I remember seeing an onPortal boolean in the data; did you use that to ensure no post was added to Kosh twice?

tarunima commented 2 years ago

> @tarunima One thing I'm not sure about is how we keep track of which posts in the factcheck database have been added to Kosh. I remember seeing an onPortal boolean in the data; did you use that to ensure no post was added to Kosh twice?

Yes, onPortal was created for that purpose. But we probably need to reset that field if we are indexing all the data fields again?

tarunima commented 2 years ago

> @tarunima can you paste one item from the fact check article database here so I can discuss what the shape of the data is?

{
  "_id": {
    "$oid": "5db02a1888ab2b22f0a5942e"
  },
  "postID": "1a5ad9881e3e42978bde82472194f000",
  "postURL": "https://www.altnews.in/madhu-kishwar-tweets-photoshopped-image-of-amul-ad-targeting-gandhi-family/",
  "domain": "altnews.in",
  "headline": "Madhu Kishwar tweets photoshopped image of Amul Ad targeting Gandhi family",
  "date_accessed": "October 23, 2019",
  "date_updated": "September 01, 2019",
  "author": {
    "name": "Jignesh Patel",
    "link": "https://www.altnews.in/author/jignesh/"
  },
  "docs": [
    {
      "doc_id": "0719485be1454eb3bc7bbc71adfa17da",
      "postID": "1a5ad9881e3e42978bde82472194f000",
      "domain": "altnews.in",
      "origURL": "https://www.altnews.in/madhu-kishwar-tweets-photoshopped-image-of-amul-ad-targeting-gandhi-family/",
      "s3URL": null,
      "possibleLangs": [],
      "isGoodPrior": [
        {
          "$numberInt": "0"
        },
        {
          "$numberInt": "0"
        }
      ],
      "mediaType": "text",
      "content": "“बिना कुछ कहे, सब कुछ कह दिया। (Says everything without saying anything -translated)”, reads a tweet by academician and writer Madhu Purnima Kishwar with a photograph of billboard, which showed the trademark Amul girl along with cartoons of Congress leader Rahul Gandhi and Priyanka Gandhi. The billboard had an inscribed message that was targeted at the dynasty politics and allegation of corruptions by the Gandhi family. The message reads, “नाना ने खाया , दादीने खाया , पापा ने खाया , मम्मी ने खाया आओ बहना तुम भी खालो जीजू को भी यहाँ बुला लो (Grandfather ate, Grandmother ate, Father ate, Mother ate and Sister you also eat and also call brother-in-law -translation)”, reads the Hindi text in the billboard. The word ‘खाया (ate)’, contextually refers to corruption.\nPhotoshopped image\nAlt News found that the Ad banner used in the billboard is photoshopped with the Hindi text stated above. A reverse search of the image on Google reveals that there are several images of the same car and billboard with different banners, which goes to suggest that it has been photoshopped.\nMoreover, the Amul Ad banner, which comprises of the iconic Amul girl, Priyanka Gandhi and Rahul Gandhi has also been photoshopped with the Hindi text and the Amul slogan (The Taste of India). Amul had tweeted the original photo of the Amul Ad dedicated to Priyanka Gandhi’s entry into active politics before the 2019 general elections.\nBoom also spoke to daCunha, the advertisement agency behind the campaign and confirmed, “the viral photo was fake and it didn’t come from the agency.” \nIn conclusion, academician Madhu Kishwar tweeted a photoshopped image of Amul Ad, which targeted the Gandhi family portraying it as the official advertisement from the company. In the past as well, the writer and academician has been found spreading misinformation on several occasions (1, 2, 3, 4). 
Last December, Kishwar tweeted an old video of a rally with a false claim that Pakistani flags were waved by the Muslim community in celebration of the Congress’ electoral victory in three assembly elections. When it was pointed out that she had posted misinformation, she tweeted another misleading video to defend her last tweet.\nDonate Now\nEnter your email address to subscribe to Alt News and receive notifications of new posts by email.\nSend this to a friend",
      "nowDate": "October 23, 2019"
    },
    {
      "doc_id": "5eb1d53cfc7c4c8db993f9d7428a9b76",
      "postID": "1a5ad9881e3e42978bde82472194f000",
      "domain": "altnews.in",
      "origURL": "http://www.altnews.in/wp-content/uploads/2019/08/2019-08-31-22_54_36-Google-Search.png",
      "s3URL": "https://tattle-story-scraper.s3.ap-south-1.amazonaws.com/b0ec8300-24d9-4380-8420-ef936ff616e3",
      "possibleLangs": [],
      "isGoodPrior": [
        {
          "$numberInt": "0"
        },
        {
          "$numberInt": "0"
        }
      ],
      "mediaType": "image",
      "content": null,
      "nowDate": "October 23, 2019",
      "onPortal": true
    }
  ]
}
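
The sample above is a MongoDB Extended JSON export, so ids and integers arrive wrapped in `$oid` and `$numberInt` objects, and dates are plain strings. A minimal sketch of how it could be normalized before indexing, assuming only those two wrappers occur and that dates always look like "October 23, 2019" (the helper names `unwrap_extended_json` and `parse_scraper_date` are hypothetical, not part of the scraper):

```python
from datetime import date, datetime
from typing import Any


def unwrap_extended_json(value: Any) -> Any:
    """Recursively replace {"$oid": ...} and {"$numberInt": ...} wrappers with plain values."""
    if isinstance(value, dict):
        if set(value) == {"$oid"}:
            return value["$oid"]
        if set(value) == {"$numberInt"}:
            return int(value["$numberInt"])
        return {key: unwrap_extended_json(item) for key, item in value.items()}
    if isinstance(value, list):
        return [unwrap_extended_json(item) for item in value]
    return value


def parse_scraper_date(text: str) -> date:
    """Parse dates written like "October 23, 2019".

    Assumption: month names are in English; %B depends on the active locale.
    """
    return datetime.strptime(text, "%B %d, %Y").date()
```

Fields such as `date_accessed`, `date_updated` and `nowDate` could then be passed through `parse_scraper_date` if Kosh needs real dates rather than strings.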

tarunima commented 2 years ago

For the purposes of this sprint, we are simplifying this task so that we only upload and index a one-time CSV/JSON dump of the fact checking sites dataset (up to September 2021).

tarunima commented 2 years ago

Writing a cron job for frequent indexing of the mongo database is deferred.

dennyabrain commented 2 years ago

@tarunima can you finalize the fields that need to be added to kosh?

My hunch is that for every doc item, we need to store its headline, postURL and domain alongside it.

tarunima commented 2 years ago

> @tarunima can you finalize the fields that need to be added to kosh?
>
> My hunch is that for every doc item, we need to store its headline, postURL and domain alongside it.

These ones: postURL, domain, headline and date_updated.
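
A rough sketch of how one story could be flattened into per-media records carrying those four fields. It assumes that text docs are skipped because only media goes to Kosh, and that the S3 copy is preferred over the original URL; `flatten_story`, `ARTICLE_FIELDS` and the `mediaURL` key are hypothetical names used for illustration:

```python
from typing import Any, Dict, Iterator

# Article-level fields to copy onto every media item.
ARTICLE_FIELDS = ("postURL", "domain", "headline", "date_updated")


def flatten_story(story: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per media doc, with the article fields attached."""
    article = {field: story.get(field) for field in ARTICLE_FIELDS}
    for doc in story.get("docs", []):
        if doc.get("mediaType") == "text":
            continue  # assumption: article text is not uploaded as media
        yield {
            "doc_id": doc["doc_id"],
            "mediaType": doc.get("mediaType"),
            # assumption: fall back to the original URL when there is no S3 copy
            "mediaURL": doc.get("s3URL") or doc.get("origURL"),
            **article,
        }
```

For the sample item above this would yield a single record, for the image doc, with the article's headline, postURL, domain and date_updated attached.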

gpstrange commented 2 years ago

After inserting a post into the Kosh DB, we add a foreign key field, "e_kosh_id" (the id of the record in the Kosh DB), inside stories.docs in the metadata DB. We use this "e_kosh_id" field to check whether a post has already been added to the Kosh DB.
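
The check described above could look roughly like the following in-memory sketch. It assumes an `upload` callable that inserts one doc into Kosh and returns the new id; both `sync_story` and `upload` are hypothetical names, and writing the updated story back to the metadata DB is left out:

```python
from typing import Any, Callable, Dict, List


def sync_story(story: Dict[str, Any], upload: Callable[[Dict[str, Any]], str]) -> List[str]:
    """Upload every doc that has no e_kosh_id yet and record the id it was given."""
    new_ids: List[str] = []
    for doc in story.get("docs", []):
        if doc.get("e_kosh_id"):
            continue  # already in Kosh, so skip to avoid a duplicate
        kosh_id = upload(doc)
        doc["e_kosh_id"] = kosh_id  # the foreign key doubles as the "added" marker
        new_ids.append(kosh_id)
    return new_ids
```

Because the marker is set right after each upload, running the function a second time on the same story uploads nothing.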