tattle-made / factchecking-sites-scraper

A repo to store helper functions for scraping + experiments/visualisations
GNU General Public License v3.0

Proposal for a new way to structure Scrapers #10

Closed dennyabrain closed 2 years ago

dennyabrain commented 3 years ago

I have identified the following problems in the existing structure:

  1. Large modules that perform many intertwined tasks, which makes debugging very hard
    1. Debugging an issue in one module often involves running the scraper multiple times, which means downloading the same data multiple times and makes for very inefficient, time-consuming debugging
  2. An unnecessary dependency on the database, which makes it hard to develop and debug locally
  3. No historical record of the various steps in scraping, which makes it hard to investigate issues and where things went wrong once you discover a bug weeks/months into operation

When I reviewed the code and looked at some of the other scrapers we have made since, I noticed some clear boundaries of separation between tasks, which I want to encode in the code. I am just going to propose some Entities and what they should be responsible for; I leave the implementation and organization of the code to the developer (though we can discuss that if you want). As long as we can trace the Entities mentioned here back to some unit in the code, it should be fine.

Scraper

We define a Scraper as a unit of code that is run periodically; it finds new articles that need to be scraped, downloads the data from those articles, and saves it in Tattle's storage.

Every Scraper has a Domain and each Domain has an associated Language.

Scrapers have the following entities

Crawler

This is the unit responsible for discovering URLs that need to be scraped. In the case of fact-checking websites, this involves finding the URLs of every article published since the last time our scraper ran. The time of the last scrape can be stored locally in a file or in a remote Mongo; Crawlers should be configurable as to where they look for this time.
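A minimal sketch of what such a configurable Crawler could look like (the class, method, and file names here are illustrative, not from this repo):

import json
from pathlib import Path

# Hypothetical sketch: the last-scrape time lives in a local JSON file here;
# a "remote" store would read/write the same value from Mongo instead.
class Crawler:
    def __init__(self, domain, time_store="local", state_file="scrape_time.json"):
        self.domain = domain
        self.time_store = time_store          # "local" or "remote"
        self.state_file = Path(state_file)

    def last_scrape_time(self):
        """Return the last scrape time in millis, or 0 if never scraped."""
        if self.time_store == "local":
            if self.state_file.exists():
                return json.loads(self.state_file.read_text()).get(self.domain, 0)
            return 0
        raise NotImplementedError("remote (Mongo) time store not sketched here")

    def save_scrape_time(self, millis):
        state = json.loads(self.state_file.read_text()) if self.state_file.exists() else {}
        state[self.domain] = millis
        self.state_file.write_text(json.dumps(state))

    def discover(self):
        """Return URLs of articles published since last_scrape_time();
        the site-specific listing-page logic goes here."""
        raise NotImplementedError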

ArticleDownloader

For every article discovered in the previous step, we need to download the raw data returned by the server and save it persistently. This gives us an original copy of the data returned by the fact-checking website.
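A minimal sketch, assuming a hash-of-URL naming scheme for the saved files (the scheme itself is an assumption, not the repo's convention):

import hashlib
from pathlib import Path

import requests

# Hypothetical sketch: save the raw response bytes verbatim so we always keep
# an original copy of what the server returned.
def download_article(url, raw_dir="../data/raw"):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    path = Path(raw_dir) / (hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(resp.content)  # bytes, not text: preserve exactly what was served
    return path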

ArticleParser

This parses the raw data downloaded in the previous step and extracts the data of interest, like author name, publication date, article text, etc. This entity is also responsible for extracting the URLs of relevant embedded media: images and videos at the least, and tweets, YouTube videos and Facebook links if possible.
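A minimal sketch of the idea; the selectors are placeholders, since every site will need its own:

from bs4 import BeautifulSoup

# Hypothetical sketch: extract a few fields of interest plus embedded-media
# URLs from the raw HTML saved by the ArticleDownloader.
def parse_article(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    return {
        "headline": soup.title.get_text(strip=True) if soup.title else None,
        "author": getattr(soup.find(class_="author"), "text", None),
        "images": [img["src"] for img in soup.find_all("img", src=True)],
        "videos": [v["src"] for v in soup.find_all("iframe", src=True)],
    }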

EmbeddedMediaDownloader

The parsed data might contain URLs to things like files and tweets; this entity is responsible for downloading them. Downloading images and videos from the site's server is essential functionality; it's ok to skip tweets, YouTube videos, etc. This entity is also responsible for updating the parsed data from the previous step with information about the downloaded files/media.
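A minimal sketch; the local_media field it adds to the parsed data is illustrative:

from pathlib import Path

import requests

# Hypothetical sketch: download each image URL found by the parser and write
# the local file path back into the parsed data for the later steps.
def download_media(parsed, media_dir="../data/raw/image_dl"):
    Path(media_dir).mkdir(parents=True, exist_ok=True)
    downloaded = {}
    for url in parsed.get("images", []):
        resp = requests.get(url, timeout=30)
        if resp.status_code != 200:
            continue  # skip failures; tweets/YouTube embeds can also be skipped
        path = Path(media_dir) / Path(url.split("?")[0]).name
        path.write_bytes(resp.content)
        downloaded[url] = str(path)
    parsed["local_media"] = downloaded
    return parsed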

DataFormatter

It takes the parsed data from the previous two steps and structures it in a form that’s suitable for storing in our db. For backwards compatibility reasons we have to store two redundant copies of the data with slightly different structures.
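A minimal sketch of the idea; the two shapes and field names here are illustrative, not the actual db schemas:

# Hypothetical sketch: produce the two slightly different document shapes that
# are stored redundantly for backwards compatibility.
def format_for_db(parsed):
    legacy = {
        "postURL": parsed["url"],
        "headline": parsed["headline"],
        "author_name": (parsed.get("author") or {}).get("name"),
    }
    current = {
        "postURL": parsed["url"],
        "headline": parsed["headline"],
        "author": parsed.get("author"),
        "docs": parsed.get("docs", []),
    }
    return legacy, current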

DataUploader

Uploads the formatted data to remote mongo and the downloaded files to an s3 bucket.
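A minimal sketch using boto3 and pymongo; the bucket, db, and collection names are placeholders:

import boto3
from pymongo import MongoClient

# Hypothetical sketch: upload downloaded files to s3, record their URLs on the
# document, then insert the document into the remote Mongo.
def upload(doc, media_files, bucket="tattle-scraped-media",
           mongo_uri="mongodb://localhost:27017"):
    s3 = boto3.client("s3")
    s3_urls = {}
    for local_path, key in media_files:   # media_files: [(local path, s3 key), ...]
        s3.upload_file(local_path, bucket, key)
        s3_urls[key] = f"https://{bucket}.s3.amazonaws.com/{key}"
    doc["s3URLs"] = s3_urls
    MongoClient(mongo_uri)["factchecking"]["posts"].insert_one(doc)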

Additional Notes

Each scraper needs a mode that can be 'local' or 'remote' (a minimal sketch follows the list):

  1. in local mode, files are downloaded to the local fs and data is stored in a local .json file
  2. in remote mode, files are uploaded to our s3 bucket and data is stored in our mongodb
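A minimal sketch of the mode switch; the paths and function name are illustrative:

import json
from pathlib import Path

# Hypothetical sketch: one storage function that switches on the mode.
def store(doc, mode="local", out_file="../data/output.json"):
    if mode == "local":
        out = Path(out_file)
        docs = json.loads(out.read_text()) if out.exists() else []
        docs.append(doc)
        out.write_text(json.dumps(docs, indent=2, default=str))
    elif mode == "remote":
        # would call the DataUploader instead: s3 upload + Mongo insert
        raise NotImplementedError
    else:
        raise ValueError(f"unknown mode: {mode}")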
tarunima commented 3 years ago

this is a fantastic breakdown. Wanted to flag, for the article parser: sites could have different data points that could be scraped. Some sites organize content by type of story (politics, health, 'viral'); this would be a good data point to store. We shouldn't be limited by the fields in the existing dataset.

tarunima commented 3 years ago

Specifically for ClaimReview, we should explore whether the more convenient strategy is to pull it from the sites themselves or from the ClaimReview API (https://toolbox.google.com/factcheck/apis). This is definitely a long-term task and not to be done this sprint.

dennyabrain commented 3 years ago

+1 to the point about the article parser. Each scraper (which corresponds to a website) will most probably have its own article parser, because we might be able to extract different fields depending on what's available to scrape.

I'm not sure how important this difference is to capture in the code, but I feel like most sites have some common fields (date of publication, author name, headline) and some fields unique to that website. The one that comes to mind: for every article Vishwas News publishes, they also have a dedicated place on the site where they say if the claim is misleading, false or true, so that could be useful to capture. I think it should be easy to look at the code of a scraper's article parser and tell what fields it will extract: which are standard fields and which are specific to this scraper only (one way to make that split visible is sketched below). This will definitely also make it very easy to catch exceptions if the structure of the website changes.
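A sketch of that split, with illustrative class names and placeholder selectors:

from bs4 import BeautifulSoup

# Hypothetical sketch: the base parser owns the fields every site shares; each
# site's parser subclasses it and adds its own, so reading a parser tells you
# exactly which fields are standard and which are site-specific.
class BaseArticleParser:
    def parse(self, soup):
        return {
            "headline": soup.title.get_text(strip=True) if soup.title else None,
            "author": getattr(soup.find(class_="author"), "text", None),
            "date_published": getattr(soup.find("time"), "text", None),
        }

class VishwasNewsParser(BaseArticleParser):
    def parse(self, soup):
        data = super().parse(soup)
        # site-specific field: the misleading/false/true label on each article
        data["claim_rating"] = getattr(soup.find(class_="claim-rating"), "text", None)
        return data

# usage: VishwasNewsParser().parse(BeautifulSoup(raw_html, "html.parser"))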

dennyabrain commented 3 years ago

About your point wrt ClaimReview: since the ClaimReview data is already on the webpage, it makes sense to do it within the article parser itself, no? Given the prominence of ClaimReview, we should define a special Entity (maybe ClaimsReviewParser) within ArticleParser to extract the ClaimReview fields.

I was looking at an altnews article and saw this clean ClaimReview section in the HTML:

{
  "@context": "http://schema.org",
  "@type": "ClaimReview",
  "datePublished": "2020-10-08",
  "url": "https://www.altnews.in/fact-check-were-rahul-gandhi-and-priyanka-gandhi-cought-laughing-on-way-to-meet-hathras-victim-family/",
  "itemReviewed": {
    "@type": "CreativeWork",
    "author": {
      "@type": "Person",
      "name": "Anima Sonkar"
    },
    "datePublished": "2016-06-20"
  },
  "claimReviewed": "Rahul and Priyanka Gandhi laughing on their way to Hathras",
  "author": {
    "@type": "Organization",
    "name": "Alt News",
    "url": "https://www.altnews.in"
  },
  "reviewRating": {
    "@type": "Rating",
    "ratingValue": "1",
    "bestRating": "5",
    "worstRating": "1",
    "alternateName": "False"
  }
}

Pretty straightforward to extract data from. Probably overkill to make API calls to Google's endpoints for this?

Also, my assumption is that all fact checkers who do implement ClaimReview on their site will have such well-structured data in their webpage.
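For reference, a minimal sketch of pulling such a snippet out of a page, assuming (as in the example above) the site embeds it as a standard JSON-LD script tag:

import json

import requests
from bs4 import BeautifulSoup

# Hypothetical sketch: scan every JSON-LD script tag on the page and keep any
# ClaimReview objects found.
def extract_claim_reviews(url):
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    reviews = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue
        # a page may embed a single object or a list of them
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "ClaimReview":
                reviews.append(item)
    return reviews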

duggalsu commented 3 years ago

Hi,

I have implemented the pipeline so that each task can run independently once previous tasks complete. This allows us to debug and re-run a single step without re-running (and re-downloading) everything before it.

Here is the output of each step (the handoff between steps is sketched after this list) -

  1. Crawler

    • ../temp_pipeline_output/crawledurls\<domain>.pkl - pickle file containing a list of crawled urls. Note: this is needed by the Article Downloader
    • scrape_time.json - contains the last scrape time as a dict { url: time_in_millis }. Note: scraping by a specified time is not fully implemented yet
  2. Article Downloader

    • ../temp_pipeline_output/downloadedarticles\<domain>.pkl - pickle file containing a dict of url and downloaded file path { url: filepath }. Note: this is needed by the Parser and Data Uploader, and is deleted by the Data Uploader
    • ../data/raw/\<domain>/.html - downloaded raw html. Note: the crawler output file is deleted at this step, as future steps do not depend on it
  3. Parser

    • ../temp_pipeline_output/parsedarticles\<domain>.pkl - pickle file containing the following text - "Parse Success!". Note: this is only used for the failure-recovery message when running the Article Downloader, and is deleted by the Data Uploader.
    • Writes parsed output to the db. Note: this will contain updated s3 URLs after the Data Uploader is run
      {'_id': ObjectId('5f88361a71ff1e52a7b82a45'),
      'postID': 'a38678e17c814630ae025cd7283d68a1',
      'postURL': 'https://www.altnews.in/goa-congress-member-falsely-branded-as-naxal-bhabhi-in-hathras-case/',
      'domain': 'altnews.in',
      'headline': 'Goa Congress member falsely branded as ‘Naxal Bhabhi’ in Hathras case',
      'date_accessed': 'October 15, 2020',
      'date_accessed_UTC': datetime.datetime(2020, 10, 15, 11, 44, 26, 941000),
      'date_updated': 'October 14, 2020',
      'date_updated_UTC': datetime.datetime(2020, 10, 14, 0, 0),
      'author': {'name': 'Pooja Chaudhuri',
      'link': 'https://www.altnews.in/author/pooja/'},
      's3URL': None,
      'post_category': None,
      'claims_review': None,
      'docs': [{'doc_id': 'ce3ea640b04d4182a129b2c18835ff7d',
      'postID': 'a38678e17c814630ae025cd7283d68a1',
      'domain': 'altnews.in',
      'origURL': 'https://www.altnews.in/goa-congress-member-falsely-branded-as-naxal-bhabhi-in-hathras-case/',
      's3URL': None,
      'possibleLangs': ['english'],
      'isGoodPrior': [0, 0],
      'mediaType': 'text',
      'content': 'A photograph of a woman panellist in a press conference held by Goa Pradesh Congress Committee was shared by Chayan Chatterjee, great-grandson of former Calcutta University vice-chancellor Ashutosh Mukherjee. He claimed that the woman is ‘fake Naxal bhabhi’ who hugged Congress leader Priyanka Gandhi in Hathras. Chatterjee’s tweet drew over 3,000 retweets and more than 7,000 likes.\nOn October 3, Priyanka Gandhi had visited the family of a Dalit woman who died after being allegedly gang-raped by four upper-caste men in Uttar Pradesh’s Hathras. Chatterjee’s post alleges that Gandhi hugged a woman at the victim’s residence who is associated with the Congress but posed as a family member. Renuka Jain and Arun Pudur\xa0claimed the same.\nTwitter user @AshishJaggi_1 identified the woman in the panel as ‘Dr Rajkumari Bansal’ and referred to her as #FakeNaxalBhabhi. The context behind the term will be discussed later in the report.\nSquint Neon, a page that frequently peddles misinformation, posted the same on Facebook and so did one Suresh Kochattil.\nFacebook page The Right Voice also shared the image along with a picture of Gandhi hugging a saree-clad woman.\nIn an earlier fact-check, Alt News had debunked identical false claim however circulating without the photo of the Congress panellist. A picture of Gandhi hugging Hathras victim’s mother was misused to claim the woman was ‘fake Naxal bhabhi’.\nAlt News rummaged through the Facebook page of Goa Congress and found a press conference video where the same woman can be spotted in a black saree. The press conference was held on the Hathras case. The woman has been identified as Pratibha Borkar, social media head of the committee.\nIn the viral image, Borkar and the other panellists are not wearing masks which suggests that this press conference was held prior to at least March 2020. We found that it was uploaded on September 2019.\nBorkar has the same picture currently viral as her cover photo on Facebook.\nNow coming to ‘fake Naxal bhabhi’. The term is used for Dr Rajkumari Bansal who was accused by Hathras police for posing as a relative of the victim’s family and feeding them statements to give to the cops. A News18 report claimed that she was posing as the victim’s bhabhi\xa0(sister-in-law).\xa0Dr Bansal is an assistant professor at the forensic department of Jabalpur medical college. She claimed that her visit was for humanitarian reasons and she wanted to examine documents pertaining to the deceased’s medical treatment considering her forensic expertise. Dr Bansal also said that she was providing the family financial assistance. She was accused by some media outlets of being a Naxalite, a claim that she has denied.\nA quick facial comparison of Dr Bansal and Pratibha Borkar proves they are different people. The most evidently contrasting feature is their nose.\nJabalpur-based Dr Rajkumari Bansal was misidentified as Congress member Pratibha Borkar to claim that a woman associated with the party posed as Hathras victim’s family member whom Priyanka Gandhi hugged. 
There are more than enough media reports that Gandhi had hugged the victim’s mother.\nDonate Now\nPooja Chaudhuri is a senior editor at Alt News.\nSocial Media Coordinator of Jharkhand Pradesh Congress Sevadal Rameez Raza shared a video on Twitter…\nA new development in the Hathras case came on October 10, when it was reported…\nSeveral social media users have shared an image of two outside-broadcasting (OB) vans severely vandalised….\nA 19-year old Dalit woman who was allegedly gang-raped in Hathras succumbed to injuries on…\nA CCTV footage of a man sitting inside his car mercilessly thrashed by a group…\nAlt News',
      'nowDate': 'October 15, 2020',
      'nowDate_UTC': datetime.datetime(2020, 10, 15, 11, 44, 26, 941000)},
      {'doc_id': 'a1fdc005ebb84de98a14851554dd8773',
      'postID': 'a38678e17c814630ae025cd7283d68a1',
      'domain': 'altnews.in',
      'origURL': 'https://www.youtube.com/embed/RxOgleRfnvM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent',
      's3URL': None,
      'possibleLangs': ['english'],
      'isGoodPrior': [0, 0],
      'mediaType': 'video',
      'content': None,
      'nowDate': 'October 15, 2020',
      'nowDate_UTC': datetime.datetime(2020, 10, 15, 11, 44, 26, 941000)},
      {'doc_id': 'b608fa1c122046708fed1634dfefb557',
      'postID': 'a38678e17c814630ae025cd7283d68a1',
      'domain': 'altnews.in',
      'origURL': 'https://i0.wp.com/www.altnews.in/wp-content/uploads/2020/10/899a7a60-51bf-4709-83e0-9d6a7b108edd.jpg?resize=591%2C539',
      's3URL': None,
      'possibleLangs': ['english'],
      'isGoodPrior': [0, 0],
      'mediaType': 'image',
      'content': None,
      'nowDate': 'October 15, 2020',
      'nowDate_UTC': datetime.datetime(2020, 10, 15, 11, 44, 26, 941000)},
      {'doc_id': '2c5785f8f00d43c5ba9f69bba8e9406f',
      'postID': 'a38678e17c814630ae025cd7283d68a1',
      'domain': 'altnews.in',
      'origURL': 'https://i0.wp.com/www.altnews.in/wp-content/uploads/2020/10/07c2ef99-d91a-40fc-b00b-6409d9d1979f.jpg?resize=675%2C495',
      's3URL': None,
      'possibleLangs': ['english'],
      'isGoodPrior': [0, 0],
      'mediaType': 'image',
      'content': None,
      'nowDate': 'October 15, 2020',
      'nowDate_UTC': datetime.datetime(2020, 10, 15, 11, 44, 26, 941000)},
      {'doc_id': '9524d53150e243e98312a454e5364c78',
      'postID': 'a38678e17c814630ae025cd7283d68a1',
      'domain': 'altnews.in',
      'origURL': 'https://i1.wp.com/www.altnews.in/wp-content/uploads/2020/10/40bfc523-62e1-4f3f-b381-7e0bc6cb3901.jpg?resize=810%2C425',
      's3URL': None,
      'possibleLangs': ['english'],
      'isGoodPrior': [0, 0],
      'mediaType': 'image',
      'content': None,
      'nowDate': 'October 15, 2020',
      'nowDate_UTC': datetime.datetime(2020, 10, 15, 11, 44, 26, 941000)},
      {'doc_id': '801660cdc67b4ad59d3294d89ba69600',
      'postID': 'a38678e17c814630ae025cd7283d68a1',
      'domain': 'altnews.in',
      'origURL': 'https://i2.wp.com/www.altnews.in/wp-content/uploads/2020/10/9db5e2d8-4d88-4cac-8ea1-a0a48b5d6d6d.jpg?resize=810%2C370',
      's3URL': None,
      'possibleLangs': ['english'],
      'isGoodPrior': [0, 0],
      'mediaType': 'image',
      'content': None,
      'nowDate': 'October 15, 2020',
      'nowDate_UTC': datetime.datetime(2020, 10, 15, 11, 44, 26, 941000)},
      {'doc_id': 'f1df6fc1dd084ba3a5f89262a0cf2975',
      'postID': 'a38678e17c814630ae025cd7283d68a1',
      'domain': 'altnews.in',
      'origURL': 'https://i1.wp.com/www.altnews.in/wp-content/uploads/2020/10/rodeo.jpg?resize=810%2C424',
      's3URL': None,
      'possibleLangs': ['english'],
      'isGoodPrior': [0, 0],
      'mediaType': 'image',
      'content': None,
      'nowDate': 'October 15, 2020',
      'nowDate_UTC': datetime.datetime(2020, 10, 15, 11, 44, 26, 941000)}]}
  4. Embedded Media Downloader

    • ../data/raw/image_dl/\<downloaded images>. Note: I have not fully implemented downloading videos yet, until we decide if it is appropriate to do so
    • ../temp_pipeline_output/media_dl_image_filename.pkl - pickle file containing a dict - { media_url: [doc, filename] }
  5. Data Uploader

    • Uploads to s3:
      • raw articles - from the Article Downloader step
      • media content - from the Embedded Media Downloader step
    • Updates the s3 post URL and media URLs in the DB
    • Deletes the transient output of all previous steps:
      • pickle files
      • articles
      • media content
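A sketch of the handoff pattern this list describes (the helper names and file-naming details are illustrative): each step writes a pickle for the next one, so any step can be re-run alone.

import pickle
from pathlib import Path

PIPELINE_DIR = Path("../temp_pipeline_output")

# Hypothetical sketch: each step persists its output for the next step and can
# be re-run in isolation by re-reading the previous step's pickle.
def save_step_output(name, domain, payload):
    PIPELINE_DIR.mkdir(parents=True, exist_ok=True)
    with open(PIPELINE_DIR / f"{name}_{domain}.pkl", "wb") as f:
        pickle.dump(payload, f)

def load_step_output(name, domain):
    with open(PIPELINE_DIR / f"{name}_{domain}.pkl", "rb") as f:
        return pickle.load(f)

# e.g. the Crawler hands its URL list to the Article Downloader:
# save_step_output("crawledurls", "altnews.in", list_of_urls)
# urls = load_step_output("crawledurls", "altnews.in")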

Hope this makes sense!

dennyabrain commented 3 years ago

@duggalsu just a clarification regarding "Note: I have not yet fully implemented downloading videos until we decide if it is appropriate to do so": if the video is hosted on the fact-checking website itself, then we should download it. If it's hosted on something like YouTube, then we can defer on it.

@tarunima @kruttikanadig can we do a preliminary analysis to see what the norm is for fact checkers when they fact-check a video: do they have the original video on their own website, host it on their YouTube account, or link to someone else's YouTube?

dennyabrain commented 3 years ago

Thank you @duggalsu! What you've written makes sense. Eager to see it in action :)

tarunima commented 3 years ago

@dennyabrain yes, all fact-checking sites should contain that structured snippet. I will go through a couple of fact-checking sites to see what the convention on video is.

tarunima commented 2 years ago

The scraper was implemented using the five entities, but the linking of the entities differs from what was suggested. Current implementation in: https://github.com/tattle-made/tattle-research/tree/master/scraper_v3