public-people / scrape-news

Scrape South African news
MIT License

Add Sunday Times spider #19

Closed: kblum closed this 4 years ago

kblum commented 6 years ago

This pull request adds a spider for the Sunday Times as per the Sunday Times scraper card on the Public People Trello board.

The Sunday Times is hosted on the TimesLIVE website, with https://sundaytimes.co.za redirecting to https://www.timeslive.co.za/sunday-times/.

All of the Sunday Times articles appear to have https://www.timeslive.co.za/sunday-times/ as the base URL, and they can be found through the standard TimesLIVE sitemap. For example, the https://www.timeslive.co.za/sitemap/business/ sitemap entry includes Sunday Times articles.

The SundayTimesSpider class is therefore implemented as a subclass of the TimesliveSpider class that parses only URLs whose path contains /sunday-times/.
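
As a rough illustration of that filtering idea (not the code in this pull request), the sketch below uses a plain-Python stand-in for the parent spider so it runs without Scrapy. The names TimesliveSpiderStub, SundayTimesSpiderSketch, should_parse and path_fragment are hypothetical, as are the example URLs; the real TimesliveSpider may expose a different hook.

```python
from urllib.parse import urlparse


class TimesliveSpiderStub:
    """Hypothetical stand-in for the project's TimesliveSpider (not the real class)."""

    name = "timeslive"

    def should_parse(self, url):
        # Assumption: the parent accepts every URL it finds in the sitemap.
        return True


class SundayTimesSpiderSketch(TimesliveSpiderStub):
    """Subclass that keeps only URLs whose path contains /sunday-times/."""

    name = "sundaytimes"
    path_fragment = "/sunday-times/"

    def should_parse(self, url):
        # Check the path only, so a query string mentioning the fragment is ignored.
        path = urlparse(url).path
        return super().should_parse(url) and self.path_fragment in path


if __name__ == "__main__":
    spider = SundayTimesSpiderSketch()
    # Both URLs are made-up examples, not real articles.
    for url in (
        "https://www.timeslive.co.za/sunday-times/business/example-article/",
        "https://www.timeslive.co.za/news/example-article/",
    ):
        print(url, spider.should_parse(url))
```

Overriding a single predicate keeps the sitemap handling and article parsing in the parent class, so the subclass only narrows which URLs are followed.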

This is my first time working on this project as well as with Scrapy, so please let me know if I have missed something.

jbothma commented 4 years ago

Thanks very much! Apologies about the insane delay!