A ScrapingHub project for scraping news from https://www.gismeteo.ua/news/.
The project is written in Python 3 (tested on 3.5).
It has only one spider, `gismeteo`. It collects article URLs that have not been scraped yet, follows them, and scrapes the articles into a Google Drive spreadsheet.
Follow the steps from this page to generate a `.json` key, and save it to the `scrapy_climate` folder under the name `client-secret.json`.
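How the key is consumed isn't shown here; as a minimal sketch, assuming `gspread` with `oauth2client` is what opens the spreadsheet (the project's actual client code may differ):

```python
# Sketch only: authorize Google Sheets access with the service-account key.
# Assumes gspread + oauth2client; the project's real code may differ.
import gspread
from oauth2client.service_account import ServiceAccountCredentials

scope = ["https://spreadsheets.google.com/feeds",
         "https://www.googleapis.com/auth/drive"]
credentials = ServiceAccountCredentials.from_json_keyfile_name(
    "scrapy_climate/client-secret.json", scope)
gc = gspread.authorize(credentials)

# Open the spreadsheet by title; the title comes from options.json (see below)
spreadsheet = gc.open("<google_drive_spreadsheet_title>")
```

Remember to share the spreadsheet with the service account's e-mail address, otherwise opening it will fail.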
Deploying to Scrapy Cloud via the GitHub integration won't work, because it doesn't support Python 3, so you need to use the `shub` command-line tool. Pass your project id to each `shub deploy` call, or set it in `scrapinghub.yml`.
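For example, a minimal `scrapinghub.yml` could look like this (the project id is a placeholder):

```yaml
# scrapinghub.yml sketch: 12345 is a placeholder project id
projects:
  default: 12345
```

With a default project set, a plain `shub deploy` deploys without extra arguments.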
To run a spider you need to create `options.json` in the `scrapy_climate` folder. Use JSON format to define the variables, like:
```json
{
    "SCRAPY_CLOUD_API_KEY": "<scrapy_cloud_api_key>",
    "SCRAPY_CLOUD_PROJECT_ID": "<scrapy_cloud_project_id>",
    "SPREADSHEET_TITLE": "<google_drive_spreadsheet_title>",
    "SPIDER_TO_WORKSHEET_DICTIONARY": {
        "gismeteo": 1,
        "<spider>": <worksheet_id_starting_from_zero>
    }
}
```
This file is ignored by git, but will be deployed to ScrapingHub.
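The exact way the project loads this file isn't shown here; a rough sketch of such a loader (the file name and keys match the example above, everything else is hypothetical):

```python
# Hypothetical loader for scrapy_climate/options.json
import json
import os

here = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(here, "options.json")) as f:
    options = json.load(f)

api_key = options["SCRAPY_CLOUD_API_KEY"]
project_id = options["SCRAPY_CLOUD_PROJECT_ID"]
title = options["SPREADSHEET_TITLE"]
# Worksheet index for the spider that is currently running
worksheet_id = options["SPIDER_TO_WORKSHEET_DICTIONARY"]["gismeteo"]
```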
The pipeline passes items to a `StorageMaster`, which appends them to the Google Drive spreadsheet defined in the spider arguments. Each article becomes a row containing its url, header, tags, and body, written to the worksheet mapped to the currently running spider in `options.json` (the worksheet must already exist, or the spider will raise a `RuntimeError`). When all items have been added, the master finishes its work with a summary row containing the URL of the job on ScrapingHub, the CPU datetime, the number of scraped articles, and two `-----` strings.
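The `StorageMaster` internals live in the project code; purely as an illustration, appending the article rows and the closing summary row with `gspread` could look roughly like this (the helper and field names are assumptions based on the description above):

```python
# Illustration only: what StorageMaster is described as doing.
# The helper and the item field names are assumptions, not project code.
import datetime

def flush_items(worksheet, items, job_url):
    for item in items:
        # One row per article: url, header, tags, body
        worksheet.append_row([item["url"], item["header"],
                              item["tags"], item["body"]])
    # Closing row: job URL, CPU datetime, article count, two '-----' strings
    worksheet.append_row([job_url, str(datetime.datetime.now()),
                          len(items), "-----", "-----"])
```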
When the spider scrapes the news page, it first fetches the list of indexes of articles scraped during the last week, using the Scrapy Cloud API. It then iterates over the links to articles and scrapes only the articles whose indexes are not in that list.
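A rough sketch of that deduplication step, assuming the `python-scrapinghub` client (the `index` field name and the job query parameters are assumptions):

```python
# Sketch: collect indexes of already-scraped articles via the Scrapy Cloud API.
from scrapinghub import ScrapinghubClient

def fetch_scraped_indexes(api_key, project_id, spider="gismeteo"):
    client = ScrapinghubClient(api_key)
    project = client.get_project(project_id)
    indexes = set()
    # Walk the most recent finished runs of this spider
    for job_meta in project.jobs.iter(spider=spider, state="finished", count=20):
        job = client.get_job(job_meta["key"])
        for item in job.items.iter():
            indexes.add(item.get("index"))  # 'index' field name is an assumption
    return indexes
```

The spider can then skip any article link whose index is already in the returned set.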
The `scrapy_climate/spider.py` module contains a `TemplateSpider` class that can be used as a parent for the spiders that actually run. To make a new spider, you need to configure its selectors and define its name, domain, and the relative path to the first page. See the docstrings for details.
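As an illustration, a subclass might look roughly like this (everything except subclassing `TemplateSpider` and setting `name` is a hypothetical interface; check the docstrings in `scrapy_climate/spider.py` for the real attribute names):

```python
# Hypothetical example of a concrete spider built on TemplateSpider.
from scrapy_climate.spider import TemplateSpider

class ExampleSpider(TemplateSpider):
    name = "example"              # spider name, also used in options.json
    domain = "www.example.com"    # hypothetical: site to crawl
    start_path = "/news/"         # hypothetical: relative path to first page
    # Hypothetical selector configuration for the article fields
    selectors = {
        "header": "h1::text",
        "tags": ".tags a::text",
        "body": ".article-body ::text",
    }
```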