thisisayush closed this issue 6 years ago.
(1) Inefficient Database Management
How is the database inefficient? Please throw some light on it!
(2) Agreed with this change.
(3) I have implemented a method to prevent that; please review the updated code.
(4) Please read the documentation on Python's datetime; the process_date function does not have any issues!
(5) I do plan to automate it if possible; please explain the proposed idea.
Table "public.news_table"
Column | Type | Modifiers
----------+-------------------+---------------------------------------------------------
id | integer | not null default nextval('news_table_id_seq'::regclass)
title | character varying | not null
content | character varying | not null
link | character varying | not null
image | character varying | not null
newsdate | character varying | not null
site_id | smallint | not null
Indexes:
"news_table_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"news_table_site_id_fkey" FOREIGN KEY (site_id) REFERENCES site_table(id)
Current table schema for news_table (shown above):
- link doesn't have a UNIQUE constraint to prevent mistakes of adding duplicates.
- newsdate is varchar while it should be timetz.
- Proposed: addition of a scraped_date column to the table.
- The site_id foreign key doesn't define ON DELETE.
(3) The change being asked for should not be included in the pipeline but in the spider itself, to prevent it from sending a request to fetch and scrape a URL that has already been scraped. The idea is to prevent the request, not to drop the item.
from datetime import datetime  # module-level import assumed in the pipeline

def process_date(self, itemDate, link, spiderName):
    # Convert each spider's raw date string into the common ISO format.
    # Note: string comparison needs '==', not the identity check 'is'.
    try:
        if spiderName == 'indianExpressTech':
            processedItemDate = datetime.strptime(itemDate, "%B %d, %Y %I:%M %p").strftime("%Y-%m-%dT%H:%M:%S")
        elif spiderName == 'moneyControl':
            processedItemDate = datetime.strptime(itemDate, "%B %d, %Y %I:%M %p %Z").strftime("%Y-%m-%dT%H:%M:%S")
        elif spiderName == 'indiaTv':
            processedItemDate = datetime.strptime(itemDate, "%B %d, %Y %H:%M").strftime("%Y-%m-%dT%H:%M:%S")
        elif spiderName == 'zee':
            processedItemDate = datetime.strptime(itemDate, "%b %d, %Y, %H:%M %p").strftime("%Y-%m-%dT%H:%M:%S")
        # elif spiderName == 'ndtv':
        #     processedItemDate = datetime.strptime(itemDate, "%A %B %d, %Y").strftime("%Y-%m-%dT%H:%M:%S")
        elif spiderName == 'asianage':
            processedItemDate = datetime.strptime(itemDate, "%d %b %Y %I:%M %p").strftime("%Y-%m-%dT%H:%M:%S")
        elif spiderName == 'News18Spider':
            # News18 already yields a datetime object, not a string
            processedItemDate = itemDate.strftime("%Y-%m-%dT%H:%M:%S")
        else:
            processedItemDate = itemDate
    except ValueError as Error:
        # Unparseable date: log it and pass the raw value through
        loggerError.error(Error, link)
        processedItemDate = itemDate
    finally:
        return processedItemDate
The above method processes dates spider by spider to convert them to a general format. The proposal is to use the `dateutil` library to parse the most common formats and then convert them to the general format, so the code reduces to just:
from dateutil import parser  # module-level import

def process_item(self, item, spider):
    # dateutil guesses the format, so no per-spider branching is needed
    date_parsed = parser.parse(item['date'], ignoretz=False)
    item['date'] = date_parsed.strftime('%Y-%m-%d %H:%M:%S %z')
    return item
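As a quick sanity check, `dateutil` copes with several of the raw formats handled above without any per-spider branching. A small demo (the sample strings are invented here to mirror those formats):

from dateutil import parser

samples = [
    "March 5, 2018 10:30 AM",  # indianExpressTech style
    "05 Mar 2018 10:30 AM",    # asianage style
    "March 5, 2018 22:15",     # indiaTv style
]
for raw in samples:
    print(parser.parse(raw).strftime("%Y-%m-%dT%H:%M:%S"))
# prints 2018-03-05T10:30:00, 2018-03-05T10:30:00, 2018-03-05T22:15:00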
(3) The change being asked for should not be included in the pipeline but in the spider itself, to prevent it from sending a request to fetch and scrape a URL that has already been scraped. The idea is to prevent the request, not to drop the item.
Please read the code in the spiders; they do not send unnecessary requests. :)
The above method processes dates spider by spider to convert them to a general format; the proposal includes using the `dateutil` library to parse the most known formats and then convert them to a general format.

If dateutil handles this more efficiently, I would welcome the change. :)
Current table schema for news_table: link doesn't have a UNIQUE constraint to prevent mistakes of adding duplicates. newsdate is varchar while it should be timetz. Proposed: addition of a scraped_date column to the table.

Issue active in #35; please refer to it.

The site_id foreign key doesn't define ON DELETE.

I agree with the above-mentioned changes to the database.
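For reference, a rough sketch of what those agreed changes could look like, applied through a psycopg2 cursor as elsewhere in the project (the constraint names, the DSN, and the ON DELETE CASCADE choice are assumptions; timestamptz rather than timetz is assumed, since a news date needs a date part as well as a time):

import psycopg2

# Sketch only: names, DSN, and ON DELETE behaviour are assumptions
conn = psycopg2.connect("dbname=scrapenews")
with conn, conn.cursor() as cursor:
    cursor.execute("ALTER TABLE news_table ADD CONSTRAINT news_table_link_key UNIQUE (link)")
    cursor.execute("ALTER TABLE news_table ALTER COLUMN newsdate TYPE timestamptz "
                   "USING newsdate::timestamptz")
    cursor.execute("ALTER TABLE news_table ADD COLUMN scraped_date timestamptz "
                   "NOT NULL DEFAULT now()")
    cursor.execute("ALTER TABLE news_table DROP CONSTRAINT news_table_site_id_fkey")
    cursor.execute("ALTER TABLE news_table ADD CONSTRAINT news_table_site_id_fkey "
                   "FOREIGN KEY (site_id) REFERENCES site_table(id) ON DELETE CASCADE")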
Please read the code in the spiders; they do not send unnecessary requests. :)
I have read them, and then suggested the change :-) Take a look at the snippets below:
# Spider asianage
for url in self.start_urls:
    yield scrapy.Request(url, self.parse)

# Spiders firsposthindi, firstpostsports, etc., included in commit 66b5622bb4bbb09cce9fa99451840446dac37555
self.cursor.execute("""SELECT link from news_table where link= %s """, (link,))
if not self.cursor.fetchall():
    yield scrapy.Request(url=link, callback=self.parse_article)

# Spider Hindustan
for somethings in something:
    link = somethings.xpath('.//div[contains(@class,"media-body")]/div[1]/a/@href').extract_first()
    yield scrapy.Request(url=link, callback=self.fun)
All of these either send unhandled requests to the server or contain a database query embedded in the spider itself. (Just think what happens if the table names change; shit, there are supposed to be 60 spiders, man!)
How about:

if not NewsDatabase().urlExists(href):
    yield scrapy.Request(url=href, callback=self.parse_news)
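A minimal sketch of what that helper might look like (NewsDatabase and urlExists are the proposed names from the snippet above, not existing code; the connection details are assumptions):

import psycopg2

class NewsDatabase(object):
    def __init__(self):
        # Assumed DSN; a real pipeline would share one connection
        self.connection = psycopg2.connect("dbname=scrapenews")
        self.cursor = self.connection.cursor()

    def urlExists(self, link):
        # One place that knows the table name, instead of ~60 spiders
        self.cursor.execute("SELECT 1 FROM news_table WHERE link = %s", (link,))
        return self.cursor.fetchone() is not None

This keeps the SQL out of the spiders, so a table rename touches one class rather than every spider.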
I agree with this change.
Issue seems solved, closing.
While going through ScrapeNews, the following issues were detected:
- The process_date function defines the processing of date formats for different spiders; instead, a library like dateutil can be used, which parses most datetime formats and converts them to a Python date object that can then be formatted accordingly before being sent to the database.
- custom_settings can be used to store spider info and get spider details in the pipeline, as sketched below.
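A sketch of the custom_settings idea (the DATE_FORMAT key is hypothetical): Scrapy merges a spider's custom_settings into that crawl's settings, so a pipeline can read per-spider details via spider.settings instead of branching on spider names.

from datetime import datetime

import scrapy

class ZeeSpider(scrapy.Spider):
    name = 'zee'
    # Hypothetical per-spider key picked up by the pipeline
    custom_settings = {'DATE_FORMAT': "%b %d, %Y, %H:%M %p"}

class NewsPipeline(object):
    def process_item(self, item, spider):
        # custom_settings are merged into this spider's settings object,
        # so the pipeline needs no if/elif chain on spider.name
        date_format = spider.settings.get('DATE_FORMAT')
        if date_format:
            item['date'] = (datetime.strptime(item['date'], date_format)
                            .strftime("%Y-%m-%dT%H:%M:%S"))
        return item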