vipulgupta2048 / scrape

Mission - scraping the planet, one website at a time
MIT License

ScrapeNews Issues #34

Closed: thisisayush closed this issue 6 years ago

thisisayush commented 7 years ago

While going through ScrapeNews, the following issues were detected:

  1. Inefficient database management.
  2. setupDb.py:
    • Database details are stored in a local file instead of the environment.
    • Methods of the class SetupDb don't explicitly declare the 'self' argument.
  3. Spiders are re-scraping URLs that have already been scraped (their data is not saved, which prevents duplicate entries, but the extra requests waste bandwidth).
  4. The process_date function hand-codes date-format parsing for each spider; instead, a library like dateutil can be used, which parses most datetime formats into a Python datetime object that can then be formatted appropriately for the database.
  5. Spider entries are defined in custom files, and setupdb.py needs to be run every time a spider is added. Instead, mechanisms like custom_settings can be used to store spider info and read the spider details in the pipeline (see the sketch below).
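
To illustrate point 5, here is a minimal sketch of storing spider info in custom_settings and reading it back in a pipeline (the class names and settings keys are hypothetical, not the repo's actual code):

    import scrapy

    class News18Spider(scrapy.Spider):
        name = "News18Spider"
        # Spider-specific info lives on the spider class itself,
        # so setupdb.py need not be run when a new spider is added
        custom_settings = {
            "SITE_NAME": "News18",                  # hypothetical key
            "SITE_URL": "https://www.news18.com/",  # hypothetical key
        }

    class StoragePipeline(object):
        def process_item(self, item, spider):
            # custom_settings are merged into the crawler settings,
            # so the pipeline can read them via spider.settings
            site_name = spider.settings.get("SITE_NAME")
            site_url = spider.settings.get("SITE_URL")
            # ...resolve site_id from site_name/site_url here...
            return item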
atb00ker commented 7 years ago

  1. Inefficient database management: how is the database inefficient? Please throw some light on it!
  2. Agreed with this change.
  3. I have implemented a method to prevent that; review the updated code.
  4. Please read the documentation on Python datetime; the process_date function does not have any issues!
  5. I do plan to automate it if possible; please explain the proposed idea.

thisisayush commented 7 years ago

Current Table Schema for news_table:

                               Table "public.news_table"
  Column  |       Type        |                        Modifiers
----------+-------------------+---------------------------------------------------------
 id       | integer           | not null default nextval('news_table_id_seq'::regclass)
 title    | character varying | not null
 content  | character varying | not null
 link     | character varying | not null
 image    | character varying | not null
 newsdate | character varying | not null
 site_id  | smallint          | not null
Indexes:
    "news_table_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "news_table_site_id_fkey" FOREIGN KEY (site_id) REFERENCES site_table(id)

thisisayush commented 7 years ago

(3) The change being asked for belongs not in the pipeline but in the spider itself: prevent the spider from sending a request to fetch and scrape a URL that has already been scraped. The idea is to prevent the request, not drop the item.

thisisayush commented 7 years ago
def process_date(self, itemDate, link, spiderName):
    # requires: from datetime import datetime
    try:
        # '==' instead of 'is': 'is' tests identity, not string equality
        if spiderName == 'indianExpressTech':
            processedItemDate = datetime.strptime(itemDate, "%B %d, %Y %I:%M %p").strftime("%Y-%m-%dT%H:%M:%S")
        elif spiderName == 'moneyControl':
            processedItemDate = datetime.strptime(itemDate, "%B %d, %Y %I:%M %p %Z").strftime("%Y-%m-%dT%H:%M:%S")
        elif spiderName == 'indiaTv':
            processedItemDate = datetime.strptime(itemDate, "%B %d, %Y %H:%M").strftime("%Y-%m-%dT%H:%M:%S")
        elif spiderName == 'zee':
            processedItemDate = datetime.strptime(itemDate, "%b %d, %Y, %H:%M %p").strftime("%Y-%m-%dT%H:%M:%S")
        # elif spiderName == 'ndtv':
        #     processedItemDate = datetime.strptime(itemDate, '%A %B %d, %Y').strftime("%Y-%m-%dT%H:%M:%S")
        elif spiderName == 'asianage':
            processedItemDate = datetime.strptime(itemDate, "%d %b %Y %I:%M %p").strftime("%Y-%m-%dT%H:%M:%S")
        elif spiderName == 'News18Spider':
            processedItemDate = itemDate.strftime("%Y-%m-%dT%H:%M:%S")
        else:
            processedItemDate = itemDate
    except ValueError as Error:
        loggerError.error(Error, link)
        processedItemDate = itemDate
    finally:
        return processedItemDate

The above method hand-codes datetime processing for each spider to convert dates to a general format; the proposal is to use the `dateutil` library, which parses most known formats, and then convert to the general format.

So the code reduces to just:

    from dateutil import parser

    def process_item(self, item, spider):
        date_parsed = parser.parse(item['date'], ignoretz=False)
        item['date'] = date_parsed.strftime('%Y-%m-%d %H:%M:%S %z')
        return item
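
For a quick sanity check, here is a minimal sketch (the sample date strings are invented to match the per-spider formats above, and the "IST" offset mapping is an assumption, since dateutil does not know that abbreviation by default):

    from dateutil import parser

    # "IST" is not a built-in tz abbreviation; map it to UTC+5:30 explicitly
    tzinfos = {"IST": 5 * 3600 + 30 * 60}
    samples = [
        "March 05, 2018 10:30 PM",      # indianExpressTech
        "March 05, 2018 10:30 PM IST",  # moneyControl
        "March 05, 2018 22:30",         # indiaTv
        "05 Mar 2018 10:30 PM",         # asianage
    ]
    for s in samples:
        print(parser.parse(s, tzinfos=tzinfos).strftime("%Y-%m-%dT%H:%M:%S"))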
atb00ker commented 7 years ago

(3) The change being asked for belongs not in the pipeline but in the spider itself: prevent the spider from sending a request to fetch and scrape a URL that has already been scraped. The idea is to prevent the request, not drop the item.

Please read the code in the spiders; they do not send unnecessary requests. :)

thisisayush commented 7 years ago

(5) Take a look at the methodology adopted at https://github.com/vipulgupta2048/scrape/blob/Scrapyd/thisisayush/news18/news18/pipelines.py#L13-L59 and https://github.com/vipulgupta2048/scrape/blob/Scrapyd/thisisayush/news18/news18/spiders/news18spider.py#L11-L18

atb00ker commented 7 years ago

The above method hand-codes datetime processing for each spider to convert dates to a general format; the proposal is to use the `dateutil` library, which parses most known formats, and then convert to the general format.

If dateutil handles this more efficiently, I would welcome the change. :)

atb00ker commented 7 years ago

Current Table Schema for news_table: the link column doesn't have a UNIQUE constraint to prevent duplicates from being added. newsdate is varchar while it should be timetz. Proposed: addition of a scraped_date column to the table. The foreign key site_id doesn't define an ON DELETE action.

Issue active in #35; please refer to it. I agree with the above-mentioned changes to the database.
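
The fixes above are plain DDL; here is a sketch of what they could look like, assuming PostgreSQL via psycopg2 (the DSN is hypothetical, timestamptz is used for newsdate on the assumption that the date part must be kept alongside the time, and CASCADE is only one possible ON DELETE action):

    import psycopg2

    conn = psycopg2.connect("dbname=scrapenews")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # Reject duplicate links at the database level
        cur.execute("ALTER TABLE news_table ADD CONSTRAINT news_table_link_key UNIQUE (link)")
        # newsdate: varchar -> timestamp with time zone (assumes the stored strings are castable)
        cur.execute("ALTER TABLE news_table ALTER COLUMN newsdate TYPE timestamptz USING newsdate::timestamptz")
        # Record when each item was scraped
        cur.execute("ALTER TABLE news_table ADD COLUMN scraped_date timestamptz NOT NULL DEFAULT now()")
        # Redefine the foreign key with an explicit ON DELETE action
        cur.execute("""ALTER TABLE news_table
                           DROP CONSTRAINT news_table_site_id_fkey,
                           ADD CONSTRAINT news_table_site_id_fkey
                               FOREIGN KEY (site_id) REFERENCES site_table(id)
                               ON DELETE CASCADE""")
    conn.close()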

thisisayush commented 7 years ago

Please read the code in the spiders; they do not send unnecessary requests. :)

I have read it and then suggested the change :-) Take a look at the two snippets below:

# Spider asianage
for url in self.start_urls:
    yield scrapy.Request(url, self.parse)

# Spiders firsposthindi, firstpostsports, etc., included in commit 66b5622bb4bbb09cce9fa99451840446dac37555
self.cursor.execute("""SELECT link from news_table where link= %s """, (link,))
if not self.cursor.fetchall():
    yield scrapy.Request(url=link, callback=self.parse_article)

# Spider hindustan
for somethings in something:
    link = somethings.xpath('.//div[contains(@class,"media-body")]/div[1]/a/@href').extract_first()
    yield scrapy.Request(url=link, callback=self.fun)

All of these either send unhandled requests to the server or embed a database query in the spider itself. (Just think what happens if the table names change; there are supposed to be 60 spiders, man!)

How about:

if not NewsDatabase().urlExists(href):
    yield scrapy.Request(url=href, callback=self.parse_news)
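
A minimal sketch of what NewsDatabase.urlExists could look like (psycopg2 and the DSN are assumptions; this is illustrative, not the repo's actual implementation):

    import psycopg2

    class NewsDatabase(object):
        """Single place for database access, so spiders never embed SQL or table names."""

        def __init__(self, dsn="dbname=scrapenews"):  # hypothetical DSN
            self.conn = psycopg2.connect(dsn)

        def urlExists(self, url):
            # True if the link is already stored in news_table
            with self.conn.cursor() as cursor:
                cursor.execute("SELECT 1 FROM news_table WHERE link = %s", (url,))
                return cursor.fetchone() is not None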
atb00ker commented 7 years ago
if not NewsDatabase().urlExists(href):
    yield scrapy.Request(url=href, callback=self.parse_news)

I agree with this change.

atb00ker commented 6 years ago

Issue seems solved; closing.