vipulgupta2048 / scrape

Mission - scraping the planet, one website at a time
MIT License

Inappropriate pipelines implementation #56

Closed thisisayush closed 6 years ago

thisisayush commented 6 years ago

The insertion operation in pipelines is being done in an inappropriate manner! Pipelines are supposed to process, filter, store, or drop items within themselves or using external functions. The current implementation hands this complete job to the database class, thereby taking away the most essential features of pipelines, viz. DropItem(), close_spider(), etc.

Making a single pipeline do all the tasks isn't preferred because Scrapy queues many requests for pipelines: if one pipeline takes too much time, other items have to wait a lot. Handling different functions in different pipelines makes data processing more reliable, faster, and more flexible!
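For illustration, a rough sketch of how the work could be split across several pipeline components ordered through ITEM_PIPELINES (the module path and class names here are made up for the example, not taken from this repository):

# settings.py (illustrative) -- lower numbers run first
ITEM_PIPELINES = {
    'scrapenews.pipelines.ValidationPipeline': 100,   # drop malformed items early
    'scrapenews.pipelines.DateParsingPipeline': 200,  # normalise newsDate
    'scrapenews.pipelines.PostgresPipeline': 300,     # only clean items reach the database
}

Each component stays small, and a slow or failing stage is easy to locate.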

Quoting Lines:

def insertIntoNewsTable(self, item):
        # Insert item into NEWS_TABLE after all the processing.
        try:
            if (self.connection.status != 1):
                self.openConnection(self.spider)
            postgresQuery = "INSERT INTO " + os.environ['NEWS_TABLE'] + " (title, content, image, link, newsDate, site_id, datescraped) VALUES (%s, %s, %s, %s, %s, %s, NOW())"
            processedDate = str(parser.parse(item.get('newsDate'), ignoretz=False))
            self.cursor.execute(postgresQuery,
                (item.get('title'),
                item.get('content'),
                item.get('image'),
                item.get('link'),
                processedDate,
                item.get('source')))
            self.spider.urls_stored += 1
        except psycopg2.IntegrityError as Error:
            # If the link already exists, this exception will be invoked
            if (Error.pgcode == '23505'):
                pass
            else:
                loggerError.error(str(Error) + " occurred at " + str(item.get('link')))
        except Exception as Error:
            loggerError.error(str(Error) + " occurred at " + str(item.get('link')))
        finally:
            return item

If any error occurs during insertion, or during date processing, or due to invalid data (checks for which are either invisible or not implemented), tracing back what caused the error is difficult! Also, the database shouldn't be queried if data is invalid or an error occurs! The connection can be used to process only valid entries!
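As a sketch of what I mean (the field names come from the snippet above, the class name is hypothetical), a small validation pipeline could reject bad items with DropItem() before the database is ever touched:

from scrapy.exceptions import DropItem

class ValidationPipeline(object):
    # Hypothetical pipeline: refuse to pass incomplete or invalid items down the chain.
    required_fields = ('title', 'content', 'link', 'newsDate', 'source')

    def process_item(self, item, spider):
        for field in self.required_fields:
            if not item.get(field) or item.get(field) == 'Error':
                raise DropItem("Missing or invalid %s for %s" % (field, item.get('link')))
        return item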

atb00ker commented 6 years ago

The insertion operation in pipelines is being done in an inappropriate manner! Pipelines are supposed to process, filter, store, or drop items within themselves or using external functions. The current implementation hands this complete job to the database class, thereby taking away the most essential features of pipelines, viz. DropItem(), close_spider(), etc.

Consider usage of the following class:

class ScrapenewsPipeline(object):
    # This class serves the purpose of cleaning the items received from spider,
    # Calling the instance of spider's postgresSQL class and calling functions
    # to create and destroy connection as and when needed.

    def open_spider(self, spider):
        # Called when a spider starts
        spider.postgres = postgresSQL()
        spider.postgres.openConnection(spider)

    def close_spider(self, spider):
        # Called when a spider closes. It can't be used with logging enabled,
        # because it is called just before the spider is closed and the logging
        # table needs the close reason, which is generated at close.
        pass

    def process_item(self, item, spider):
        spider.postgres.insertIntoNewsTable(item)

Send data to spider.postgres.insertIntoNewsTable(item) only after you are sure that the data is processed; currently nothing is implemented, because I do not see the need for any of those functions. :)

Note: spider.postgres.insertIntoNewsTable(item) is ideally supposed to enter data into PostgreSQL only. Any processing shall take place in process_item(self, item, spider); hence, none of the features are sabotaged. :) Cheers.
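As a rough sketch of that split (assuming the existing postgresSQL helper and the dateutil import; the exact error handling is illustrative), process_item could normalise the date itself and only then call the database class:

from dateutil import parser
from scrapy.exceptions import DropItem

class ScrapenewsPipeline(object):
    # open_spider / close_spider stay as shown above.

    def process_item(self, item, spider):
        # Do the processing here, so insertIntoNewsTable only has to insert.
        try:
            item['newsDate'] = str(parser.parse(item.get('newsDate'), ignoretz=False))
        except (ValueError, TypeError, OverflowError) as error:
            raise DropItem("Unparseable newsDate for %s: %s" % (item.get('link'), error))
        spider.postgres.insertIntoNewsTable(item)
        return item  # returned items continue on to any later pipelines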

Making a single pipeline do all the tasks isn't preferred because Scrapy queues many requests for pipelines.

The design is intentional, to save RAM. Please consider throttling if any of the spiders misbehaves! :)
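If throttling is the answer, Scrapy's built-in AutoThrottle extension can be switched on in settings.py (the values below are illustrative, not this project's configuration):

# settings.py -- illustrative AutoThrottle configuration
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5            # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60             # maximum delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per remote site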

If any error occurs during insertion, or during date processing, or due to invalid data (checks for which are either invisible or not implemented), tracing back what caused the error is difficult! Also, the database shouldn't be queried if data is invalid or an error occurs! The connection can be used to process only valid entries!

if item['title'] is not 'Error' and item['content'] is not 'Error' and item['newsDate'] is not 'Error': This line is intended to solve your issue; consider the code of the spiders for the same.

thisisayush commented 6 years ago

Any processing shall take place in process_item(self, item, spider)

I don't see any such processing happening in process_item, but yeah, I can see it happening in processedDate = str(parser.parse(item.get('newsDate'), ignoretz=False)) in the database class.

because I do not see the need for any of those functions. :)

No offense, but that's too much assumption! This is highly unreliable!

atb00ker commented 6 years ago

Any processing shall take place in process_item(self, item, spider)

I don't see any such processing happening in process_item, but yeah, I can see it happening in processedDate = str(parser.parse(item.get('newsDate'), ignoretz=False)) in the database class.

because I do not see the need for any of those functions. :)

You are correct about the incorrect implementation in this case; feel free to change it.

because I do not see the need for any of those functions. :)

No offense, but that's too much assumption! This is highly unreliable!

There are no assumptions here, do consider reading the code before posting questions. Cheers.

thisisayush commented 6 years ago

There are no assumptions here, do consider reading the code before posting questions. Cheers.

The extract posted in the question is a part of the code. So yes, I read the complete code before creating the issue. -.-

atb00ker commented 6 years ago

There are no assumptions here, do consider reading the code before posting questions.

The extract posted in the question is a part of the code. So yes, I read the complete code before creating the issue. "-.-"

Excuse me, I have failed to understand the "assumptions" part in that case. Do throw some light on it. :)

thisisayush commented 6 years ago

Excuse me, I have failed to understand the "assumptions" part in that case. Do throw some light on it. :)

The assumption that spiders will only send the correct and complete data. :)

atb00ker commented 6 years ago

Excuse me, I have failed to understand the "assumptions" part in that case. Do throw some light on it. :)

The assumption that spiders will only send the correct and complete data. :)

I think you've not read the code properly; please consider reading the spiders. Specifically, if item['title'] is not 'Error' and item['content'] is not 'Error' and item['newsDate'] is not 'Error': this line is used to make sure an incorrect data format is not sent to the pipelines file! If an incorrect date format is sent to the pipelines, pipelines[167]: except Exception as Error: is supposed to catch the error! :)

thisisayush commented 6 years ago

if item['title'] is not 'Error' and item['content'] is not 'Error' and item['newsDate'] is not 'Error':

Implementing this in every spider just reduces the chances of errors from that spider. Pipelines must be ready to make sure no unwanted info is stored in the database due to any single faulty spider or logic implementation. P.S. There are many developers designing their spiders, all using their own methods to minimise errors. You can't guarantee no errors. :) Or if you can, write it and give it to me, I'm good with it.

If an incorrect date format is sent to the pipelines, pipelines[167]: except Exception as Error: is supposed to catch the error! :)

And tell me how I will know whether it was the date or the database that caused the error. Oh wait, I just thought of implementing more complex date extraction and data processing before dumping into the database. Too bad, I'll have to modify the function and add so many lines of code to it that it'll not live up to its name.

atb00ker commented 6 years ago

if item['title'] is not 'Error' and item['content'] is not 'Error' and item['newsDate'] is not 'Error':

Implementing this in every spider just reduces the chances of errors from that spider. Pipelines must be ready to make sure no unwanted info is stored in the database due to any single faulty spider or logic implementation. P.S. There are many developers designing their spiders, all using their own methods to minimise errors. You can't guarantee no errors. :) Or if you can, write it and give it to me, I'm good with it.

Feel free to make a change to implement the re-check in the main pipelines class.

If an incorrect date format is sent to the pipelines, pipelines[167]: except Exception as Error: is supposed to catch the error! :)

And tell me how I will know whether it was the date or the database that caused the error.

You can know that by reading the log file as the error would be present in it! :)

Oh wait, I just thought of implementing more complex date extraction and data processing before dumping into the database. Too bad, I'll have to modify the function and add so many lines of code to it that it'll not live up to its name.

However, while I don't see how this is related to the ongoing discussion, the issue is a possibility; feel free to make a function in the main pipeline class to read and check the date format before passing it to the postgresSQL class. :)
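A possible shape for such a function (the name and error handling here are hypothetical; it reuses the dateutil parser the database class already uses) would return None so the caller can drop the item instead of querying PostgreSQL:

from dateutil import parser

def checkDateFormat(rawDate):
    # Return the normalised date string, or None if the value cannot be parsed,
    # so the caller can skip the database entirely for bad dates.
    try:
        return str(parser.parse(rawDate, ignoretz=False))
    except (ValueError, TypeError, OverflowError):
        return None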