Closed · thisisayush closed this issue 6 years ago
I feel this is not necessary for my spiders. Reason: (1) my spiders scrape only one-four page(s) and do so in seconds, and the scheduling interval is one hour, so it is unnecessary to move away from the current (more reliable, in my case) system. Question(s): Would the changes affect other spiders as well? Does the new system need to download more new files? Would your new system clash with the running crontab jobs?
My spiders scrape only one-four page(s)
This isn't the best practice; the spider should scrape all pages and avoid duplicating links. Say your spider missed a schedule, or the server was under maintenance at the time your spider was supposed to run: does that mean the one-four pages from that window should simply be ignored?
Would the changes affect other spiders as well?
Scrapyd is compatible with existing spiders and does not require any changes to the code.
Does the new system need to download more new files?
It requires three libraries, Scrapyd, Schedule & scrapyd-client, and their dependencies.
Would your new system clash with the running crontab jobs?
It runs standalone in a tmux session, has no interaction with crontab jobs, and can run independently in its own Python environment.
There is one more problem with the current approach: if any spider throws an error and quits, the rest are affected. With Scrapyd, all spiders run in parallel, independent of each other, which reduces overall time and makes the system more reliable.
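As a rough illustration of that point (not the actual ScrapeNews setup): with Scrapyd each spider is queued as its own job and runs in its own process, so one crashing job does not take the others down. The daemon address, project name, and spider names below are placeholder assumptions.

```python
import requests

SCRAPYD = "http://localhost:6800"  # assumed local Scrapyd daemon
PROJECT = "scrapenews"             # placeholder project name

# Queue each spider as an independent Scrapyd job (one process per job).
for spider in ["news18", "zeenews", "indianexpress"]:  # placeholder spider names
    requests.post(f"{SCRAPYD}/schedule.json",
                  data={"project": PROJECT, "spider": spider})

# listjobs.json reports pending/running/finished jobs per project;
# a crash in one spider only moves that job to "finished" with its own log,
# while the other jobs keep running.
jobs = requests.get(f"{SCRAPYD}/listjobs.json",
                    params={"project": PROJECT}).json()
print(len(jobs["running"]), "spiders currently running in parallel")
```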
My spiders scrape only one-four page(s).
This isn't the best practice.
We have discussed this at length; I still strongly disagree.
Would your new system clash with the running crontab jobs?
It runs standalone in a tmux session, has no interaction with crontab jobs, and can run independently in its own Python environment.
In this case, feel free to make your changes: I do not wish to port my spiders, so leave the documentation of the current system in place and add your documentation alongside it!
I feel this is not necessary for my spiders.
If any spider throws an error and quits, the rest are affected. With Scrapyd, all spiders run in parallel, independent of each other, which reduces overall time and makes the system more reliable.
That only happens when a Python syntax error is detected, for example:
- Mixing spaces and tabs in one Python file.
- Using ":" instead of "=" when the intention is to equate the LHS and RHS.
However, we do not know such bad programmers, do we? (Ahem) Still, in the interest of investigation, I propose we "pretend" to be a bad programmer: use tabs and spaces in one file and run a parallel spider. If it still runs, using Scrapyd might be useful, because, face it, too many (Ahem) coders have permission to merge pull requests here; who knows, there might be a bad coder amidst them.
We have discussed this at length; I still strongly disagree.
That doesn't change the facts, though. However, if we gather enough scraping logs and data, we can surely find a pattern and settle on the best method.
In this case, feel free to make your changes: I do not wish to port my spiders, so leave the documentation of the current system in place and add your documentation alongside it!
I am not currently modifying ScrapeNews because I disagree with most things done there. No offense meant. ScrapeNews is currently stable, while what I am working on is still under testing and not ready for deployment.
That only happens when a Python syntax error is detected, for example:
I would just post an error log for this:
2017-11-30 13:09:47,007 [ERROR ] news18spider.py Line 102 : news18 news18.spiders.news18spider Error Extracting Data for URL http://www.news18.com/news/movies/baadshaho-first-looks-of-esha-gupta-sanjay-mishras-characters-are-out-1436549.html
This error was logged because, in one of the test runs, I found some recurring error patterns in the XPath extraction, and a try-except was applied explicitly to detect them and avoid the code blowing up ("code phat gaya").
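A minimal sketch of that pattern, with placeholder spider, XPath, and field names rather than the real news18spider.py code:

```python
import logging

import scrapy

logger = logging.getLogger(__name__)


class News18Spider(scrapy.Spider):
    # Placeholder spider, not the actual ScrapeNews news18spider.py.
    name = "news18"
    start_urls = ["http://www.news18.com/news/"]

    def parse(self, response):
        for article in response.xpath("//article"):
            try:
                # Wrap the XPath extraction so one malformed article
                # is logged and skipped instead of crashing the crawl.
                yield {
                    "title": article.xpath(".//h2/a/text()").get(),
                    "link": article.xpath(".//h2/a/@href").get(),
                }
            except Exception:
                logger.error("Error Extracting Data for URL %s", response.url)
```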
However, you can't always predict what error will occur at runtime.
I propose we "pretend" to be a bad programmer: use tabs and spaces in one file and run a parallel spider
Oh boy, I already write code assuming the worst ways it can fail. As for running spiders in parallel, I tried it with 6 spiders; they were auto-scheduled and ran successfully.
too many (Ahem) coders have permission to merge pull requests here; who knows, there might be a bad coder amidst them.
And that's why your pipelines and framework must be robust enough to handle all types of Exceptions :-)
Your opinion has been noted. I do not wish to port my spiders to scrapyd.
You don't need to port anything; a single command, scrapyd-deploy, does it for you :-)
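For context, deployment only needs a deploy target in the project's scrapy.cfg; the url and project values below are placeholder assumptions, not the actual ScrapeNews configuration:

```
# scrapy.cfg (values are assumptions; adjust url/project for the real setup)
[settings]
default = scrapenews.settings

[deploy]
url = http://localhost:6800/
project = scrapenews
```

Running scrapyd-deploy from the project directory then packages the spiders into an egg and uploads it to the Scrapyd daemon, with no changes to the spider code.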
Anyway, the choice is yours 😀
I strongly disagree with the usage of Scrapyd. @parthsharma2 & @vipulgupta2048, please form an opinion on this and close the issue.
The current spider scheduling allows only one spider to run at a time, at a fixed interval, on the assumption that the spider has finished before the next schedule fires. This is inefficient and can result in unexpected failures. Cron jobs are good, but they cannot be trusted in this scenario.
Scrapyd provides an easy API for scheduling and running spiders concurrently. Schedule, on the other hand, allows a Python method to be executed repeatedly at a predefined interval. Combining the two with some intelligent code, the scheduling can be optimised and automated (for reference, see thisisayush/scrape).
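For a concrete picture of the proposal, here is a minimal sketch under assumed names (not the code in thisisayush/scrape): the Schedule library re-queues every spider at a fixed interval, and Scrapyd runs the queued jobs concurrently.

```python
import time

import requests
import schedule

SCRAPYD = "http://localhost:6800"                 # assumed local Scrapyd daemon
PROJECT = "scrapenews"                            # placeholder project name
SPIDERS = ["news18", "zeenews", "indianexpress"]  # placeholder spider names


def queue_all_spiders():
    """Ask Scrapyd to queue every spider; Scrapyd runs them concurrently."""
    for spider in SPIDERS:
        resp = requests.post(f"{SCRAPYD}/schedule.json",
                             data={"project": PROJECT, "spider": spider})
        print(spider, resp.json().get("status"))


# Re-queue the spiders every hour; a slow or crashed spider never
# blocks the next run because Scrapyd isolates each job.
schedule.every(1).hours.do(queue_all_spiders)

while True:
    schedule.run_pending()
    time.sleep(60)
```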