pythonhacker / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler

Scheduling options in command-line #9

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Currently there is no way to schedule a crawl at a certain time of the day.
Implement this as an option on the command-line.

Original issue reported on code.google.com by abpil...@gmail.com on 25 Jun 2008 at 12:28

GoogleCodeExporter commented 9 years ago
I don't know if we want to build a scheduler. I was thinking that after I get my
config files, I would set up a cron job (on Linux) or a scheduled task (on
Windows) to start harvestman with -C config.xml as a parameter, and let it do
its job. That seems more natural to me, as I would prefer to let my operating
system run everything.

Original comment by szybal...@gmail.com on 25 Jun 2008 at 1:43

GoogleCodeExporter commented 9 years ago
I was thinking of something like a command line option which 
will accept a time parameter in many formats and then go to
the background only to wake up at the right time.

# Run 10 minutes from now
$ harvestman -C myconfig.xml --schedule +10m  
# Run an hour and 10 minutes from now
$ harvestman -C myconfig.xml --schedule +1:10
# Run at 19:00:00 hrs
$ harvestman -C myconfig.xml --schedule 19:00:00
# Run on Jun 30 at 10 am
$ harvestman -C myconfig.xml --schedule "2008-06-30T10:00:00"

For full date/time specification we need to accept only the ISO
format (yyyy-mm-ddTHH:MM:SS).
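
A minimal Python sketch of how the four formats above might be turned into a
start time. parse_schedule is a hypothetical helper, not existing HarvestMan
code, and treating a bare time of day that has already passed as "tomorrow" is
an assumption, not something decided in this thread.

```python
import re
from datetime import datetime, timedelta


def parse_schedule(spec, now=None):
    """Return the datetime at which the crawl should start.

    Hypothetical helper: accepts "+10m", "+H:MM", "HH:MM:SS" and the
    ISO form "yyyy-mm-ddTHH:MM:SS" shown in the examples above.
    """
    now = now or datetime.now()

    if spec.startswith("+"):
        body = spec[1:]
        # "+10m": a number of minutes from now
        m = re.fullmatch(r"(\d+)m", body)
        if m:
            return now + timedelta(minutes=int(m.group(1)))
        # "+1:10": hours and minutes from now
        m = re.fullmatch(r"(\d+):(\d{1,2})", body)
        if m:
            return now + timedelta(hours=int(m.group(1)),
                                   minutes=int(m.group(2)))
        raise ValueError("unrecognised relative time: %r" % spec)

    if "T" in spec:
        # Full date/time: ISO format only
        return datetime.strptime(spec, "%Y-%m-%dT%H:%M:%S")

    # Bare time of day: today, or tomorrow if already past (assumption)
    t = datetime.strptime(spec, "%H:%M:%S").time()
    target = datetime.combine(now.date(), t)
    if target <= now:
        target += timedelta(days=1)
    return target
```

Passing now explicitly keeps the function easy to test; the command-line code
would call it with the default.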

The only issue on the command-line might be that to go to background,
the harvestman process needs to fork another harvestman process and
send it to background to wake up at the right time and then die. 
The current process cannot go to background by itself :)
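
A rough sketch of the fork-and-sleep idea, assuming a POSIX system (os.fork is
not available on Windows). run_in_background is a hypothetical name, and job
stands in for whatever function actually starts the crawl.

```python
import os
import time


def run_in_background(delay_seconds, job):
    """Fork a child that sleeps, runs job() once, then exits.

    Returns the child's pid in the parent. POSIX only.
    """
    pid = os.fork()
    if pid > 0:
        # Parent: return immediately so the shell gets its prompt back
        return pid

    # Child: detach from the terminal session, wait, run, then die
    status = 0
    try:
        os.setsid()
        time.sleep(delay_seconds)
        job()
    except Exception:
        status = 1
    finally:
        # os._exit skips the parent's cleanup handlers in the child
        os._exit(status)
```

The sleeping child lives only in memory, so it is lost if the machine is
restarted before the scheduled time.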

If you think the command-line option is not required, we can at least
implement this internally as a config option, with a scheduler function
that runs projects at a specified date, time, or datetime. We can use
the datetime module to specify the time. This could then be exposed in
the GUI, if not on the command line.

Original comment by abpil...@gmail.com on 25 Jun 2008 at 1:57

GoogleCodeExporter commented 9 years ago
I wonder how hard it would be to generate a cron job file from the commands you
posted. One issue with the going-to-sleep approach is: what happens when you
restart the computer?
Go from
#harvestman -C myconfig.xml --schedule 19:00:00

to
cronjob.harvestman
0 19 * * * harvestman -C myconfig.xml

The OS scheduler has many options. You could run harvestman every 5 minutes,
every night at 23:00, etc. I don't know of a Python package that has all these
capabilities. If we build a good interface to cron jobs/scheduled tasks, that
seems to me to give us the most options as far as scheduling goes.
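
As a small illustration of the translation above, here is a sketch that turns
an HH:MM:SS time into a daily crontab line. daily_cron_line is a hypothetical
helper; cron has no seconds field, so the seconds are dropped.

```python
from datetime import datetime


def daily_cron_line(time_of_day, command):
    """Build a crontab entry that runs command every day at time_of_day.

    time_of_day is "HH:MM:SS"; the seconds are ignored because the
    standard crontab format only goes down to minutes.
    """
    t = datetime.strptime(time_of_day, "%H:%M:%S")
    # Fields: minute hour day-of-month month day-of-week command
    return "%d %d * * * %s" % (t.minute, t.hour, command)


print(daily_cron_line("19:00:00", "harvestman -C myconfig.xml"))
```

Note that this entry repeats every day, whereas --schedule 19:00:00 as
proposed would run once.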

Original comment by szybal...@gmail.com on 25 Jun 2008 at 2:10

GoogleCodeExporter commented 9 years ago
OK, I will mark this as WontFix, in this case meaning "look at this later".

Original comment by abpil...@gmail.com on 28 Jun 2008 at 2:28

GoogleCodeExporter commented 9 years ago
I found that there is a scheduler module in Python:
http://docs.python.org/lib/module-sched.html
and there is a scheduler in the TurboGears project:
http://docs.turbogears.org/1.0/Scheduler

These might be worth looking at in the future.
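
For reference, a minimal sketch of what using the standard sched module could
look like. run_at is a hypothetical wrapper, and job stands in for the function
that starts the crawl.

```python
import sched
import time
from datetime import datetime


def run_at(when, job):
    """Block until the datetime `when`, then call job() once."""
    scheduler = sched.scheduler(time.time, time.sleep)
    # enterabs takes an absolute time on the same clock as time.time
    scheduler.enterabs(when.timestamp(), 1, job)
    # run() sleeps until the event is due and returns once it has fired
    scheduler.run()
```

sched runs inside the calling process, so the process has to stay alive until
the scheduled time, and nothing is restored after a restart.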

Original comment by szybal...@gmail.com on 29 Jul 2008 at 1:01