Please provide the latest log in http://127.0.0.1:5000/schedule/history/
################################################## 2019-03-18 18:30:21
task_data for scheduler.add_job():
{
"coalesce": true,
"day": "",
"day_of_week": "",
"end_date": null,
"hour": "",
"id": "1",
"jitter": 0,
"max_instances": 1,
"minute": "",
"misfire_grace_time": 600,
"month": "",
"name": "test_schedule",
"second": "2",
"start_date": null,
"timezone": "Asia/Shanghai",
"trigger": "cron",
"week": "",
"year": "*"
}
################################################## 2019-03-18 17:48:41
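For reference, here is a minimal sketch of how a task_data payload like the one above feeds APScheduler's scheduler.add_job(): the trigger fields map one-to-one onto CronTrigger keyword arguments. The fire_task function is a placeholder, not scrapydweb's actual code, and the sketch uses '*' for day and minute, which (as discussed below) is what these fields should contain.

from apscheduler.schedulers.background import BackgroundScheduler

def fire_task():
    print("task fired")  # placeholder for whatever the task actually does

scheduler = BackgroundScheduler()
scheduler.add_job(
    fire_task,
    trigger="cron",
    id="1",
    name="test_schedule",
    year="*", month="*", day="*", week="*", day_of_week="*",
    hour="*", minute="*", second="2",
    start_date=None, end_date=None,
    timezone="Asia/Shanghai",
    jitter=0,
    coalesce=True,
    max_instances=1,
    misfire_grace_time=600,
)
scheduler.start()  # fires once per minute, whenever the second is 2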
Could you provide the screenshot of the Timer Tasks page with the parameters of task #1 displayed, as well as http://127.0.0.1.com:5000/1/tasks/1/
Like this?
The screenshot of the Timer Tasks page with the parameters of task #1 displayed? btw, did you modify the source code of schedule.py before?
Yes, with the parameters of task #1. And I just used pip to install scrapydweb without any modification.
Please show me something like this:
The parameters of task #1 indicate that the task would be executed when the second is 2, and there is nothing wrong with the execution results! It's weird that the day and the minute are empty strings in the schedule history. How did you fill in these inputs when adding the task?
"day": "",
"minute": "",
I used the default values of 'day' and 'minute'. The task will be fired every minute, but I think that if the job fired last minute is still running, it should not be fired again. Is that right?
But the values of day and minute should be '*' in the history; could you try to add another task without modifying the parameters of the timer task and show me the log again? Note that the scheduler of Timer Tasks knows nothing about the scraping jobs. Please check out the related links in the HELP section at the top of the Run Spider page: https://apscheduler.readthedocs.io/en/latest/userguide.html#limiting-the-number-of-concurrently-executing-instances-of-a-job
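As a sanity check of what that APScheduler link describes, here is a small standalone sketch (the 90-second sleep is just a stand-in for a slow task): with max_instances=1, a firing that comes due while the previous execution is still running is skipped with a warning rather than run concurrently.

import time
from apscheduler.schedulers.blocking import BlockingScheduler

def slow_task():
    time.sleep(90)  # longer than the one-minute cron interval

scheduler = BlockingScheduler()
# fires every minute at second 2; while slow_task is still running, the next
# firing is skipped and a "maximum number of running instances reached"
# warning is logged
scheduler.add_job(slow_task, "cron", second="2", max_instances=1)
scheduler.start()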
Do I misunderstand the meaning of "job"? I mean, if one spider is fired, it gets a job id like "task_1_2019-03-18T18_53_02"; then one minute later, the spider is fired again and gets another job id, "task_1_2019-03-18T18_54_02". So these two are different jobs, not two instances of one job?
In Scrapy and Scrapyd, scheduling a spider run results in a scraping job. In APScheduler, firing a task creates another kind of job instance, and that job instance goes away when the execution of the corresponding task is finished, regardless of whether the scheduled scraping job has finished or not.
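To make the distinction concrete, here is a hypothetical sketch of what a timer task's function might do, assuming a Scrapyd server at localhost:6800 and placeholder project/spider names: the APScheduler job instance lives only for the duration of the function call, while the scraping job it schedules keeps running on Scrapyd.

import requests
from apscheduler.schedulers.background import BackgroundScheduler

def fire_scraping_job():
    # schedule.json is Scrapyd's API endpoint for starting a scraping job
    resp = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "myproject", "spider": "myspider"},
    )
    # the APScheduler job instance ends when this function returns, even
    # though the scraping job identified by this jobid may run for hours
    print(resp.json().get("jobid"))

scheduler = BackgroundScheduler()
scheduler.add_job(fire_scraping_job, "cron", minute="*")  # one short-lived job instance per firing
scheduler.start()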
I see, these are two different kinds of jobs. You are so nice to explain these details for me. One more question: in my situation, I have a spider which may crawl for several days, but I am not sure how long it will take to finish. I want to schedule the spider so that once it finishes, it will be fired again automatically after one day or several days. Is there any solution for this situation?
There are two solutions:
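Purely as an illustration of one pattern that fits this requirement (it assumes Scrapyd's listjobs.json API, which reports pending, running, and finished jobs along with end times; the URL, project/spider names, waiting period, and timestamp format are placeholders): run a frequent timer task that polls Scrapyd and only fires the spider when the previous run finished long enough ago.

from datetime import datetime, timedelta
import requests

SCRAPYD_URL = "http://localhost:6800"  # placeholder
WAIT = timedelta(days=1)               # placeholder delay between runs

def maybe_fire(project="myproject", spider="myspider"):
    jobs = requests.get(f"{SCRAPYD_URL}/listjobs.json",
                        params={"project": project}).json()
    # do nothing while a run is still pending or in progress
    for state in ("pending", "running"):
        if any(j.get("spider") == spider for j in jobs.get(state, [])):
            return
    finished = [j for j in jobs.get("finished", []) if j.get("spider") == spider]
    if finished:
        # assumes Scrapyd's "2019-03-18 10:24:03.594664" timestamp format
        last_end = max(datetime.strptime(j["end_time"], "%Y-%m-%d %H:%M:%S.%f")
                       for j in finished)
        if datetime.now() - last_end < WAIT:
            return  # the previous run finished too recently
    requests.post(f"{SCRAPYD_URL}/schedule.json",
                  data={"project": project, "spider": spider})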
[UPDATE] misfire_grace_time, coalesce, and max_instances make sense when APScheduler is too busy, or when ScrapydWeb is restarted, or when you add a task to fire every second while it takes more than one second to finish the task execution.
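A small self-contained sketch of that scenario (pause() here just simulates the scheduler being too busy to process jobs): several 2-second runs are missed while the scheduler is paused; a generous misfire_grace_time keeps them from being discarded, and coalesce=True rolls them into a single run on resume.

import time
from apscheduler.schedulers.background import BackgroundScheduler

def tick():
    print("tick", time.strftime("%H:%M:%S"))

scheduler = BackgroundScheduler()
scheduler.add_job(tick, "interval", seconds=2,
                  coalesce=True, misfire_grace_time=600)
scheduler.start()
time.sleep(5)
scheduler.pause()   # simulate APScheduler being too busy
time.sleep(10)      # several runs come due and are missed here
scheduler.resume()  # the missed runs are coalesced into a single run
time.sleep(5)
scheduler.shutdown()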
Hi, I am thinking about doing a change that is related to this. I understand how the max_instances setting of APScheduler works, and it's clear to me that this has nothing to do with the state of a scraping job in scrapyd.
However, I do think it is a useful feature to have a way to enforce that only a single scraping job of a given type runs at a time.
Additionally, scrapydweb exposes max_instances in the UI under "show more timer settings". It's not really clear to me why someone would want to change the max_instances setting in the way it currently works.
What I thought is that it could make sense to enforce that the max_instances setting actually ensures that no more than that number of scraping jobs are scheduled in scrapyd.
This way, if it is set to '1', when the time comes to schedule a new job and there is already one running or pending, scrapydweb would skip scheduling yet another scraping job on scrapyd.
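A rough sketch of what that check could look like, written against Scrapyd's listjobs.json and schedule.json APIs (the URL and the schedule_if_below_limit helper are hypothetical, not existing scrapydweb code):

import requests

SCRAPYD_URL = "http://localhost:6800"  # placeholder

def schedule_if_below_limit(project, spider, max_instances=1):
    jobs = requests.get(f"{SCRAPYD_URL}/listjobs.json",
                        params={"project": project}).json()
    # count scraping jobs for this spider that are already pending or running
    active = [j for state in ("pending", "running")
              for j in jobs.get(state, []) if j.get("spider") == spider]
    if len(active) >= max_instances:
        return None  # skip: the limit is already reached on scrapyd
    resp = requests.post(f"{SCRAPYD_URL}/schedule.json",
                         data={"project": project, "spider": spider})
    return resp.json().get("jobid")

Note that this sketch keys the limit on the spider name alone; a real implementation would have to decide how to treat runs of the same spider with different parameters, which is exactly the question raised in the reply below.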
I'm thinking about implementing this on my fork because it's useful for me, but I wanted to ask here to see whether this would also be interesting in general, in which case I could open a PR for this functionality if I decide to develop it.
I’ve explained the meaning of max_instances in the previous comment.
Scrapydweb exposes these apscheduler parameters for advanced control.
How would you determine whether a job should be skipped when the same job is already running with different parameters?
Yes, indeed, your explanation is clear, and it's clear in the APScheduler docs too.
Before I answer your question, I have one of my own, which might make what I want to do obsolete:
Can you describe a case where someone would want to set the max_instances to anything other than '1' with the current functionality?
It's up to the user, not something I should think about.
Alright, then I think this will probably not be something interesting to have in general.
The way I see it, this setting is currently very unintuitive and tied to the internal workings of APScheduler. As a user, the intuitive interpretation (from my subjective PoV, of course) is that this setting would define the maximum number of scraping instances, not the number of APScheduler jobs that schedule scrapyd jobs.
So, what I plan on doing is to make this setting define exactly that: when scheduling a new job, scrapydweb will first poll to see how many jobs are running or pending for this task, and enforce the limit based on that count.
I will keep it in my fork; please let me know if you think a PR for this would be interesting, and I'll be happy to open one.
When I set a schedule on a spider, I set "max_instances" to 1 and "coalesce" to "True", but it doesn't seem to work: after a while, more than one instance of the spider is running.