my8100 / scrapydweb

Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI. DEMO :point_right: https://github.com/my8100/files
GNU General Public License v3.0

"max_instances" setting is not work #30

Closed luckyyezi closed 5 years ago

luckyyezi commented 5 years ago

When I set a schedule on a spider, I set "max_instances" to 1 and "coalesce" to "True", but it doesn't seem to work. After a moment, the spider has more than one instance running.

my8100 commented 5 years ago

Please provide the latest log in http://127.0.0.1:5000/schedule/history/

luckyyezi commented 5 years ago

##################################################
2019-03-18 18:30:21 ['10.8.5.40:6800']
curl http://10.8.5.40:6800/schedule.json -d project=helloworld -d _version=1552895608 -d spider=books -d jobid=2019-03-18T18_30_20
Update task #1 (test_schedule) successfully, next run at 2019-03-18 18:31:02+08:00.
kwargs for execute_task():
{
    "task_id": 1
}

task_data for scheduler.add_job():
{
    "coalesce": true,
    "day": "",
    "day_of_week": "",
    "end_date": null,
    "hour": "",
    "id": "1",
    "jitter": 0,
    "max_instances": 1,
    "minute": "",
    "misfire_grace_time": 600,
    "month": "",
    "name": "test_schedule",
    "second": "2",
    "start_date": null,
    "timezone": "Asia/Shanghai",
    "trigger": "cron",
    "week": "",
    "year": "*"
}
##################################################
2019-03-18 17:48:41
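
(For reference, the task_data above corresponds roughly to an APScheduler add_job() call like the sketch below. This is not scrapydweb's actual code; execute_task here is just a stand-in for whatever callable the timer fires.)

```python
# Rough illustration only: how the logged task_data maps onto APScheduler.
from apscheduler.schedulers.background import BackgroundScheduler

def execute_task(task_id):
    # Stand-in for the callable that the timer task fires.
    print(f"firing task {task_id}")

scheduler = BackgroundScheduler()
scheduler.add_job(
    execute_task,
    trigger='cron',
    kwargs={'task_id': 1},
    id='1',
    name='test_schedule',
    year='*', second='2',        # i.e. fire whenever the second equals 2
    timezone='Asia/Shanghai',
    coalesce=True,               # collapse missed runs into a single run
    max_instances=1,             # at most one concurrent run of execute_task
    misfire_grace_time=600,
    jitter=0,
)
scheduler.start()
```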

my8100 commented 5 years ago

Could you provide a screenshot of the Timer Tasks page with the parameters of task #1 displayed, as well as http://127.0.0.1:5000/1/tasks/1/?

luckyyezi commented 5 years ago

Like this?

luckyyezi commented 5 years ago

[screenshot]

my8100 commented 5 years ago

I meant a screenshot of the Timer Tasks page with the parameters of task #1 displayed. Btw, did you modify the source code of schedule.py before?

luckyyezi commented 5 years ago

Yes, with the parameters of task #1. And I just used pip to install scrapydweb, without any modification.

my8100 commented 5 years ago

Please show me something like this: [screenshot]

luckyyezi commented 5 years ago

[screenshot]

my8100 commented 5 years ago

The parameters of task #1 indicate that the task would be executed whenever the second is 2, and there is nothing wrong with the execution results! It's weird that the day and the minute are empty strings in the schedule history. How did you fill in these inputs when adding the task?

"day": "",
"minute": "",
luckyyezi commented 5 years ago

I used the default values of 'day' and 'minute'. The task is fired every minute, but I think that if the job fired in the previous minute is still running, it should not be fired again. Is that right?

my8100 commented 5 years ago

But the values of day and hour should be '*' in the history. Could you try to add another task without modifying the timer task parameters and show me the log again? Note that the scheduler of Timer Tasks knows nothing about the scraping jobs. Please check out the related links in the HELP section at the top of the Run Spider page: https://apscheduler.readthedocs.io/en/latest/userguide.html#limiting-the-number-of-concurrently-executing-instances-of-a-job
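
To see what max_instances actually controls on the APScheduler side, here is a minimal sketch (illustrative only, not scrapydweb code); the callable deliberately runs longer than the trigger interval:

```python
import time

from apscheduler.schedulers.blocking import BlockingScheduler

def slow_task():
    print("task started")
    time.sleep(90)   # still running when the next cron fire time arrives
    print("task finished")

scheduler = BlockingScheduler()
# With max_instances=1, APScheduler skips the run that would overlap the
# still-running slow_task() and logs a "maximum number of running instances
# reached" warning. It never looks at what Scrapyd is doing.
scheduler.add_job(slow_task, 'cron', minute='*', max_instances=1, coalesce=True)
scheduler.start()
```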

luckyyezi commented 5 years ago

[screenshots]

luckyyezi commented 5 years ago

Do I misunderstand the meaning of "job"? I mean, if the spider is fired once, it gets a job id like "task_1_2019-03-18T18_53_02"; then one minute later the spider is fired again and gets another job id, "task_1_2019-03-18T18_54_02". So these two are different jobs, not two instances of one job?

my8100 commented 5 years ago

In Scrapy and Scrapyd, scheduling a spider run results in a scraping job. In APScheduler, firing a task creates another kind of job instance. That job instance goes away as soon as the execution of the corresponding task is finished, regardless of whether the scraping job it scheduled has finished or not.
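
In other words, the timer side is roughly fire-and-forget. A simplified sketch (not the actual scrapydweb implementation; the Scrapyd address is just the one from the log above):

```python
import requests

def execute_task(task_id):
    # The APScheduler job instance only lives for the duration of this HTTP
    # request; the scraping job it starts may keep running on Scrapyd for
    # hours or days afterwards.
    resp = requests.post(
        'http://10.8.5.40:6800/schedule.json',
        data={'project': 'helloworld', 'spider': 'books'},
    )
    print(task_id, resp.json())  # returns as soon as Scrapyd has queued the job
```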

luckyyezi commented 5 years ago

I see, these are two different kinds of jobs. You are so nice to explain these details for me. One more question: in my situation, I have a spider which may crawl for several days, but I'm not sure how long it will take to finish. I want to schedule the spider so that once it finishes, it will be fired again automatically after one day or several days. Is there any solution for this situation?

my8100 commented 5 years ago

There are two solutions:

  1. Enable the Email Notice feature of ScrapydWeb and get notified when a scraping job is finished, then fire the task manually.
  2. Catch the spider_closed signal of Scrapy and make a request to http://127.0.0.1:5000/1/tasks/xhr/fire/1/ to fire task #1 automatically (see the sketch below).
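
A rough sketch of option 2 as a Scrapy extension (the class name is made up, the fire URL is the one above and depends on your node/task ids, and whether that endpoint expects GET or POST is an assumption on my part):

```python
import requests
from scrapy import signals

class FireTimerTaskOnClose:
    """Hypothetical extension: ask ScrapydWeb to fire a timer task when the crawl ends."""

    FIRE_URL = 'http://127.0.0.1:5000/1/tasks/xhr/fire/1/'

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        if reason == 'finished':
            # Assumed to be a GET endpoint; adjust the HTTP method if needed.
            requests.get(self.FIRE_URL, timeout=10)

# settings.py
# EXTENSIONS = {'myproject.extensions.FireTimerTaskOnClose': 500}
```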

payala commented 4 months ago

Hi, I am thinking about making a change related to this. I understand how the max_instances setting of APScheduler works, and it's clear to me that it has nothing to do with the state of a scraping job in scrapyd.

However, I do think it would be a useful feature to have a way to enforce that only a single scraping job of a given type runs at a time.

Additionally, scrapydweb exposes max_instances in the UI under "show more timer settings". It's not really clear to me why someone would want to change the max_instances setting the way it currently works.

What I thought is that it could make sense for the max_instances setting to actually ensure that no more than that number of scraping jobs are scheduled in scrapyd.

This way, if it is set to '1' and there is already a job running or pending when it comes time to schedule a new one, scrapydweb would skip scheduling yet another scraping job on scrapyd. Something like the sketch below.
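
The check could look something like this (my own assumption of how it might work, not existing scrapydweb code; it only relies on Scrapyd's listjobs.json and schedule.json endpoints):

```python
import requests

def should_schedule(scrapyd_url, project, spider, max_instances=1):
    # Count pending + running Scrapyd jobs for this spider and compare
    # against the configured limit.
    resp = requests.get(f'{scrapyd_url}/listjobs.json',
                        params={'project': project}, timeout=10)
    data = resp.json()
    active = [job for state in ('pending', 'running')
              for job in data.get(state, [])
              if job.get('spider') == spider]
    return len(active) < max_instances

if should_schedule('http://10.8.5.40:6800', 'helloworld', 'books', max_instances=1):
    requests.post('http://10.8.5.40:6800/schedule.json',
                  data={'project': 'helloworld', 'spider': 'books'})
else:
    print('skipping: max_instances scraping jobs already running or pending')
```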

I'm thinking about implementing this on my fork because it's useful for me, but I wanted to ask here whether this would also be interesting in general, and whether I should open a PR for this functionality if I decide to develop it.

my8100 commented 4 months ago

I've explained the meaning of max_instances in the previous comment. Scrapydweb exposes these APScheduler parameters for advanced control.
How would you determine whether a job should be skipped when the same job is already running with different parameters?

payala commented 4 months ago

Yes, indeed, your explanation is clear, and it's also clear in the APScheduler docs.

Before answering your question, I have one of my own, which might make what I want to do obsolete:

Can you describe a case where someone would want to set the max_instances to anything other than '1' with the current functionality?

my8100 commented 4 months ago

It's up to the user, not something I should think about.

payala commented 4 months ago

Alright, then I think this will probably not be something interesting to have in general.

The way I see it, this setting is currently very unintuitive and tied to the internal workings of APScheduler. As a user, the intuitive interpretation (from my subjective PoV, of course) is that it defines the maximum number of scraping instances, not the number of APScheduler jobs that schedule scrapyd jobs.

So, what I plan to do is make this setting actually define that: when scheduling a new job, scrapydweb will first poll to see how many jobs are running or pending for the task and enforce the limit based on that count.

I will keep it in my fork; please let me know if you think it would be worth opening a PR for this, and I'll be happy to do so.