hep-gc / cloud-scheduler

Automatically boot VMs for your HTC jobs
http://cloudscheduler.org
Apache License 2.0

Decay in Job Poller execution interval #188

Closed SharonGoliath closed 12 years ago

SharonGoliath commented 13 years ago

When CS starts up on the CADC bifrost installation, the interval between Job Poller cycles is around 30s. This interval then gradually lengthens the longer CS runs.

Identified intervals:

CS immediately after restart - interval ~34s:

2011-05-16 14:19:16,705 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-16 14:19:50,376 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-16 14:20:24,133 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-16 14:20:57,885 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-16 14:21:32,236 - DEBUG - JobPoller - Job Poller waiting 30s...

CS running for 16 hours - interval 16 minutes:

2011-05-17 10:25:48,824 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-17 10:41:43,683 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-17 10:58:43,713 - DEBUG - JobPoller - Job Poller waiting 30s...

CS running for 6 days - interval 3.5 hrs:

2011-05-16 05:15:43,822 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-16 08:46:17,463 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-16 12:12:33,402 - DEBUG - JobPoller - Job Poller waiting 30s...

CS running for n days - interval up to 5 hrs:

2011-04-27 05:09:24,791 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-04-27 09:38:34,886 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-04-27 14:10:34,543 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-04-27 18:53:10,534 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-04-27 23:57:13,673 - DEBUG - JobPoller - Job Poller waiting 30s...

CS running for n days + 5 - interval up to 7 hrs:

2011-05-03 07:20:58,086 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-03 14:38:49,597 - DEBUG - JobPoller - Job Poller waiting 30s...

condor_retrieval_method is set to "local"

Characteristics of bifrost (condor central manager):

Random load check during the 5 hr delay event:

top - 14:22:07 up 106 days,  8:03,  4 users,  load average: 2.84, 2.39, 2.34
Tasks: 162 total,   1 running, 161 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.0%us,  1.3%sy,  0.0%ni, 96.4%id,  0.0%wa,  0.0%hi,  0.0%si,  1.3%st
Mem:  16029840k total,  8311968k used,  7717872k free,   826184k buffers
Swap:  9437176k total,        0k used,  9437176k free,  6043568k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28149 canfradm  15   0  450m 214m 4552 S 17.3  1.4   2913:35 python
9508 condor    15   0 52980  25m 2796 S  0.3  0.2  45:25.48 condor_collecto
11859 canfradm  15   0 12744 1140  808 R  0.3  0.0   0:00.01 top
   1 root      15   0 10352  660  560 S  0.0  0.0   2:19.69 init
   2 root      RT  -5     0    0    0 S  0.0  0.0   0:30.33 migration/0
   3 root      34  19     0    0    0 S  0.0  0.0   0:02.82 ksoftirqd/0
   4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.03 watchdog/0
   5 root      RT  -5     0    0    0 S  0.0  0.0   0:08.82 migration/1
   6 root      34  19     0    0    0 S  0.0  0.0   0:01.22 ksoftirqd/1
   7 root      RT  -5     0    0    0 S  0.0  0.0   0:00.19 watchdog/1
   8 root      RT  -5     0    0    0 S  0.0  0.0   0:05.57 migration/2

Processor information:

[canfradm@bifrost log]$ cat /proc/cpuinfo | grep processor | wc -l
8

Here's the info for processor 0:

[canfradm@bifrost log]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz
stepping        : 10
cpu MHz         : 3158.768
cache size      : 6144 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu tsc msr pae cx8 apic mtrr cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc pni est ssse3 cx16 sse4_1 lahf_lm
bogomips        : 7899.45
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

igable commented 13 years ago

Sharon, what's the canfradm process?

SharonGoliath commented 13 years ago

The python one is Cloud Scheduler. canfradm is the user that runs CS.

igable wrote:

Sharon, what's the canfradm process?

mhpx commented 13 years ago

Ran two tests here: one with Python 2.6, the other with Python 2.7.

Python 2.6, right when CS starts up with ~2300 jobs in the queue:

2011-05-16 15:25:16,163 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 15:25:25,490 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 15:25:34,953 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 15:25:45,413 - DEBUG - JobPoller - Job Poller waiting 5s...

... (a few hours later)

2011-05-16 23:06:46,820 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 23:06:55,158 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 23:07:03,867 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 23:07:12,397 - DEBUG - JobPoller - Job Poller waiting 5s...

... (fewer jobs in queue now)

2011-05-17 05:38:31,162 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 05:38:37,885 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 05:38:44,872 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 05:38:51,865 - DEBUG - JobPoller - Job Poller waiting 5s...

Then again with Python 2.7. At startup, no jobs:

2011-05-17 11:49:23,499 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 11:49:28,699 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 11:49:33,891 - DEBUG - JobPoller - Job Poller waiting 5s...

... (~2300 jobs submitted a few minutes later)

2011-05-17 11:53:07,167 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 11:53:16,546 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 11:53:25,887 - DEBUG - JobPoller - Job Poller waiting 5s...

... (after several hours)

2011-05-17 20:25:48,892 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 20:26:00,970 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 20:26:13,419 - DEBUG - JobPoller - Job Poller waiting 5s...

...

2011-05-18 02:12:12,322 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 02:12:23,009 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 02:12:33,390 - DEBUG - JobPoller - Job Poller waiting 5s...

...

2011-05-18 08:55:12,911 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 08:55:22,444 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 08:55:31,962 - DEBUG - JobPoller - Job Poller waiting 5s...

In both cases, with ~2200 jobs, CS takes ~5 seconds to run through the entire JobPoller cycle even after ~22 hours. I could not see any slowdown on my workstation, whereas bifrost was showing a large delay after only 16 hours. Still trying to do more tests.

igable commented 13 years ago

Can we get the output of rpm -qa? Maybe attach it as a file rather than raw output pasted into the comment.

mhpx commented 13 years ago

RPM configuration for bifrost from John: https://gist.github.com/979231

Here's the RPM configuration for Mike's test machine: https://gist.github.com/979258

SharonGoliath commented 13 years ago

Here's recent output from the Cloud Scheduler running on a 32-bit host (canfar.dao.nrc.ca):

2011-05-18 11:46:42,078 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:46:47,676 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:46:53,290 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:46:58,880 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:04,792 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:11,902 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:19,959 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:28,875 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:35,407 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:41,230 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:46,953 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:52,545 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:59,241 - DEBUG - JobPoller - Job Poller waiting 5s...

This installation was used for the workshop last week, so has never had high numbers of queued jobs, and I'm reluctant to test it with spurious jobs, since it's the closest thing we have to "operational".

Sharon

mhpx commented 13 years ago

Diff of the two sets of RPMs: https://gist.github.com/989254

Looks like bifrost is still on SL 5.4, so the RPM diff isn't very helpful.

igable commented 13 years ago

Sharon, can you update the machine to SL 5.5? This RPM comparison is not going to be meaningful until that happens.

mhpx commented 13 years ago

Checked the last cloudscheduler log on the old canfarpool system. Got a log snippet of it running from March 14 at 7am until March 16 at 3:30pm. It started with 11789 jobs and ended with 8036 jobs. Over those ~2 days, initial parsing times were ~15 seconds, decreasing to ~11 seconds as the total number of jobs decreased. So no performance loss seen there.

SharonGoliath commented 13 years ago

Here's some sample output after the bifrost upgrade to SL 5.5 - there are ~2k jobs in the queue, and none of them are joining the condor pool due to a firewall issue.

[canfradm@bifrost log]$ grep "Job Poller waiting" /var/log/cloudscheduler.log
2011-05-26 04:06:24,968 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 04:28:17,475 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 04:49:51,352 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 05:12:32,864 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 05:36:39,003 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 06:01:24,148 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 06:26:50,265 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 06:52:43,971 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 07:19:18,178 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 07:46:50,707 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 08:15:04,374 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 08:43:14,664 - DEBUG - JobPoller - Job Poller waiting 30s...

mhpx commented 13 years ago

Time intervals extracted for ease of reading: 0:21:53, 0:21:34, 0:22:41, 0:24:07, 0:24:45, 0:25:26, 0:25:53, 0:26:35, 0:27:32, 0:28:14. So we're definitely still seeing the decay.
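For reference, a minimal sketch of how these gaps can be pulled out of the grep output above; the log path and message text are taken from that comment, and the script itself is just an illustration, not part of CS:

# interval_check.py - rough sketch: extract gaps between consecutive
# "Job Poller waiting" messages from the cloudscheduler log.
import re
from datetime import datetime

POLLER_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*Job Poller waiting")

def poller_intervals(log_path="/var/log/cloudscheduler.log"):
    """Yield the time between consecutive 'Job Poller waiting' messages."""
    previous = None
    with open(log_path) as log:
        for line in log:
            match = POLLER_LINE.match(line)
            if not match:
                continue
            current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            if previous is not None:
                yield current - previous
            previous = current

if __name__ == "__main__":
    for gap in poller_intervals():
        print(gap)   # prints e.g. 0:21:53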

mhpx commented 13 years ago

Can we get the rpm -qa output from bifrost again now that it's updated?

SharonGoliath commented 13 years ago

vmrepo.phys.uvic.ca:/var/www/html/vms/bifrost_rpm_qa_20110526.out

mhpx wrote:

Can we get the rpm -qa output from bifrost again now that it's updated?

mhpx commented 13 years ago

The only thing that really stood out to me in the diff was the use of OpenJDK Java instead of the one from Sun... the workspace calls to Nimbus do use the JVM.

mhpx commented 13 years ago

The performance problem has been identified in the logging module: multiple threads are vying for use of the log, and since the log is thread-safe it uses locks to ensure this. CS is spending a ton of time waiting on locks to access the log. I'm going to work on a better way to deal with logging. In the meantime, lower the log level to INFO or WARNING and it should improve performance, since it won't be trying to log as much.
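For context, my reading of the standard library behaviour here (not CS-specific code): every logging.Handler carries its own lock, and Handler.handle() acquires it for each record that reaches it, so all the CS threads funnelling DEBUG records into the one FileHandler end up serializing on that lock. A call that is filtered out by level returns before any handler is touched, which is why dropping to INFO or WARNING helps so much. A rough, self-contained illustration:

# Sketch of why a lower log level sidesteps handler-lock contention.
# Illustration only - not Cloud Scheduler code.
import logging, threading, time

log = logging.getLogger("JobPoller")
log.addHandler(logging.FileHandler("/tmp/contention_demo.log"))

def worker(n_messages):
    for i in range(n_messages):
        # At DEBUG level every call reaches the FileHandler, which takes its
        # internal lock and writes to disk, so threads serialize here.
        # At WARNING level, isEnabledFor(DEBUG) is False and the call returns
        # almost immediately without touching any handler lock.
        log.debug("Job Poller processing job %d", i)

def timed_run(level, threads=4, n_messages=5000):
    log.setLevel(level)
    start = time.time()
    pool = [threading.Thread(target=worker, args=(n_messages,)) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return time.time() - start

if __name__ == "__main__":
    print("DEBUG logging:   %.2fs" % timed_run(logging.DEBUG))
    print("WARNING logging: %.2fs" % timed_run(logging.WARNING))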

igable commented 13 years ago

Good spotting.

Mike, I would say the best bet here is to advise that we can't do verbose logging for 100K jobs. The danger is that we try to improve on the Python standard logging and don't do a good enough job. Is there some drop-in replacement for Logger that would improve this?

mhpx commented 13 years ago

I've added some warning messages to cloud_scheduler.conf in the logging section, and also on startup if it's on verbose or debug logging, to be mindful of how many jobs are being submitted. I'm still looking for an alternative way to handle the multithreaded logging better; I tried a socketserver setup but it had similar problems.
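One alternative worth noting, purely as a sketch of the general idea and not existing CS code: push records onto an in-memory queue and let a single listener thread do the actual file I/O, so the per-record handler lock is only held for a quick queue put rather than a disk write. Python 3.2+ ships this pattern as logging.handlers.QueueHandler/QueueListener; on the 2.6/2.7 installs discussed in this issue it would have to be hand-rolled, roughly like this:

# Sketch of queue-based logging: workers enqueue records, one listener
# thread performs the slow file I/O. Illustration only, not CS code.
import logging
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

class QueueForwardingHandler(logging.Handler):
    """Cheap handler: emit() only puts the record on a queue."""
    def __init__(self, record_queue):
        logging.Handler.__init__(self)
        self.record_queue = record_queue

    def emit(self, record):
        self.record_queue.put(record)

def start_listener(record_queue, target_handler):
    """Drain the queue in a dedicated thread and hand records to the real handler."""
    def drain():
        while True:
            record = record_queue.get()
            if record is None:            # sentinel: shut down
                break
            target_handler.handle(record)
    thread = threading.Thread(target=drain)
    thread.daemon = True
    thread.start()
    return thread

if __name__ == "__main__":
    record_queue = queue.Queue()
    file_handler = logging.FileHandler("/tmp/cloudscheduler_queue_demo.log")
    listener = start_listener(record_queue, file_handler)

    log = logging.getLogger("JobPoller")
    log.setLevel(logging.DEBUG)
    log.addHandler(QueueForwardingHandler(record_queue))

    log.debug("Job Poller waiting 5s...")  # returns as soon as the put completes

    record_queue.put(None)                 # stop the listener
    listener.join()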

SharonGoliath commented 13 years ago

I tried to get a picture of the effect on performance of the change in logging level - CS is really quiet at the INFO level if no one is submitting new job types. At 6:56AM June 7, I submitted 100 jobs to a fully occupied cluster. The first logging messages in /var/log/cloudscheduler.log showed up at 9:07AM - here's a representative snippet:

2011-06-07 09:07:11,333 - INFO - Cleanup - Returning resources used by VM 3434 to Cluster hermes-westgrid
2011-06-07 09:07:12,011 - INFO - Cleanup - Destroyed vm 3434 on synnefo.westgrid.ca
2011-06-07 09:07:18,980 - INFO - Scheduler - Started vm 3568 on synnefo.westgrid.ca using image at http://vmrepo.phys.uvic.ca/vms/canfarbase_i386_seb_cadc.img.gz
2011-06-07 09:14:46,997 - INFO - Cleanup - Returning resources used by VM 3440 to Cluster hermes-westgrid
2011-06-07 09:14:47,781 - INFO - Cleanup - Destroyed vm 3440 on synnefo.westgrid.ca
2011-06-07 09:14:54,724 - INFO - Scheduler - Started vm 3569 on synnefo.westgrid.ca using image at http://vmrepo.phys.uvic.ca/vms/canfarbase_i386_seb_cadc.img.gz

...

The timestamp on the cloud_scheduler.pid file is June 3 11:01, so CS has been running for 5 days, and is showing about a 2 hour lag in response time, logging only at the INFO level.

There are currently 1623 jobs total in the condor queue.

Speaking of logging levels - John and I were talking about the necessary logging messages. Here's a set of suggestions for what to see in INFO-level log messages (a rough sketch of how these might look follows the list):

  1. Starting VM #### (type xxxxxxx) on cluster YYYY (already there, don't take it out)
  2. Destroying/Retiring VM #### (type xxxxxx) on cluster YYYY (already there, don't take it out) - what would be really good to add to the destroying message is the reason why it's being destroyed (end of lease, balancing, unused, etc)
  3. Error messages (e.g. "network 'private' is not currently available", host cert expired - not sure if they're there, haven't seen any since changing the logging levels).
  4. Recognition of a new VMType from condor_q output
  5. Recognition of absence of an existing VMType from condor_q output
  6. Configuration information on start up.
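Purely for illustration, the rough shape those messages might take; the variable names and the "reason" value below are hypothetical, not existing CS code:

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cloud_scheduler")

# Hypothetical values - the real CS would pull these from its own VM/cluster objects.
vm_id, vm_type, cluster, reason = "3568", "goliaths-test1", "hermes-westgrid", "end of lease"

log.info("Started VM %s (type %s) on cluster %s", vm_id, vm_type, cluster)
log.info("Retiring VM %s (type %s) on cluster %s - reason: %s", vm_id, vm_type, cluster, reason)
log.info("New vmtype '%s' seen in condor_q output", vm_type)
log.info("vmtype '%s' no longer present in condor_q output", vm_type)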

SharonGoliath commented 13 years ago

Another performance test:

  1. At 6:52 AM June 9, remove 100 jobs from queue as user 2 (removes a VM type from the condor job queue).
  2. At 6:55 AM, CS starts shutting down VMs of the type for user 2, and provisioning VMs of the type for user 1.

Is the slow response limited to starting VMs?

SharonGoliath commented 13 years ago

Added to the previous comment - the last of 86 VMs was replaced at this time:

2011-06-09 07:22:58,985 - INFO - Cleanup - Returning resources used by VM 1550 to Cluster cadc
2011-06-09 07:22:59,605 - INFO - Cleanup - Destroyed vm 1550 on iris.cadc.dao.nrc.ca
2011-06-09 07:23:02,246 - INFO - Scheduler - Started vm 1551 on iris.cadc.dao.nrc.ca using image at http://vmrepo.phys.uvic.ca/vms/gwyn-50000000000052243.img.gz

SharonGoliath commented 13 years ago

More performance timing:

I submitted a single job as User 2 at 17:30, June 9, to a system fully occupied by jobs from User 1. CS logged the first attempt to shut down the VMs of User 1 to make room for this job at 23:58. There are currently 1321 jobs in the queue. Here's the relevant snippet of the log file:

2011-06-09 07:59:24,814 - INFO - Cleanup - Destroyed vm 3997 on synnefo.westgrid.ca
2011-06-09 07:59:27,790 - INFO - Scheduler - Started vm 4056 on synnefo.westgrid.ca using image at http://vmrepo.phys.uvic.ca/vms/gwyn-50000000000052243.img.gz
2011-06-09 23:58:19,920 - INFO - Cleanup - Returning resources used by VM 3826 to Cluster hermes-westgrid
2011-06-09 23:58:20,921 - INFO - Cleanup - Destroyed vm 3826 on synnefo.westgrid.ca
2011-06-09 23:58:27,496 - INFO - Scheduler - Started vm 4057 on synnefo.westgrid.ca using image at https://www4.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/vospace/CADCAuthtest1/SGOTEST/vmRepositoryPrivate/fabbros-50000000000060478.img.gz

SharonGoliath commented 13 years ago

I duplicated the good performance on a job removal as well.

condor_rm at 6:44:47 AM, June 10; CS logged the shutdown of the VM at 6:44:58 AM. Here's the log:

2011-06-10 06:42:23,890 - INFO - Scheduler - Started vm 4153 on synnefo.westgrid.ca using image at https://www4.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/vospace/CADCAuthtest1/SGOTEST/vmRepositoryPrivate/fabbros-50000000000060478.img.gz
2011-06-10 06:44:58,634 - INFO - Cleanup - Returning resources used by VM 4153 to Cluster hermes-westgrid
2011-06-10 06:44:59,306 - INFO - Cleanup - Destroyed vm 4153 on synnefo.westgrid.ca
2011-06-10 06:45:02,689 - INFO - Scheduler - Started vm 4154 on synnefo.westgrid.ca using image at http://vmrepo.phys.uvic.ca/vms/gwyn-50000000000052243.img.gz

mhpx commented 13 years ago

Interesting that with the logging turned down, the cleanup and job poller threads are performing better while the scheduler is still suffering as much or more. I've made a separate issue for the INFO logging messages and will put in those changes shortly: https://github.com/hep-gc/cloud-scheduler/issues/191. I'm also still delving deeper into how the logging module works and exploring alternate handlers that may perform better in a threaded application.

SharonGoliath commented 13 years ago

Cloud Scheduler retires an existing VM when it recognizes a new VM Type, and waits for condor to gracefully shut down that job. Is that correct? Does it log the retirement? At what level?

When does CS first log an action when it recognizes a new VM Type? It's entirely possible that I'm confusing the logging timestamps with performance.

Should we try to get profiling of CS working on bifrost? Or would you rather concentrate on the changes to the info logging messages?

mhpx wrote:

Interesting that with the logging turned down, the cleanup and job poller threads are performing better while the scheduler is still suffering as much or more. I've made a separate issue for the INFO logging messages and will put in those changes shortly. I'm also still delving deeper into how the logging module works and exploring alternate handlers that may perform better in a threaded application.

mhpx commented 13 years ago

It logs when it first executes the condor_off (retire) command, and then again when the VM is being shut down; both of those are debug messages. There is no direct message that says "this vmtype is new"; the closest is in the verbose logging of the JobPoller: the new types get added when it adds new jobs to the queue.

Let me do the logging changes - I've also made a change to the scheduling algorithm to make it finish its cycle faster. Once those changes are in place and installed on bifrost, it may help to get profiling output working.

SharonGoliath commented 13 years ago

This morning's test (June 16) - CS has been running for 22 hours:

Job submission at 6:31:40 AM; here's the job recognition by CS from the log at 6:32 AM:

2011-06-16 06:32:07,550 - INFO - JobPoller - goliaths-test1 vmtype added to required types
2011-06-16 06:32:12,964 - WARNING - Scheduler - Error creating VM goliaths-test1:

That looks pretty promising.

However, every VM creation by CS is currently resulting in error messages like "Problem: Resource request denied: Error creating workspace(s): No resource is available for this request (based on memory).", so comparing performance is questionable.

SharonGoliath commented 13 years ago

June 20 test - CS has been running since June 15:

Job submission (from command line):

Submitting job(s).
1 job(s) submitted to cluster 1041.
Mon Jun 20 08:27:02 PDT 2011

Job recognition (from CS log):

2011-06-20 08:27:10,283 - INFO - JobPoller - goliaths-test1 vmtype added to required types

mhpx commented 13 years ago

Is that on bifrost or the other machine with CS? Because an 8-second response, give or take depending on where it is in the scheduling/polling intervals, seems fairly reasonable.

SharonGoliath commented 13 years ago

bifrost.

mhpx wrote:

Is that on bifrost or the other machine with CS? Because an 8-second response, give or take depending on where it is in the scheduling/polling intervals, seems fairly reasonable.

mhpx commented 12 years ago

Closing issue - this was related to the logging module bogging down the system. Some improvements were made to help performance.