Closed SharonGoliath closed 12 years ago
Sharon, what's the canfradm process?
The python one is Cloud Scheduler. canfradm is the user that runs CS.
igable wrote:
Sharon what's the
canfradm
process.
Ran two tests here: one with Python 2.6, the other with Python 2.7.

Python 2.6, right when CS starts up with ~2300 jobs in the queue:

2011-05-16 15:25:16,163 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 15:25:25,490 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 15:25:34,953 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 15:25:45,413 - DEBUG - JobPoller - Job Poller waiting 5s...
... ( a few hours later )
2011-05-16 23:06:46,820 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 23:06:55,158 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 23:07:03,867 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-16 23:07:12,397 - DEBUG - JobPoller - Job Poller waiting 5s...
... ( fewer jobs in queue now )
2011-05-17 05:38:31,162 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 05:38:37,885 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 05:38:44,872 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 05:38:51,865 - DEBUG - JobPoller - Job Poller waiting 5s...
Then again with Python 2.7. At startup, no jobs:

2011-05-17 11:49:23,499 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 11:49:28,699 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 11:49:33,891 - DEBUG - JobPoller - Job Poller waiting 5s...
... ( ~2300 jobs submitted a few minutes later )
2011-05-17 11:53:07,167 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 11:53:16,546 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 11:53:25,887 - DEBUG - JobPoller - Job Poller waiting 5s...
... ( after several hours )
2011-05-17 20:25:48,892 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 20:26:00,970 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-17 20:26:13,419 - DEBUG - JobPoller - Job Poller waiting 5s...
...
2011-05-18 02:12:12,322 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 02:12:23,009 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 02:12:33,390 - DEBUG - JobPoller - Job Poller waiting 5s...
...
2011-05-18 08:55:12,911 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 08:55:22,444 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 08:55:31,962 - DEBUG - JobPoller - Job Poller waiting 5s...
In both cases, with ~2200 jobs, CS takes ~5 seconds to run through the entire JobPoller cycle, even after ~22 hours. I could not see any slowdown on my workstation - bifrost was showing a large delay after only 16 hours. Still trying to do more tests.
Can we get the output of rpm -qa? Maybe attach it as a file rather than pasting the raw output into the comment.
Here's the RPM configuration for bifrost from John: https://gist.github.com/979231
And here's the RPM configuration for Mike's test machine: https://gist.github.com/979258
Here's recent output from the Cloud Scheduler running on a 32-bit host (canfar.dao.nrc.ca) here:
2011-05-18 11:46:42,078 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:46:47,676 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:46:53,290 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:46:58,880 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:04,792 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:11,902 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:19,959 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:28,875 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:35,407 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:41,230 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:46,953 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:52,545 - DEBUG - JobPoller - Job Poller waiting 5s...
2011-05-18 11:47:59,241 - DEBUG - JobPoller - Job Poller waiting 5s...
This installation was used for the workshop last week, so has never had high numbers of queued jobs, and I'm reluctant to test it with spurious jobs, since it's the closest thing we have to "operational".
Sharon
diff of the two sets of rpms https://gist.github.com/989254
Looks like bifrost is still on 5.4, so the rpm diff is not very helpful.
Sharon, can you update the machine to SL 5.5? This RPM comparison is not going to be meaningful until that happens.
Checked the last cloudscheduler log on the old canfarpool system. Got a log snippet of it running from March 14, 7am until March 16, 3:30pm. It started with 11789 jobs and ended with 8036 jobs. Over those ~2 days, the initial parsing times of ~15 seconds decreased to ~11 seconds as the total number of jobs decreased. So no performance loss seen there.
Here's some sample output after the bifrost upgrade to SL 5.5 - there are ~2k jobs in the queue, and none of them are joining the condor pool due to a firewall issue.
[canfradm@bifrost log]$ grep "Job Poller waiting" /var/log/cloudscheduler.log
2011-05-26 04:06:24,968 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 04:28:17,475 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 04:49:51,352 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 05:12:32,864 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 05:36:39,003 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 06:01:24,148 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 06:26:50,265 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 06:52:43,971 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 07:19:18,178 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 07:46:50,707 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 08:15:04,374 - DEBUG - JobPoller - Job Poller waiting 30s...
2011-05-26 08:43:14,664 - DEBUG - JobPoller - Job Poller waiting 30s...
Time intervals between entries, extracted for ease of reading:

0:21:53
0:21:34
0:22:41
0:24:07
0:24:45
0:25:26
0:25:53
0:26:35
0:27:32
0:28:14

So definitely still seeing the decay.
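For reference, the intervals above can be extracted mechanically from the log timestamps. This is a small sketch (not part of Cloud Scheduler) that parses the timestamp prefix of each "Job Poller waiting" line and prints the gap between successive entries:

```python
from datetime import datetime

def poll_intervals(lines):
    """Return the gaps between successive 'Job Poller waiting' entries."""
    fmt = "%Y-%m-%d %H:%M:%S,%f"
    # The timestamp is the first 23 characters of each log line.
    stamps = [datetime.strptime(l[:23], fmt)
              for l in lines if "Job Poller waiting" in l]
    return [b - a for a, b in zip(stamps, stamps[1:])]

log = [
    "2011-05-26 04:06:24,968 - DEBUG - JobPoller - Job Poller waiting 30s...",
    "2011-05-26 04:28:17,475 - DEBUG - JobPoller - Job Poller waiting 30s...",
    "2011-05-26 04:49:51,352 - DEBUG - JobPoller - Job Poller waiting 30s...",
]
for gap in poll_intervals(log):
    print(gap)
```

Running this over the full grep output reproduces the interval list above (to the second).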
Can we get the rpm -qa output from bifrost again now that it's updated?
vmrepo.phys.uvic.ca:/var/www/html/vms/bifrost_rpm_qa_20110526.out
mhpx wrote:
Can we get the rpm -qa output from bifrost again now that it's updated?
The only thing that really stood out to me in the diff was the use of the Java OpenJDK vs the one from Sun... the workspace calls to Nimbus do use the JVM.
The performance problem has been identified in the logging module: multiple threads are vying for use of the log. The log is thread safe and uses locks to ensure this, so CS is spending a ton of time waiting on locks to access the log. I'm going to work on a better way to deal with logging; in the meantime, lower the log level to INFO or WARNING and performance should improve, since it won't be trying to log as much.
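A note on why lowering the level helps: in Python's standard logging module, a call like logger.debug(...) checks the effective level before building the record or touching any handler lock, so suppressed messages are nearly free. A minimal sketch (the logger name and messages are illustrative, not Cloud Scheduler's actual ones):

```python
import io
import logging

# Capture output in a buffer so we can see what is actually emitted.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
log = logging.getLogger("cloudscheduler.demo")
log.addHandler(handler)
log.setLevel(logging.INFO)   # anything below INFO is discarded

# At INFO level this call returns almost immediately: the level check
# happens before the record is built or the handler lock is acquired.
log.debug("per-job state dump for %d jobs", 2300)

log.info("scheduler cycle complete")

print(repr(stream.getvalue()))  # only the INFO message was written
```

For genuinely expensive message construction, the check can also be made explicit with log.isEnabledFor(logging.DEBUG).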
Good spotting.
Mike, I would say that the best bet here is to advise that we can't do verbose logging for 100K jobs. The danger is that we try to improve on the Python standard logging and don't do a good enough job. Is there some drop-in replacement for Logger that would improve this?
I've added some warning messages to cloud_scheduler.conf in the logging section, and also on startup if it's on verbose or debug logging, to be mindful of how many jobs are being submitted. I'm still looking for an alternative way to handle multithreaded logging better; I tried using a socketserver setup, but it had similar problems.
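One common pattern for this is to decouple the logging call from the slow handler with an in-process queue, so worker threads only pay for a cheap queue put and a single background thread does the actual I/O. (Python 3.2+ ships logging.handlers.QueueHandler/QueueListener for exactly this; the following is a minimal hand-rolled sketch, not Cloud Scheduler code.)

```python
import io
import logging
import queue
import threading

class AsyncHandler(logging.Handler):
    """Hand each record to a background thread so logging callers only
    pay for a queue put, never for the target handler's I/O lock."""

    def __init__(self, target):
        super().__init__()
        self._target = target            # the real (slow) handler
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, record):
        self._queue.put(record)          # no I/O on the caller's thread

    def _drain(self):
        while True:
            record = self._queue.get()
            if record is None:           # sentinel: stop draining
                break
            self._target.handle(record)

    def close(self):
        self._queue.put(None)            # flush the queue and join
        self._worker.join()
        self._target.close()
        super().close()

# Usage sketch: wrap the handler that would normally be a FileHandler.
buf = io.StringIO()
handler = AsyncHandler(logging.StreamHandler(buf))
log = logging.getLogger("cs.async.demo")
log.setLevel(logging.DEBUG)
log.addHandler(handler)
for i in range(5):
    log.debug("job update %d", i)
log.removeHandler(handler)
handler.close()
print(buf.getvalue())
```

The trade-off is that records are written slightly later than they are logged, and a crash can lose whatever is still queued.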
I tried to get a picture of the effect of the change in logging level on performance - CS is really quiet at the INFO level if no one is submitting new job types. At 6:56 AM June 7, I submitted 100 jobs to a fully occupied cluster. The first logging messages in /var/log/cloudscheduler.log showed up at 9:07 AM - here's a representative snippet:
2011-06-07 09:07:11,333 - INFO - Cleanup - Returning resources used by VM 3434 to Cluster hermes-westgrid
2011-06-07 09:07:12,011 - INFO - Cleanup - Destroyed vm 3434 on synnefo.westgrid.ca
2011-06-07 09:07:18,980 - INFO - Scheduler - Started vm 3568 on synnefo.westgrid.ca using image at http://vmrepo.phys.uvic.ca/vms/canfarbase_i386_seb_cadc.img.gz
2011-06-07 09:14:46,997 - INFO - Cleanup - Returning resources used by VM 3440 to Cluster hermes-westgrid
2011-06-07 09:14:47,781 - INFO - Cleanup - Destroyed vm 3440 on synnefo.westgrid.ca
2011-06-07 09:14:54,724 - INFO - Scheduler - Started vm 3569 on synnefo.westgrid.ca using image at http://vmrepo.phys.uvic.ca/vms/canfarbase_i386_seb_cadc.img.gz
...
The timestamp on the cloud_scheduler.pid file is June 3 11:01, so CS has been running for 5 days, and is showing about a 2 hour lag in response time, logging only at the INFO level.
There are currently 1623 jobs total in the condor queue.
Speaking of logging levels - John and I were talking about the necessary logging messages - here's a set of suggestions of what to see in INFO level log messages:
Another performance test:
Is the slow response limited to starting VMs?
Added to the previous comment - the last of 86 VMs was replaced at this time:
2011-06-09 07:22:58,985 - INFO - Cleanup - Returning resources used by VM 1550 to Cluster cadc
2011-06-09 07:22:59,605 - INFO - Cleanup - Destroyed vm 1550 on iris.cadc.dao.nrc.ca
2011-06-09 07:23:02,246 - INFO - Scheduler - Started vm 1551 on iris.cadc.dao.nrc.ca using image at http://vmrepo.phys.uvic.ca/vms/gwyn-50000000000052243.img.gz
More performance timing:
I submitted a single job as User 2 at 17:30, June 9, to a system fully occupied by jobs from User 1. CS logs the first attempt to shut down User 1's VMs to make room for this job at 23:58. There are currently 1321 jobs in the queue. Here's the relevant snippet of the log file:
2011-06-09 07:59:24,814 - INFO - Cleanup - Destroyed vm 3997 on synnefo.westgrid.ca
2011-06-09 07:59:27,790 - INFO - Scheduler - Started vm 4056 on synnefo.westgrid.ca using image at http://vmrepo.phys.uvic.ca/vms/gwyn-50000000000052243.img.gz
2011-06-09 23:58:19,920 - INFO - Cleanup - Returning resources used by VM 3826 to Cluster hermes-westgrid
2011-06-09 23:58:20,921 - INFO - Cleanup - Destroyed vm 3826 on synnefo.westgrid.ca
2011-06-09 23:58:27,496 - INFO - Scheduler - Started vm 4057 on synnefo.westgrid.ca using image at https://www4.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/vospace/CADCAuthtest1/SGOTEST/vmRepositoryPrivate/fabbros-50000000000060478.img.gz
I duplicated the good performance on a job removal as well.
condor_rm at 6:44:47 am, June 10; CS logged the shutdown of the VM at 6:44:58 am. Here's the log:
2011-06-10 06:42:23,890 - INFO - Scheduler - Started vm 4153 on synnefo.westgrid.ca using image at https://www4.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/vospace/CADCAuthtest1/SGOTEST/vmRepositoryPrivate/fabbros-50000000000060478.img.gz
2011-06-10 06:44:58,634 - INFO - Cleanup - Returning resources used by VM 4153 to Cluster hermes-westgrid
2011-06-10 06:44:59,306 - INFO - Cleanup - Destroyed vm 4153 on synnefo.westgrid.ca
2011-06-10 06:45:02,689 - INFO - Scheduler - Started vm 4154 on synnefo.westgrid.ca using image at http://vmrepo.phys.uvic.ca/vms/gwyn-50000000000052243.img.gz
Interesting that with the logging turned down, the cleanup and job poller threads are performing better while the scheduler is still suffering, or suffering more. I've made a separate issue for the INFO logging messages and will put in those changes shortly: https://github.com/hep-gc/cloud-scheduler/issues/191 I'm also still delving deeper into how the logging module works and exploring alternate handlers that may perform better in a threaded application.
Cloud Scheduler retires an existing VM when it recognizes a new VM Type, and waits for condor to gracefully shut down that job. Is that correct? Does it log the retirement? At what level?
When does CS first log an action when it recognizes a new VM Type? It's entirely possible that I'm confusing the logging timestamps with performance.
Should we try to get profiling of CS working on bifrost? Or would you rather concentrate on the changes to the info logging messages?
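If profiling does go ahead, Python's stdlib cProfile/pstats is the usual starting point. A minimal sketch of the shape such a run could take (job_poller_cycle here is a hypothetical stand-in, not the real Cloud Scheduler function):

```python
import cProfile
import io
import pstats

def job_poller_cycle(jobs):
    # Hypothetical stand-in for one pass of JobPoller-style work.
    return sorted(jobs)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    job_poller_cycle(list(range(2300, 0, -1)))
profiler.disable()

# Report the ten most expensive calls by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
print(buf.getvalue())
```

For a long-running daemon like CS, the enable/disable pair would need to wrap a bounded window (e.g. a few scheduler cycles) rather than the whole process, since lock-wait time shows up as time spent inside the locking calls.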
mhpx wrote:
Interesting that with the logging turned down cleanup and job poller threads are performing better and scheduler is suffering still/more. I've made a separate issue for the info logging messages and will put in those changes shortly. I'm also still delving deeper into how the logging module works and exploring alternate handlers that may perform better in a threaded application.
It logs when it first executes the condor_off (retire) command, and then again when the VM is being shut down; both of those are debug messages. There is no direct message that says "this vmtype is new" - the closest is in the verbose logging of the JobPoller: when it adds new jobs to the queue, the new types get added at that time.
Let me do the logging changes - I've also made a change to the scheduling algorithm to make it finish its cycle faster. Once those changes are in place and installed on bifrost, then it may help to get profiling output working.
This morning's test (June 16) - CS has been running for 22 hours :
Job submission at 6:31:40 AM, here's the job recognition by CS from the log at 6:32 AM:
2011-06-16 06:32:07,550 - INFO - JobPoller - goliaths-test1 vmtype added to required types
2011-06-16 06:32:12,964 - WARNING - Scheduler - Error creating VM goliaths-test1:
That looks pretty promising.
However, every creation by CS is currently resulting in error messages like this, "Problem: Resource request denied: Error creating workspace(s): No resource is available for this request (based on memory).", so comparing performance is questionable.
June 20 test - CS has been running since June 15:
Job submission (from command line):
Submitting job(s).
1 job(s) submitted to cluster 1041.
Mon Jun 20 08:27:02 PDT 2011
Job recognition (from CS log):
2011-06-20 08:27:10,283 - INFO - JobPoller - goliaths-test1 vmtype added to required types
Is that on bifrost or the other machine with CS? Because an 8 second response, give or take depending on where it is in the scheduling / polling intervals, seems fairly reasonable.
bifrost.
mhpx wrote:
is that on bifrost or the other machine with CS? because an 8 second response give or where it is in the scheduling / polling intervals seems fairly reasonable.
Closing issue - this was related to the logging service bogging down the system. Some improvements were made to help performance.
When CS starts up on the CADC bifrost installation, the interval between Job Poller cycles is around 30s. This interval gradually grows as performance decays.
Identified intervals:
CS immediately after restart - interval 40s:
CS running for 16 hours - interval 16 minutes:
CS running for 6 days - interval 3.5 hrs:
CS running for n days - interval up to 5 hrs:
CS running for n days + 5 - interval up to 7 hrs:
condor_retrieval_method is set to "local"
Characteristics of bifrost (condor central manager):
Random load check during the 5 hr delay event:
Processor information:
Here's the info for processor 0: