Discussion thread: cloud Process and cloud Task creation perf

dsyme commented 9 years ago

This is to start a discussion about the perf of process creation. I'd like to understand what the limiting factor is - is it response time from writing to Azure? Or network download speed? Size of uploaded data + code? Or performance of Vagabond?

The script I am using is this - starting 100 processes one by one. When run from my laptop over mediocre WiFi it takes ~20seconds to start each process. Data centre is West Europe.

let ps = 
 [for i in 0 .. 100 ->
   printfn "starting %d, time = %A" i System.DateTime.Now.TimeOfDay
   cloud { let t = System.DateTime.Now
           let x = ref 0
           do while System.DateTime.Now - t < System.TimeSpan.FromMinutes(2.0) do
                for i in 0 .. 10000 do
                    for j in 0 .. 10000 do x := !x + 1 
         }
    |> cluster.CreateProcess ]
cluster.ShowProcesses()

Here is the output

starting 0, time = 12:28:19.1499964
starting 1, time = 12:28:41.9689327
starting 2, time = 12:29:02.4742978
starting 3, time = 12:29:22.3268908
starting 4, time = 12:29:42.4681581
starting 5, time = 12:30:03.4127964
starting 6, time = 12:30:23.8800452
starting 7, time = 12:30:44.5387323
starting 8, time = 12:31:05.4803253
starting 9, time = 12:31:26.7159774
starting 10, time = 12:31:48.4095294
starting 11, time = 12:32:11.5083444
starting 12, time = 12:32:33.9698615

dsyme commented 9 years ago

After resetting F# Interactive I get much faster process creation time. I assume this is because the overall code size of the dynamic DLL and its dependencies is much smaller. It would be great if we could

let the user know when the process uploads are getting big or slow, suggesting to restart F# interactive
optimize process creation/upload so this is not a problem quite so soon

starting 0, time = 12:33:56.1740424
starting 1, time = 12:34:04.4120094
starting 2, time = 12:34:06.2991333
starting 3, time = 12:34:08.0335563
starting 4, time = 12:34:09.7631435
starting 5, time = 12:34:11.6191563
starting 6, time = 12:34:13.3961138
starting 7, time = 12:34:15.1474590

dsyme commented 9 years ago

Actually, looking at that example progress it seems that process creation gets linearly slower, in the example after a rest we've gone from 2sec/process to 8sec/process over 0 .. 100 processes

starting 0, time = 12:33:56.1740424
starting 1, time = 12:34:04.4120094
starting 2, time = 12:34:06.2991333
starting 3, time = 12:34:08.0335563
starting 4, time = 12:34:09.7631435
starting 5, time = 12:34:11.6191563
starting 6, time = 12:34:13.3961138
starting 7, time = 12:34:15.1474590
starting 8, time = 12:34:17.0073365
starting 9, time = 12:34:18.7474529
starting 10, time = 12:34:20.5928775
starting 11, time = 12:34:22.4557389
starting 12, time = 12:34:24.2526437
starting 13, time = 12:34:26.0968364
starting 14, time = 12:34:27.9718738
starting 15, time = 12:34:29.7174930
starting 16, time = 12:34:31.5414533
starting 17, time = 12:34:33.4445446
starting 18, time = 12:34:35.1945900
starting 19, time = 12:34:37.0846583
starting 20, time = 12:34:38.9818311
starting 21, time = 12:34:40.8473459
starting 22, time = 12:34:42.7066732
starting 23, time = 12:34:44.6674521
starting 24, time = 12:34:46.5501728
starting 25, time = 12:34:48.3729118
starting 26, time = 12:34:50.1865610
starting 27, time = 12:34:52.0499914
starting 28, time = 12:34:54.3771508
starting 29, time = 12:34:56.2208669
starting 30, time = 12:34:58.2592608
starting 31, time = 12:35:00.4852576
starting 32, time = 12:35:02.5398938
starting 33, time = 12:35:04.8228515
starting 34, time = 12:35:07.0109156
starting 35, time = 12:35:09.1805417
starting 36, time = 12:35:11.3154956
starting 37, time = 12:35:13.5573854
starting 38, time = 12:35:16.0288518
starting 39, time = 12:35:18.2471167
starting 40, time = 12:35:20.3929989
starting 41, time = 12:35:23.2298295
starting 42, time = 12:35:25.8215989
starting 43, time = 12:35:27.9879812
starting 44, time = 12:35:30.2743372
starting 45, time = 12:35:33.0669830
starting 46, time = 12:35:35.5357341
starting 47, time = 12:35:38.6288490
starting 48, time = 12:35:41.7401289
starting 49, time = 12:35:45.1042800
starting 50, time = 12:35:48.3119449
starting 51, time = 12:35:50.9406286
starting 52, time = 12:35:54.5161755
starting 53, time = 12:35:57.4131168
starting 54, time = 12:36:01.3590098
starting 55, time = 12:36:04.9073392
starting 56, time = 12:36:08.5085063
starting 57, time = 12:36:13.1590636
starting 58, time = 12:36:18.1347959
starting 59, time = 12:36:23.0005032
starting 60, time = 12:36:28.3485034
starting 61, time = 12:36:33.9423014
starting 62, time = 12:36:39.5309374
starting 63, time = 12:36:45.7493232
starting 64, time = 12:36:51.6717347
starting 65, time = 12:36:57.6283406
starting 66, time = 12:37:04.5528208
starting 67, time = 12:37:11.2820965
starting 68, time = 12:37:18.3310534
starting 69, time = 12:37:25.7669861
starting 70, time = 12:37:33.0187962
starting 71, time = 12:37:39.7912644
starting 72, time = 12:37:48.4834902
starting 73, time = 12:37:56.6796231
starting 74, time = 12:38:04.8198344
starting 75, time = 12:38:13.6062867
starting 76, time = 12:38:21.3723419
starting 77, time = 12:38:29.1911691
starting 78, time = 12:38:37.5560858
starting 79, time = 12:38:45.6741036
starting 80, time = 12:38:53.9707021
starting 81, time = 12:39:02.1573171
starting 82, time = 12:39:10.2143203
starting 83, time = 12:39:18.5163225
starting 84, time = 12:39:27.4843666
starting 85, time = 12:39:37.0146024
starting 86, time = 12:39:46.1383409
starting 87, time = 12:39:55.0477290
starting 88, time = 12:40:04.2203387
starting 89, time = 12:40:13.5567591
starting 90, time = 12:40:23.3349220
starting 91, time = 12:40:33.7948666
starting 92, time = 12:40:43.7948585

dsyme commented 9 years ago

Here's a non-CPU perf test dedicated to process creation alone. Process creation time definitely seems to be increasing, from 1.5s initially to 4.5s by process 130


#load "credentials.fsx"

open MBrace
open MBrace.Azure
open MBrace.Azure.Client
open MBrace.Azure.Runtime
open MBrace.Streams
open MBrace.Workflows
open Nessos.Streams

let cluster = Runtime.GetHandle(config)

let ps = 
 [for i in 0 .. 10000 ->
   printfn "starting %d, time = %A" i System.DateTime.Now.TimeOfDay
   cloud { return System.DateTime.Now }
    |> cluster.CreateProcess ]

cluster.ShowProcesses()

results:

starting 0, time = 15:38:19.5428179
starting 1, time = 15:38:24.5260675
starting 2, time = 15:38:26.0988331
starting 3, time = 15:38:27.5830810
starting 4, time = 15:38:28.9807745
starting 5, time = 15:38:30.4258215
starting 6, time = 15:38:31.8632119
starting 7, time = 15:38:33.6095282
starting 8, time = 15:38:35.0720985
starting 9, time = 15:38:36.5526525
starting 10, time = 15:38:38.4201400
starting 11, time = 15:38:40.6917913
starting 12, time = 15:38:42.6135106
starting 13, time = 15:38:44.0821831
starting 14, time = 15:38:45.8938804
starting 15, time = 15:38:47.3714544
starting 16, time = 15:38:48.8149324
starting 17, time = 15:38:50.5641437
starting 18, time = 15:38:52.1015067
starting 19, time = 15:38:53.5644116
starting 20, time = 15:38:55.3763897
starting 21, time = 15:38:56.7573365
starting 22, time = 15:38:58.2736676
starting 23, time = 15:38:59.6876319
starting 24, time = 15:39:01.2187318
starting 25, time = 15:39:02.6717460
starting 26, time = 15:39:04.2120773
starting 27, time = 15:39:05.6896289
starting 28, time = 15:39:07.1895175
starting 29, time = 15:39:08.7120355
starting 30, time = 15:39:10.2980408
starting 31, time = 15:39:11.7631356
starting 32, time = 15:39:13.2473890
starting 33, time = 15:39:14.7457471
starting 34, time = 15:39:16.3237760
starting 35, time = 15:39:17.8002834
starting 36, time = 15:39:19.6652890
starting 37, time = 15:39:21.5326729
starting 38, time = 15:39:23.0111176
starting 39, time = 15:39:24.5120940
starting 40, time = 15:39:26.1057243
starting 41, time = 15:39:27.8359697
starting 42, time = 15:39:29.3783699
starting 43, time = 15:39:30.7653529
starting 44, time = 15:39:32.3395603
starting 45, time = 15:39:33.8033213
starting 46, time = 15:39:35.4859990
starting 47, time = 15:39:37.2358454
starting 48, time = 15:39:38.7889495
starting 49, time = 15:39:40.9683499
starting 50, time = 15:39:42.5909600
starting 51, time = 15:39:44.5449494
starting 52, time = 15:39:46.0917072
starting 53, time = 15:39:48.0748377
starting 54, time = 15:39:49.6694371
starting 55, time = 15:39:51.3099201
starting 56, time = 15:39:52.9506132
starting 57, time = 15:39:54.6390500
starting 58, time = 15:39:56.3299709
starting 59, time = 15:39:58.0173441
starting 60, time = 15:40:00.1448355
starting 61, time = 15:40:01.7540545
starting 62, time = 15:40:03.3345620
starting 63, time = 15:40:04.8241609
starting 64, time = 15:40:06.5699846
starting 65, time = 15:40:08.1480958
starting 66, time = 15:40:10.3415189
starting 67, time = 15:40:12.0289974
starting 68, time = 15:40:13.6696487
starting 69, time = 15:40:15.9692188
starting 70, time = 15:40:17.7035990
starting 71, time = 15:40:19.5278973
starting 72, time = 15:40:21.3352820
starting 73, time = 15:40:23.1321873
starting 74, time = 15:40:24.8426341
starting 75, time = 15:40:26.7429318
starting 76, time = 15:40:28.4376178
starting 77, time = 15:40:30.2184060
starting 78, time = 15:40:32.1327741
starting 79, time = 15:40:33.8213911
starting 80, time = 15:40:35.5883353
starting 81, time = 15:40:37.2239112
starting 82, time = 15:40:39.2737737
starting 83, time = 15:40:40.9994437
starting 84, time = 15:40:42.9296346
starting 85, time = 15:40:44.8477168
starting 86, time = 15:40:46.8764260
starting 87, time = 15:40:49.3128753
starting 88, time = 15:40:51.4038596
starting 89, time = 15:40:53.9605577
starting 90, time = 15:40:56.4448594
starting 91, time = 15:40:59.1100627
starting 92, time = 15:41:01.5920981
starting 93, time = 15:41:03.6858926
starting 94, time = 15:41:05.9359114
starting 95, time = 15:41:08.0246339
starting 96, time = 15:41:10.6559258
starting 97, time = 15:41:14.4882101
starting 98, time = 15:41:17.6322688
starting 99, time = 15:41:20.6630999
starting 100, time = 15:41:24.2881066
starting 101, time = 15:41:27.4162476
starting 102, time = 15:41:30.1679075
starting 103, time = 15:41:34.1408608
starting 104, time = 15:41:37.7252841
starting 105, time = 15:41:40.4146591
starting 106, time = 15:41:43.4303217
starting 107, time = 15:41:46.7109135
starting 108, time = 15:41:50.3705243
starting 109, time = 15:41:55.0578714
starting 110, time = 15:42:00.5158305
starting 111, time = 15:42:03.8967464
starting 112, time = 15:42:09.7890897
starting 113, time = 15:42:16.3642873
starting 114, time = 15:42:21.7589975
starting 115, time = 15:42:25.2824581
starting 116, time = 15:42:28.9972854
starting 117, time = 15:42:36.2516765
starting 118, time = 15:42:40.1982663
starting 119, time = 15:42:43.9775212
starting 120, time = 15:42:48.5368654
starting 121, time = 15:42:53.0956885
starting 122, time = 15:42:57.6216382
starting 123, time = 15:43:02.1396302
starting 124, time = 15:43:06.6441231
starting 125, time = 15:43:12.6482058
starting 126, time = 15:43:16.3716512
starting 127, time = 15:43:20.1575152
starting 128, time = 15:43:25.1534809
starting 129, time = 15:43:30.7941507
starting 130, time = 15:43:34.8670080
starting 131, time = 15:43:39.7666770
starting 132, time = 15:43:46.4711275
starting 133, time = 15:43:51.6147317
starting 134, time = 15:43:57.4389597
starting 135, time = 15:44:01.8354897

palladin commented 9 years ago

Yes, I see also the same behaviour. One useful test is to enable the logging mechanism with

runtime.AttachClientLogger(new Common.ConsoleLogger())

palladin commented 9 years ago

Based on my tests, I see that the majority time is spent to 02032015 18:00:46.416 +02:00 Submit process 9a88ac8bf35b4495a1d13c2ae3d4fb85.

krontogiannis commented 9 years ago

The submit process step

Creates a process record in table store.
Creates a cancellation token in table store.
Creates a result cell in table store.
Serializes the workflow.
Uploads workflow in blob.
Posts a message in service bus.

I don't see how any of the steps (except 4 maybe) have very different execution times (unless it's a network issue).

I will add some more detailed logging in the submit process step.

eiriktsarpalis commented 9 years ago

@dsyme Vagabond is almost certainly a factor that affects process start time. Apart from any assemblies, it needs to serialize and upload all top-level value bindings made in F# interactive. If they happen to be large, process creation will be delayed. In theory, every value is to be uploaded once, so an expanding assembly should not degrade performance in that respect.

dsyme commented 9 years ago

@eiriktsarpalis Yes, I expected that from vagabond. However in the given example the process creation appears to get slower even though there are no additional bindings at the top level.

@krontogiannis Additional logging sounds good

dsyme commented 9 years ago

Let's also use this thread to discuss task en-queuing and de-queuing perf. I'm a bit surprised at the overheads here. They may be the same cause as above - but are the overheads because of the queue or because of serialization?

For this first example below, I get one job created and de-queued per second - 11 in all in the 10 seconds allotted - this is on a 10-machine cluster

let rec ps(t,e) = 
 cloud { if e && System.DateTime.Now - t < System.TimeSpan.FromSeconds(10.0) then
             let! a,b = ps(t,true) <||> ps(t,false)
             return a+b
         else
             return 1 }

let job = 
   cloud { return! ps(System.DateTime.Now,true) }
     |> cluster.CreateProcess

In this next example it takes 74 seconds to create and enqueue 201 jobs and wait for them:

let  ps() = 
 cloud { let tasks = new ResizeArray<_>()
         for i in [ 0 .. 200 ] do 
             let! x = Cloud.StartAsCloudTask (cloud { do! Cloud.Sleep 1000
                                                      return 1 })
             tasks.Add x
         for t in tasks.ToArray() do 
             let! res = t.AwaitResult()
             ()
        }

let job = 
   cloud { return! ps() }
     |> cluster.CreateProcess

Here's typical utilization

 MBraceWorkerRole_IN_0  RD0003FF4BE325    0.35 / 2       32.91 / 3583.00        532.31 / 703.59     0 / 2        3064  02/03/2015 22:44:41 +00:00  05/03/2015 19:01:12 +00:00  True      
 MBraceWorkerRole_IN_9  RD0003FF4BC9F8    0.15 / 2       23.00 / 3583.00        363.54 / 475.67     0 / 2        3456  05/03/2015 17:08:27 +00:00  05/03/2015 19:01:12 +00:00  True      
 MBraceWorkerRole_IN_6  RD0003FF4B8644    3.20 / 2       22.89 / 3583.00        464.24 / 590.81     1 / 2        3832  05/03/2015 17:08:36 +00:00  05/03/2015 19:01:12 +00:00  True      
 MBraceWorkerRole_IN_4  RD0003FF4BF0B5    4.62 / 2       24.03 / 3583.00       636.97 / 1230.20     1 / 2        1448  05/03/2015 17:08:41 +00:00  05/03/2015 19:01:13 +00:00  True      
 MBraceWorkerRole_IN_7  RD0003FF4BBA94    6.12 / 2       23.28 / 3583.00       625.92 / 1231.77     1 / 2        3136  05/03/2015 17:08:42 +00:00  05/03/2015 19:01:13 +00:00  True      
 MBraceWorkerRole_IN_2  RD0003FF4B8A10    0.34 / 2       23.03 / 3583.00        343.08 / 393.35     0 / 2        3536  05/03/2015 17:08:42 +00:00  05/03/2015 19:01:12 +00:00  True      
 MBraceWorkerRole_IN_8  RD0003FF4BC24A    9.96 / 2       23.95 / 3583.00       1254.65 / 1833.27    2 / 2        3760  05/03/2015 17:08:43 +00:00  05/03/2015 19:01:13 +00:00  True      
 MBraceWorkerRole_IN_5  RD0003FF4BE90C    1.55 / 2       23.97 / 3583.00        463.62 / 571.30     0 / 2        3732  05/03/2015 17:08:45 +00:00  05/03/2015 19:01:12 +00:00  True      
 MBraceWorkerRole_IN_1  RD0003FF4BE904    1.83 / 2       24.06 / 3583.00        513.97 / 638.26     0 / 2        2292  05/03/2015 17:08:50 +00:00  05/03/2015 19:01:12 +00:00  True      
 MBraceWorkerRole_IN_3  RD0003FF4BBC64    0.52 / 2       26.01 / 3583.00        463.11 / 558.55     0 / 2         836  05/03/2015 17:08:53 +00:00  05/03/2015 19:01:13 +00:00  True

krontogiannis commented 9 years ago

Surely performance tuning is something that we will look into at some point. Some things to note:

During Enqueue the only thing that is optimized is Parallel and Choice as all of child jobs are enqueued in a single batch, instead of multiple StartChild that result in multiple Service Bus enqueue calls. So i would expect a Parallel with 200 jobs to have a much faster enqueue.
Worker dequeue fetches only one job at a time and has to check two queues for messages (the generic job queue and a queue for messages with affinity).
The maximum number of jobs that a worker executes concurrently is determined by Service.MaxConcurrentJobCount. Even if there are jobs in Service Bus, if there is no available slot they won't be executed.
No, serialization should not be an issue for the workflows above. When creating a job multiple calls to table storage are done to create cancellation tokens, result cells, etc. Most of these could probably be done in a single batch. Another thing is that job's payload by design is stored in blobs instead of the Service Bus message itself, so there is this latency issue.

So these are the issues I know of. I've been experimenting in my jobQueue branch with the above but I doubt if any changes can make it to the next release. And then we should find issues that I'm not aware of :).

dsyme commented 9 years ago

Somewhat tangential, but this does raise some questions about profiling on the target machines (which could be used to find out what's going on here). Presumably the Visual Studio profiling tools can connect to these machines and attach to these remote processes (likewise the debugger). I don't do much remote debugging and very little remote profiling but at some point it would be good to document what steps are needed to get a clean story here.

eiriktsarpalis commented 9 years ago

@dsyme On a side note, it is definitely possible to attach the debugger/profiler to MBrace worker roles provisioned from within Visual Studio. It would be interesting to see if this is also supported by Brisk-provisioned clusters. @isaacabraham is this a possibility?

dsyme commented 9 years ago

@krontogiannis It looks like the 100ms timeout delay here on each job-receive (while the worker waits for a control message) would contribute to the perf results seen above. If so, I think this could be written differently,, so that we don't have to actually pause while checking for control messages, e.g. instead do a merge of the events from ReceiveAsync on the queue and the control messages (Update etc.) coming from the Worker role apparatus?

Other observations

MaxConcurrentJobs seems to be set very low (certainly for a bunch of jobs that involve Cloud.Sleep...) AFAIK the .NET thread pool and task scheduler scales threads up and down as needed and puts sleeping jobs into the I/O completions (so though won't count them as active concurrent jobs). It does sort of raise the question as to whether the Azure worker implementation is trying to outsmart the
I reckon that Cloud.Logf may be causing a raw CPU performance hit since it is used quite often. Basically, the partial application of ksprintf is not cheap (indeed was really expensive in F# 3.1.0.0 and before). At least we should profile and check this.

isaacabraham commented 9 years ago

@eiriktsarpalis there should be no reason why that won't work aside from maybe existing the ports on the load balancer to individual nodes.

@dsyme there's another discussion somewhere here regarding the max job count - I agree its too low etc. The current value is set at something ridiculously low like = number of cores in the worker, which assumes that all workflows will only be limited by CPU, It could be much higher than that. Other reasons for a workflow being delayed includes IO, or simple thread blocking / delay.

As an example of you create some workflows that just does a Thread.Sleep, on the local machine these can happily run in parallel in their hundreds or thousands.

krontogiannis commented 9 years ago

Sure, MaxJobCount logic must be changed, but in order to do this we need to introduce some kind of scheduling, or more advanced dequeue logic.

At the early stages MJC was used to control the degree of parallelism, specially for CloudStream workflows, when there was no support for controlling where your jobs would execute.

dsyme commented 9 years ago

Is it possible to set MaxConcurrentJobs for workers programmatically? I looked around and didn't spot how. Thanks!

Thanks also for pointing out that it's possible to specify an IWorkerRef explicitly, I hadn't played with that as yet.

krontogiannis commented 9 years ago

At the moment its is possible to set this only in the workerrole code, when the worker starts. Eventually you will be able to have some control over this from your client.
Yeah, all of the distribution combinators take have this IWorkerRef overload. Pretty useful to implement scheduling in user space code.

dsyme commented 9 years ago

Re IWorkerRef - am I right in thinking that if the Worker specified is dead or exiting or stopped, then the work will still get scheduled elsewhere?

krontogiannis commented 9 years ago

No, jobs with affinity will either execute in given worker (if worker dies and restarts then the given FaultPolicy will be applied, job might retry, etc.). If the worker is stopped we have decided to raise a FaultException.

Note that the later case is not currently implemented, but that's our plan. It might be useful to give some tolerance on this, e.g. some kind of policy if executing with affinity fails, do we want to retry on any other machine or a FaultException is good.

dsyme commented 9 years ago

For 0.6.1-alpha the process start time is now steady, which is great

@krontogiannis, above you outlined the 6 steps for process creation.

The submit process step

Creates a process record in table store.

Creates a cancellation token in table store.

Creates a result cell in table store.

Serializes the workflow.

Uploads workflow in blob.

Posts a message in service bus.

I wonder, can the "CreateProcess" call return before all the steps 1-6 have completed, i.e. could the remaining steps be run asynchronously and concurrently on the local machine? Or is it somehow logically necessary to actually make the process manifest in the Azure storage before the Process object is returned to the user?

Speaking of which, is the MBrace job-creation-and-management API expected to be thread safe - could I fire off CreateProcess calls as background operations?

isaacabraham commented 9 years ago

Also - appreciate this might already be the case - can any of those steps be parallelised e.g. writes to table storage? Also I suspect that the serialization and upload is probably the biggest cost here.

dsyme commented 9 years ago

I'd say it's all network costs - in the example below FSI.EXE on my local machine runs at 3% CPU.

let ps = 
 [let time = ref System.DateTime.Now.TimeOfDay
  for i in 0 .. 10000 ->
   let newTime = System.DateTime.Now.TimeOfDay
   printfn "starting %d, diff = %A" i (newTime - time.Value)
   time := newTime
   cloud { return System.DateTime.Now }
    |> cluster.CreateProcess ]

dsyme commented 9 years ago

Closing this as its an old discussion

mbraceproject / MBrace.Azure

Discussion thread: cloud Process and cloud Task creation perf #25