mbraceproject / MBrace.Azure

Microsoft Azure PaaS implementation for MBrace
http://mbrace.io
Apache License 2.0

First computation on a new cluster is extremely slow #126

Open isaacabraham opened 8 years ago

isaacabraham commented 8 years ago

I've observed this behaviour and it's reproducible. I don't know (yet) if it only occurs with CloudFlow, but here's the repro.

  1. Execute this code
[ "http://publicdata.landregistry.gov.uk/market-trend-data/price-paid-data/b/pp-2013-part1.csv" ]
|> CloudFlow.OfHttpFileByLine
|> CloudFlow.length
  2. Observe the timing
  3. Execute it again
  4. Observe the timing again (a timing sketch follows this list)
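
For what it's worth, here is a minimal timing sketch of the repro above. It assumes the usual Starter Kit setup has already been loaded in FSI and that cluster is a connected AzureCluster handle; the helper name is just for illustration.

open System.Diagnostics
open MBrace.Flow

// Run the CloudFlow query once on the cluster and print the wall-clock time.
let timeLengthQuery () =
    let sw = Stopwatch.StartNew()
    let lineCount =
        [ "http://publicdata.landregistry.gov.uk/market-trend-data/price-paid-data/b/pp-2013-part1.csv" ]
        |> CloudFlow.OfHttpFileByLine
        |> CloudFlow.length
        |> cluster.Run
    sw.Stop()
    printfn "Counted %d lines in %O" lineCount sw.Elapsed

timeLengthQuery ()   // first run on a fresh cluster: slow (~34s in the timings below)
timeLengthQuery ()   // second run: fast (~5s)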

The first call is significantly slower than the second. I can repeat this consistently, but only on a brand new cluster. After that, resetting FSI makes no difference.

So if you do it three times, you'll see timings like this: -

       806b043b-90a0-4750-87d5-56a4845e6852  Completed  00:00:33.8204630    0 /   0 /   3 /   3  int64                     10/11/2015 22:57:48  10/11/2015 22:58:22 
       8c041273-edf1-4f49-af4f-5b0073b72064  Completed  00:00:04.5669221    0 /   0 /   3 /   3  int64                     10/11/2015 22:59:05  10/11/2015 22:59:10 
       e6d89094-6a2c-45d7-b707-c055a5923ea7  Completed  00:00:04.8601043    0 /   0 /   3 /   3  int64                     10/11/2015 22:59:25  10/11/2015 22:59:29 

On one occasion I even saw a different number of work items for the "first" computation than the second and third calls. This is on MBrace.Azure (16.1) using the Starter Kit.

We probably need some further investigation - it could be something like: -

  1. Azure "warming up" the IO of the cluster the first time?
  2. Something with the first computation (or CloudFlow) on MBrace?
  3. Something to do with network caching for subsequent calls (I tend to discount this though).
  4. Something else?
palladin commented 8 years ago

A very interesting question to investigate. I have seen a very weird ~20 sec slowdown, but only for the first Service Bus enqueue call.

dsyme commented 8 years ago

This is probably just some Azure weirdness, right? E.g. taking time to put VMs in the right places or something?

eiriktsarpalis commented 8 years ago

I think so too. I think we need to test whether this slowdown also occurs when the first computation is a simple hello world.

eiriktsarpalis commented 8 years ago

Ok, I think I know what's going on here. On a fresh Azure cluster, the following computation

cluster.CreateProcess(Cloud.ParallelEverywhere(cloud { return 42 }))

takes 30 seconds on the first run and 2 seconds on the second. Also, running Isaac's example after that results in normal execution times. I think this is attributable to AppDomain isolation in the worker instances: worker AppDomains are lazily instantiated, and these can take quite a bit of time to start up. You can actually observe this by looking at individual worker logs:

cluster.Workers.[0].ShowSystemLogs()

which upon inspection reveal:

[2015-11-12 13:29:54] INFO : Dequeued work item '9ce2f7b1-9b99-4405-b164-7c6dd944e869'.
[2015-11-12 13:29:54] INFO : Concurrent work item count increased to 1/32.
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.Azure.Common.NetFramework, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management.Compute, Version=12.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.Azure.Common, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Hyak.Common, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Streams, Version=0.4.1.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'MBrace.Azure.Management, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'FSharp.Compiler.Interactive.Settings, Version=4.4.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'
[2015-11-12 13:29:54] INFO : Downloading 'FSI-ASSEMBLY_72050b76-f0d5-4379-9cb2-80368711bae3_1, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management.Storage, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'MBrace.Flow, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management.ServiceBus, Version=0.9.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.Threading.Tasks, Version=1.0.12.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'
[2015-11-12 13:29:57] INFO : TopicMonitor : starting topic maintenance.
[2015-11-12 13:29:57] INFO : Initializing Application Domain "a24819cc-4311-412e-be04-f48249618a63"
[2015-11-12 13:29:58] INFO : TopicMonitor : maintenance complete.
[2015-11-12 13:30:04] INFO : Initializing AppDomain ClusterManager
[2015-11-12 13:30:08] INFO : TopicMonitor : starting topic maintenance.
[2015-11-12 13:30:08] INFO : TopicMonitor : maintenance complete.
[2015-11-12 13:30:11] INFO : Starting cloud process '3a9d0577-24f6-4bad-8381-42d47948a658' of type 'int'.
[2015-11-12 13:30:13] INFO : Completed work item '9ce2f7b1-9b99-4405-b164-7c6dd944e869' after 00:00:01.1328427

This indicates that the first work item dequeue can take up to 20 seconds before it actually completes, even if it is a trivial cloud { return 42 }. You can also see that the actual execution took more than 1 second to complete. If I'm not mistaken, worker roles use the legacy 64-bit JIT compiler, which is known to be slow.

isaacabraham commented 8 years ago

Is there any way we can force each worker to eagerly start up and instantiate its AppDomain?
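
For illustration, here is a rough sketch of the kind of warm-up being suggested - not an existing MBrace.Azure feature, just a client-side workaround one could run right after connecting, again assuming a connected AzureCluster handle named cluster.

open MBrace.Core

// Run a trivial computation on every worker so that each worker spins up
// an AppDomain before any real work is submitted.
let warmUpWorkers () =
    cloud { return () }
    |> Cloud.ParallelEverywhere
    |> cluster.Run
    |> ignore

warmUpWorkers ()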

eiriktsarpalis commented 8 years ago

I'm not sure this would help much. Thespian, which uses the same underlying implementation, does not demonstrate the same degree of delay when sending a first computation. So it seems likely that this is caused by the time it takes to download the initial dependency set, or perhaps by some other Azure issue.

isaacabraham commented 8 years ago

Either that, or the configuration of the CLR on Azure machines - e.g. the 64-bit JIT - has some issue that should be handled differently. I don't know. It would be nice if we could know for sure what the issue is, even if it's something that can't be fixed at this time.

isaacabraham commented 8 years ago

How can I progress this - if it means going to the Azure team somewhere, fine - what's the next step in closing this issue out?

eiriktsarpalis commented 8 years ago

I'm not entirely convinced this is necessarily an Azure issue. As indicated by my earlier measurements, the delay seems to be happening during the initial assembly/data download (the TopicMonitor log outputs come from an independent entity and are not related to that particular dequeue).

palladin commented 8 years ago

I agree with Eirik's remarks. AppDomain initialization is the big delay factor.

[2015-11-12 13:29:57] INFO : Initializing Application Domain "a24819cc-4311-412e-be04-f48249618a63"
[2015-11-12 13:30:04] INFO : Initializing AppDomain ClusterManager
[2015-11-12 13:30:11] INFO : Starting cloud process '3a9d0577-24f6-4bad-8381-42d47948a658' of type 'int'.
[2015-11-12 13:30:13] INFO : Completed work item '9ce2f7b1-9b99-4405-b164-7c6dd944e869' after 00:00:01.1328427
isaacabraham commented 8 years ago

OK. So my original question was: can we not somehow eagerly initialize the AppDomain, e.g. send a dummy computation immediately to all nodes as part of the initialization process? Also, why don't we see the same behaviour with Thespian?

palladin commented 8 years ago

I think it is ad hoc to eagerly initialize AppDomains with dummy computations. Btw, Hadoop/Spark also exhibit the same warm-up behaviour.

isaacabraham commented 8 years ago

@palladin why ad hoc? And maybe Spark/Hadoop exhibit this, but so what :-)