mbraceproject / MBrace.Core

MBrace Core Libraries & Runtime Foundations
http://mbrace.io/
Apache License 2.0

Unexpected crash handling medium CloudFile #98

Closed: mathias-brandewinder closed this issue 9 years ago

mathias-brandewinder commented 9 years ago

The data file contains 42,000 images and can be found here: http://1drv.ms/1GnHRRA. The cloudValidation process faults every time with the following: Nessos.FsPickler.FsPicklerException: Error serializing object of type 'MBrace.Azure.Runtime.PickledJob'. ---> System.IO.IOException: There is not enough space on the disk. (details below)

#load "credentials.fsx"

open MBrace.Core
open MBrace.Azure
open MBrace.Azure.Client
open MBrace.Store
open MBrace.Workflows
open MBrace.Flow

let dataPath = __SOURCE_DIRECTORY__ + @"/../../../data/"
let fullDataPath = dataPath + "large.csv"
let cluster = Runtime.GetHandle(config)

CloudFile.Delete("data/large.csv") |> cluster.RunLocally
//let large = CloudFile.Upload(fullDataPath,"data/large.csv") |> cluster.RunLocally
let large = CloudFile("data/large.csv")

// works
cloud { 
    let! data = CloudFile.ReadAllLines(large.Path) 
    return data.Length }
|> cluster.Run

type Image = int[]
type Example = { Label:int; Image:Image }

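// Parse one CSV line: the first column is the label, the remaining 784 columns are pixel values.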
let parseLine (line:string) =
    let columns = line.Split ','
    let label = columns.[0] |> int
    let pixels = columns.[1..] |> Array.map int
    { Label = label; Image = pixels }

// works
System.IO.File.ReadAllLines fullDataPath
|> Array.map parseLine

// works
cloud { 
    let! data = CloudFile.ReadAllLines(large.Path) 
    let parsed = data |> Array.map parseLine
    return parsed.Length }
|> cluster.Run

cluster.AttachClientLogger(ConsoleLogger())

#time

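// L1 (Manhattan) distance between two 28x28 images.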
let distance (img1:Image) (img2:Image) =
    let mutable total = 0
    let size = 28 * 28
    for i in 0 .. (size - 1) do
        let diff = img1.[i] - img2.[i]
        total <- total + abs diff // or diff * diff for a squared (L2) distance
    total

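// 1-nearest-neighbor classifier: predict the label of the closest training example.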
let classifier (sample:Example[]) (image:Image) =
    sample
    |> Array.minBy(fun ex -> distance ex.Image image)
    |> fun x -> x.Label

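// Train/test split, then score the test set in parallel on the cluster.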
let cloudValidation = 
    cloud {
        let! data = CloudFile.ReadAllLines(large.Path)
        let train = data.[..40000] |> Array.map parseLine
        let test = data.[40001..] |> Array.map parseLine
        let model = classifier train
        let! correct =
            test
            |> CloudFlow.OfArray
            |> CloudFlow.withDegreeOfParallelism 16
            |> CloudFlow.map (fun ex ->
                    if model ex.Image = ex.Label then 1.0 else 0.0)
            |> CloudFlow.average
        return correct}

// faults
let job = cloudValidation |> cluster.CreateProcess
job.Completed
job.AwaitResult ()

cluster.ShowProcesses ()

Added the stack trace:

MBrace.Core.FaultException: Failed to execute job 'fe2eef6cfdf64eb28e375cd4e7a224f4' ---> Nessos.FsPickler.FsPicklerException: Error serializing object of type 'MBrace.Azure.Runtime.PickledJob'. ---> System.IO.IOException: There is not enough space on the disk.

   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.FileStream.WriteCore(Byte[] buffer, Int32 offset, Int32 count)
   at <StartupCode$FsPickler>.$FieldPicklers.writer@125-36.Invoke(WriteState w, String tag, T t) in c:\Users\eirik\Development\nessos\FsPickler\src\FsPickler\PicklerGeneration\FieldPicklers.fs:line 125
   at Nessos.FsPickler.CompositePickler`1.Write(WriteState state, String tag, T value) in c:\Users\eirik\Development\nessos\FsPickler\src\FsPickler\Pickler\CompositePickler.fs:line 202
   at recordSerializer(Pickler[] , WriteState , PickledJob )
   at <StartupCode$FsPickler>.$FSharpTypeGen.writer@191-44.Invoke(WriteState w, String tag, Record t) in c:\Users\eirik\Development\nessos\FsPickler\src\FsPickler\PicklerGeneration\FSharpTypeGen.fs:line 191
   at Nessos.FsPickler.CompositePickler`1.Write(WriteState state, String tag, T value) in c:\Users\eirik\Development\nessos\FsPickler\src\FsPickler\Pickler\CompositePickler.fs:line 202
   at Nessos.FsPickler.RootSerialization.writeRootObject[T](IPicklerResolver resolver, ReflectionCache reflectionCache, IPickleFormatWriter formatter, FSharpOption`1 streamingContext, Pickler`1 pickler, T value) in c:\Users\eirik\Development\nessos\FsPickler\src\FsPickler\FsPickler\RootSerialization.fs:line 38
   --- End of inner exception stack trace ---
   at Nessos.FsPickler.RootSerialization.writeRootObject[T](IPicklerResolver resolver, ReflectionCache reflectionCache, IPickleFormatWriter formatter, FSharpOption`1 streamingContext, Pickler`1 pickler, T value) in c:\Users\eirik\Development\nessos\FsPickler\src\FsPickler\FsPickler\RootSerialization.fs:line 41
   at Nessos.FsPickler.FsPicklerSerializer.Serialize[T](Stream stream, T value, FSharpOption`1 streamingContext, FSharpOption`1 encoding, FSharpOption`1 leaveOpen) in c:\Users\eirik\Development\nessos\FsPickler\src\FsPickler\FsPickler\Serializer.fs:line 47
   at <StartupCode$MBrace-Azure-Runtime>.$Queues.EnqueueBatch@149-2.Invoke(Tuple`2 _arg1) in C:\workspace\krontogiannis\MBrace.Azure\src\MBrace.Azure.Runtime\Primitives\Queues.fs:line 151
   at Microsoft.FSharp.Control.AsyncBuilderImpl.callA@803.Invoke(AsyncParams`1 args)
   --- End of inner exception stack trace ---

   at <StartupCode$MBrace-Azure-Client>.$Process.AwaitResultAsync@133-2.Invoke(Unit _arg4) in C:\workspace\krontogiannis\MBrace.Azure\src\MBrace.Azure.Client\Process.fs:line 133
   at Microsoft.FSharp.Control.AsyncBuilderImpl.args@787-1.Invoke(a a)
   at MBrace.Core.Internals.ExceptionDispatchInfoUtils.Async.RunSync[T](FSharpAsync`1 workflow, FSharpOption`1 cancellationToken) in c:\Users\eirik\Development\mbrace\MBrace.Core\src\MBrace.Core\Utils\ExceptionDispatchInfo.fs:line 139
   at <StartupCode$FSI_0023>.$FSI_0023.main@()
Stopped due to error
palladin commented 9 years ago

The interesting part is the "System.IO.IOException: There is not enough space on the disk." It is the same as in https://github.com/mbraceproject/MBrace.Core/issues/96. @krontogiannis I think we have seen this same exception before (with the Azure temp folder)?
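A quick diagnostic sketch for checking that theory (hypothetical, not an MBrace API; it just runs plain .NET IO through the cluster handle from the repro above) that reports free space on each drive of whichever worker picks up the job:

cloud {
    // (drive name, free space in MB) for every ready drive on the worker
    return
        System.IO.DriveInfo.GetDrives()
        |> Array.filter (fun d -> d.IsReady)
        |> Array.map (fun d -> d.Name, d.AvailableFreeSpace / (1024L * 1024L))
} |> cluster.Run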

palladin commented 9 years ago

The relevant code is https://github.com/mbraceproject/MBrace.Azure/blob/master/samples/MBrace.Azure.CloudService.WorkerRole/WorkerRole.cs#L33. @krontogiannis Is it possible that MS changed the folder structure/quota?
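For context, that line presumably wires up the role's local storage along these lines (a sketch only; the resource name "LocalStorage" is illustrative and may not match the sample):

// Point the worker's temp directories at the Azure local-storage resource,
// so large temporary files land on the bigger resource disk rather than
// the small approot drive.
open Microsoft.WindowsAzure.ServiceRuntime

let localStorage = RoleEnvironment.GetLocalResource "LocalStorage"
System.Environment.SetEnvironmentVariable("TMP", localStorage.RootPath)
System.Environment.SetEnvironmentVariable("TEMP", localStorage.RootPath)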

isaacabraham commented 9 years ago

Might this be to do with the size of the temp storage on each worker?

isaacabraham commented 9 years ago

@palladin OK - the Brisk implementation does the same thing (albeit with a different storage name) - we set that to a large size so there should be plenty of space there.

palladin commented 9 years ago

@isaacabraham OK, so you have increased the temp space size. Is it ready for @mathias-brandewinder to try again?

isaacabraham commented 9 years ago

It's already large. I think we set it to something silly like 40 GB.

mathias-brandewinder commented 9 years ago

FYI, just hit that exception again, on a different case :(

palladin commented 9 years ago

@mathias-brandewinder It looks like a systemic problem; it has nothing to do with a particular use case. We need to investigate whether it is an Azure issue.

mathias-brandewinder commented 9 years ago

OK, great - so to speak! The good news is, it seems this is the root cause of a bunch of issues I ran into in the past few days. The bad news is, this is a pretty big issue, at least for me. Good luck!

palladin commented 9 years ago

I'm digging into the root of the problem. I found the worker's local disk full.

And the following is a minimal example that reproduces it.

cloud {
    do System.IO.File.WriteAllText(
        System.IO.Path.GetRandomFileName(),
        System.String('1', 100000000))
} |> runtime.Run
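Note that Path.GetRandomFileName returns a bare file name, so the write above lands in the worker process's current directory, typically on the small approot drive. For contrast, a sketch of the same write rooted in the configured temp folder, which should sit on the larger local resource disk:

cloud {
    // same 100 MB payload, but written under the configured temp directory
    let path =
        System.IO.Path.Combine(
            System.IO.Path.GetTempPath(),
            System.IO.Path.GetRandomFileName())
    do System.IO.File.WriteAllText(path, System.String('1', 100000000))
} |> runtime.Run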

@krontogiannis Maybe you are using GetRandomFileName somewhere in MBrace.Azure?

krontogiannis commented 9 years ago

Fixed in https://github.com/mbraceproject/MBrace.Azure/commit/2f24461addcf5b7ab02658338770fce4b7978cee

isaacabraham commented 9 years ago

Awesome :-) That was quick!

palladin commented 9 years ago

@mathias-brandewinder You can pull the master branch and test your scenarios.