Closed hongdai closed 10 years ago
Can you specify the rate and the scenario you're testing? Currently our stress requirements are quite low as we're not expecting any stress on kudu.
I have updated the issue with below:
Load: In our stress tests, we do one kudu operation to a site at a time, at a rate of around 2 kudu operations per min. Scenarios: A mix of update, delete, get. Get has the most load.
Do you have stack traces of the relevant deadlocked thread? Thanks!
I’ve sent it to the kududev as well.
From the call stacks, thread 25 and 26 looks deadlocked:
Thread 26 is to read stdout. The content shows it.
0:026> !do 0313cbe8 Name: System.Byte[] MethodTable: 72ebcf5c EEClass: 72b2c46c Size: 4108(0x100c) bytes Array: Rank 1, Number of elements 4096, Type Byte Content: [VfsUpdateBranch 9530d16] Committing update from request /scmvfs/service/api/random.js. 1 file changed, 9 deletions(-). delete m
Thread 25 should be read stderror, the content is empty.
25 Id: 8f4.e0c Suspend: 0 Teb: fe984000 Unfrozen
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
0060e7b0 72e4cb5b ntdll!ZwReadFile+0xc
0060e820 72dc8d02 mscorlib_ni!DomainNeutralILStubClass.IL_STUB_PInvoke(<HRESULT 0x80004001>)+0x67
0060e848 72dc8c51 mscorlib_ni!System.IO.FileStream.ReadFileNative(<HRESULT 0x80004001>)+0x72
0060e874 72dc8b35 mscorlib_ni!System.IO.FileStream.ReadCore(<HRESULT 0x80004001>)+0x41
0060e894 73554810 mscorlib_ni!System.IO.FileStream.Read(<HRESULT 0x80004001>)+0x95
0060e8a8 738754ca mscorlib_ni!System.IO.Stream.
26 Id: 8f4.5a4 Suspend: 0 Teb: fe966000 Unfrozen
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
00b0f6c0 72e4cb5b ntdll!ZwReadFile+0xc
00b0f730 72dc8d02 mscorlib_ni!DomainNeutralILStubClass.IL_STUB_PInvoke(<HRESULT 0x80004001>)+0x67
00b0f758 72dc8c51 mscorlib_ni!System.IO.FileStream.ReadFileNative(<HRESULT 0x80004001>)+0x72
00b0f784 72dc8b35 mscorlib_ni!System.IO.FileStream.ReadCore(<HRESULT 0x80004001>)+0x41
00b0f7a4 73554810 mscorlib_ni!System.IO.FileStream.Read(<HRESULT 0x80004001>)+0x95
00b0f7b8 738754ca mscorlib_ni!System.IO.Stream.
24 Id: 8f4.e54 Suspend: 0 Teb: fe963000 Unfrozen
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
00a8e258 743b2c4e ntdll!ZwWaitForMultipleObjects+0xc
00a8e2a8 743b29a5 clr!GetMetaDataInternalInterface+0x2e1aa
00a8e334 743b2a96 clr!GetMetaDataInternalInterface+0x2df01
00a8e3a0 743b2bf4 clr!GetMetaDataInternalInterface+0x2dff2
00a8e3ec 74396324 clr!GetMetaDataInternalInterface+0x2e150
00a8e400 743feada clr!GetMetaDataInternalInterface+0x11880
00a8e414 743fec18 clr!DllGetClassObjectInternal+0x12045
00a8e4c4 743fecf9 clr!DllGetClassObjectInternal+0x12183
* WARNING: Unable to verify checksum for mscorlib.ni.dll
* ERROR: Module load completed but symbols could not be loaded for mscorlib.ni.dll
00a8e560 72d9a5fa clr!DllGetClassObjectInternal+0x12264
00a8e570 72d9a53b mscorlib_ni!System.Threading.Monitor.Wait(<HRESULT 0x80004001>)+0x16
00a8e5c8 72e41f32 mscorlib_ni!System.Threading.Monitor.Wait(<HRESULT 0x80004001>)+0x7
00a8e5c8 72e0baa7 mscorlib_ni!System.Threading.ManualResetEventSlim.Wait(<HRESULT 0x80004001>)+0x182
00a8e608 734eda0b mscorlib_ni!System.Threading.Tasks.Task.SpinThenBlockingWait(<HRESULT 0x80004001>)+0x93
00a8e660 734ef2b0 mscorlib_ni!System.Threading.Tasks.Task.InternalWait(<HRESULT 0x80004001>)+0x153
00a8e674 73814e72 mscorlib_ni!System.Threading.Tasks.Task1[[System.Threading.Tasks.VoidTaskResult, mscorlib]].GetResultCore(<HRESULT 0x80004001>)+0xfffec1b8 00a8e6d0 05fd6778 mscorlib_ni!System.Threading.Tasks.Task
1[[System.Threading.Tasks.VoidTaskResult, mscorlib]].get_Result(<HRESULT 0x80004001>)+0x311ac2
* WARNING: Unable to verify checksum for Kudu.Core.dll
* ERROR: Module load completed but symbols could not be loaded for Kudu.Core.dll
00a8e6d0 05fd9981 Kudu_Core!Kudu.Core.Infrastructure.Executable.Execute(<HRESULT 0x80004001>)+0x170
00a8e704 0635c8c9 Kudu_Core!Kudu.Core.SourceControl.Git.GitExeRepository.Commit(<HRESULT 0x80004001>)+0x81
* WARNING: Unable to verify checksum for Kudu.Services.dll
* ERROR: Module load completed but symbols could not be loaded for Kudu.Services.dll
00a8e748 0635c737 Kudu_Services!Kudu.Services.SourceControl.LiveScmEditorController+
I got a chance to look at the dump. The symptom was the same as the previous issue (https://github.com/projectkudu/kudu/issues/754). In the past, Zumo ran into this with 'git rev-parse HEAD' command and this time with 'git commit -m ...'.
The git.exe process has exited however reading from its stdout and stderr were still blocked. In fact, in this situation, no data was ever read from stdout or stderr at all (meaning it was blocked on the first read attempt). What we expected was stdout/err pipe should be flushed and closed when process died.
I have written a simple stress calling GitExeRepository.CurentId in the tight loop .. so far it has been running for an hour without any issue. Will see how it go.
A couple of ideas we could do as a defend in depth
What’s the reason we cannot use the recommended way, i.e. use Process.BeginErrorReadLine to read stderror?
Thanks Hong. We certainly are exploring that approach.
Fix https://github.com/projectkudu/kudu/commit/b35f506fdcfd5cbbcb461a09517b05fe4be50a94 to be true async with stdio
This issue occurs in the recent Zumo Stress runs.
MSDN(http://msdn.microsoft.com/en-us/library/system.diagnostics.process.standarderror.aspx) recommends to use Process.BeginErrorReadLine or Process.BeginOutputReadLine to read either stdout or stderr asynchronously to avoid deadlocks.
The related code is in https://github.com/projectkudu/kudu/blob/master/Kudu.Core/Infrastructure/Executable.cs line 91.
I can provide a memory dump if needed.
Load: In our stress tests, we do one kudu operation to a site at a time, at a rate of around 2 kudu operations per min.
Scenarios: A mix of update, delete, get. Get has the most load.