Closed: Amarok79 closed this issue 6 years ago
First, I'll look into your patchset. Honestly, I would have expected the GC to do better (especially for the tracing; boxing often happens without one even noticing it, and I've only seen the Mono IDE make it easy to spot).
I'm thinking of abstracting the code changes you made into a new class, to keep an already overly large function readable. But your code is sound (I couldn't find a better way; maybe Splice would be nice, but not for .NET 4.0). I miss C...
For the tracing, give me a moment to think. There's a ShouldTrace which I want to look into. I've only done something like this before for enums, as Enum.ToString() is very expensive.
> Are you sure that the use of the GC really is a problem?
I wouldn't classify this as a bug. I consider it more of an improvement, a kind of performance optimization. At least that was my intention.
These allocations aren't really a big thing on Windows on modern hardware, but in my scenario they add a base allocation load that I'd like to reduce or remove.
Maybe some background: I have three serial devices, besides other sensors, that I continuously communicate with. My app is a kind of hardware-testing application and throughput is the main goal. Reducing allocations, where achievable with reasonable effort, helps to lower that background load. This matters even more because the app runs on low-cost x86 hardware.
> I'm thinking of abstracting the code changes you made into a new class, to keep an already overly large function readable. But your code is sound (I couldn't find a better way; maybe Splice would be nice, but not for .NET 4.0). I miss C...
Yes, that might make sense. Maybe refactoring out into a nested type, since there wouldn't be another consumer of that helper class. If you want, I can change the PR accordingly.
> For the tracing, give me a moment to think. There's a ShouldTrace which I want to look into. I've only done something like this before for enums, as Enum.ToString() is very expensive.
ShouldTrace is for implementers of custom TraceSource. That doesn't help.
I'm trying to test this. What tools are you using to check the performance? PerfMon isn't working for me at the moment, and I'll need more time to investigate that. I'd like to confirm that any changes made work as expected.
I'm using a real memory profiler, like .NET Memory Profiler or dotMemory from JetBrains (the ReSharper makers). Those are commercial products, but there is also a memory profiler in the VS2017 Community Edition. That might do the trick, but I'm not sure, as I don't use it.
I'm trying out dotMemory briefly. It looks like one of the biggest culprits is raising events, because a new EventArgs object is allocated every time. Do you have experience with how this can be avoided? I'm still trying to understand the actual impact.
I implemented a very small test program that just writes data to one serial port and reads it from the other, in a loop with 100 ms in between. Here's the lifetime view:
The unmanaged heap (grey) is growing and shrinking, but GC0, GC1, GC2 and the LOH look pretty stable. I do admit I've never really done much in terms of memory profiling; I'm just relying on my experience in C, and hoped that in release mode the runtime knows how to clean up without the GC having to run.
Hey @jcurl, let me say, you already did a pretty nice job with this library in terms of allocations and performance. Many thanks for a robust serial library. I know that it's hard to write, as I did something similar (closed source and not maintained anymore) years ago, when there wasn't a SerialPort class available. Seems like decades ago ;-)
Next, don't misunderstand my intentions. I don't argue for allocating zero objects; that's not what the GC is for. We should distinguish between allocations that serve some purpose and those that are pure overhead.
I would say event args serve a purpose. That's the pattern used in .NET, and it requires allocating those event args. That's fine, because those objects are really small and (mostly) live a really short time. Working around that would be possible by pooling those objects, but that's really not worth the effort (a rough sketch of what that would look like is below).
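For completeness, pooling could look roughly like this (only a sketch; the type and members are made up for illustration, not your actual API):

using System;
using RJCP.IO.Ports;   // for the SerialData enum used below

// Hypothetical immutable event args that just wrap an enum value.
public class DataReceivedArgs : EventArgs
{
    public SerialData EventType { get; private set; }
    private DataReceivedArgs(SerialData eventType) { EventType = eventType; }

    // One cached instance per interesting value; raising an event reuses these
    // instead of allocating a new object every time.
    private static readonly DataReceivedArgs s_Eof = new DataReceivedArgs(SerialData.Eof);

    public static DataReceivedArgs For(SerialData eventType)
    {
        if (eventType == SerialData.Eof) return s_Eof;
        return new DataReceivedArgs(eventType);   // fall back to allocating
    }
}

But again, those objects are tiny and short-lived, so I wouldn't bother.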
The allocations that I see coming from tracing don't serve a purpose (except when you actually want to trace); in production mode they are pure overhead. Therefore, I suggest removing or reducing them, if the effort is small enough.
Now, when analyzing the memory traffic of my scenario, I like to use "Group by Types". I then see the following:
The bulk of the WaitHandle[] allocations comes from the .NET Framework itself (in fact, WaitHandle.WaitAny() creates another copy of the supplied array; see the Reference Source). Note, I already use a version of SerialPortStream that reuses/pools the WaitHandle[] here (a sketch of the idea is below). So we did our best to reduce allocations (in fact we cut them in half), but we can't fix the .NET Framework. Okay for me.
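Just to sketch the reuse idea (the class, field and method names here are made up for illustration, not the actual code):

using System.Threading;

class CommEventLoop
{
    // Allocated once and reused for every wait, instead of building a new WaitHandle[] each iteration.
    private readonly WaitHandle[] m_WaitHandles = new WaitHandle[2];

    public int WaitForEvents(WaitHandle stopEvent, WaitHandle dataEvent, int timeOut)
    {
        m_WaitHandles[0] = stopEvent;
        m_WaitHandles[1] = dataEvent;
        // WaitAny() still copies the array internally (see the Reference Source),
        // but at least our side of the allocation is gone.
        return WaitHandle.WaitAny(m_WaitHandles, timeOut);
    }
}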
There are a few WaitHandle[] allocations coming from SerialBuffer.WaitForRead(), but that's a minority. Not sure whether we can improve that easily.
Next.
The Object[] allocations are primarily coming from the params Object[] of your TraceEvent(..) invocations. With every TraceEvent(..), an Object[] is implicitly created to hold the arguments.
Then we have the boxing of all value-type arguments like SerialError and SerialPinChange, because those get stored into the Object[] and therefore need to be boxed.
Then there are the TraceEventCache objects used internally by TraceSource.
Those are all useless allocations that we can suppress by simply wrapping TraceSource.TraceEvent() in another if (..).
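To illustrate what a single unguarded call costs, this is roughly what the compiler generates for one of your trace calls (a sketch of the equivalent C#, not code that exists anywhere in the repository):

// Written in the source:
SerialTrace.TraceSer.TraceEvent(System.Diagnostics.TraceEventType.Verbose, 0,
    "{0}: SerialThread: Overlapped ReadFile() error {1} bytes {2}", m_Name, win32Error, bytes);

// Roughly what the compiler emits for the params Object[] parameter:
object[] args = new object[3];
args[0] = m_Name;             // string, a reference type, stored directly
args[1] = (object)win32Error; // int, boxed
args[2] = (object)bytes;      // uint, boxed
SerialTrace.TraceSer.TraceEvent(System.Diagnostics.TraceEventType.Verbose, 0,
    "{0}: SerialThread: Overlapped ReadFile() error {1} bytes {2}", args);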
Actually, I find it interesting to get to the roots of memory allocation and to better understand how to write high-performance code from the start. I'm from the C/assembly world and very used to managing my own memory, hence I think about memory allocations from the beginning.
Thanks for your link to the Reference Source. It explains why one of my changes, making some arrays reusable fields instead of allocating them on the heap every time only to throw them away, didn't work as I expected (but I will integrate it). That was certainly the easier part to optimize.
Now, I'm already using a switch as shown below. Googling around, everyone says (and the MS docs say) don't do it. Honestly, I think the benefit of doing this is huge, so I wanted to really look at what's going on here.
// ERROR_OPERATION_ABORTED may be caused by CancelIo or PurgeComm
if (SerialTrace.TraceSer.Switch.ShouldTrace(System.Diagnostics.TraceEventType.Verbose)) {
    SerialTrace.TraceSer.TraceEvent(System.Diagnostics.TraceEventType.Verbose, 0,
        "{0}: SerialThread: Overlapped ReadFile() error {1} bytes {2}", m_Name, win32Error, bytes);
}
All this for a trace!
// [464 33 - 464 120]
IL_0480: call class [System/*23000003*/]System.Diagnostics.TraceSource/*01000032*/ RJCP.IO.Ports.SerialTrace/*02000012*/::get_TraceSer()/*060000C2*/
IL_0485: callvirt instance class [System/*23000003*/]System.Diagnostics.SourceSwitch/*01000074*/ [System/*23000003*/]System.Diagnostics.TraceSource/*01000032*/::get_Switch()/*0A0000F2*/
IL_048a: ldc.i4.s 16 // 0x10
IL_048c: callvirt instance bool [System/*23000003*/]System.Diagnostics.SourceSwitch/*01000074*/::ShouldTrace(valuetype [System/*23000003*/]System.Diagnostics.TraceEventType/*01000060*/)/*0A0000F3*/
IL_0491: stloc.s V_41
IL_0493: ldloc.s V_41
IL_0495: brfalse.s IL_04cf
// [464 121 - 464 122]
IL_0497: nop
// [465 37 - 466 132]
IL_0498: call class [System/*23000003*/]System.Diagnostics.TraceSource/*01000032*/ RJCP.IO.Ports.SerialTrace/*02000012*/::get_TraceSer()/*060000C2*/
IL_049d: ldc.i4.s 16 // 0x10
IL_049f: ldc.i4.0
IL_04a0: ldstr "{0}: SerialThread: Overlapped ReadFile() error {1} bytes {2}"
IL_04a5: ldc.i4.3
IL_04a6: newarr [mscorlib/*23000001*/]System.Object/*01000018*/
IL_04ab: dup
IL_04ac: ldc.i4.0
IL_04ad: ldarg.0 // this
IL_04ae: ldfld string RJCP.IO.Ports.Native.Windows.CommOverlappedIo/*02000026*/::m_Name/*040000C2*/
IL_04b3: stelem.ref
IL_04b4: dup
IL_04b5: ldc.i4.1
IL_04b6: ldloc.s win32Error_V_38
IL_04b8: box [mscorlib/*23000001*/]System.Int32/*0100004C*/
IL_04bd: stelem.ref
IL_04be: dup
IL_04bf: ldc.i4.2
IL_04c0: ldloc.s bytes
IL_04c2: box [mscorlib/*23000001*/]System.UInt32/*01000073*/
IL_04c7: stelem.ref
IL_04c8: callvirt instance void [System/*23000003*/]System.Diagnostics.TraceSource/*01000032*/::TraceEvent(valuetype [System/*23000003*/]System.Diagnostics.TraceEventType/*01000060*/, int32, string, object[])/*0A000084*/
IL_04cd: nop
So we're allocating arrays and then boxing the integers into that array (just as you said). Kind of makes me want to puke at how much memory is being used. No wonder we need i7 processors with mega L1 caches. Well, I'm more productive now with C# than assembly... the price we pay.
I should be able to get something done to generalize tracing and still keep the code readable. Give me a couple of days.
An implementation is on branch "bugfix/dotnet-163" with commit 08a51143eb5367ef78b5e1dd239b41630e47b7c3. The number of objects allocated should be sharply reduced when there's no tracing.
Both the tracing problem and the excessive allocation of WaitHandles have been addressed, with commits 4999c40340c0945ac5d2b5313900580cc9a458b5 and 93eb323633bafbbe5922c7fdff20539d5fa432c6 respectively. I look forward to your testing.
I created a local package with your newest changes and integrated it into my project. Here are some numbers for comparison. These represent the number of allocations and bytes allocated by your library for 1000 serial request/response cycles of my test application.
Before: 49000 allocated objects with 711000 bytes allocated
Now: 30228 allocated objects with 402000 bytes allocated
This is a great improvement. Fantastic!
Nevertheless, I still see some boxing of SerialError and SerialPinChange occurring, both in RJCP.SerialPortStream!RJCP.IO.Ports.SerialPortStream.HandleEvent(object). 10000 objects each of SerialError and SerialPinChange are being allocated, amounting to 120 KB of allocations each.
It seems this is caused by Enum.HasFlag(..), which does a hidden boxing...
// else if (((Enum)(object)serialDataFlags).HasFlag((Enum)(object)SerialData.Eof))
// sequence point: (line 1898, col 28) to (line 1898, col 72) in d:\GitHub\SerialPortStream\code\SerialPortStream.cs
IL_0106: ldloc.0
IL_0107: box RJCP.IO.Ports.SerialData
IL_010c: ldc.i4.2
IL_010d: box RJCP.IO.Ports.SerialData
IL_0112: call instance bool [mscorlib]System.Enum::HasFlag(class [mscorlib]System.Enum)
// (no C# code)
IL_0117: brfalse.s IL_0126
In general, using Enum.HasFlag() isn't recommended; it's known to be slow (https://stackoverflow.com/questions/7368652/what-is-it-that-makes-enum-hasflag-so-slow). Use a bit-wise comparison in hot paths instead, as sketched below.
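Using the serialDataFlags variable from your disassembly above, the change would look roughly like this:

// Instead of (boxes both operands on .NET 4.0):
//   if (serialDataFlags.HasFlag(SerialData.Eof)) { ... }
// a plain bit-wise test compiles down to an integer AND, with no allocation at all:
if ((serialDataFlags & SerialData.Eof) != 0) {
    // handle the EOF case as before
}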
Apart from that, a big improvement. Many thanks!
Done. Committed in dbca7120ef8c1903fbb8d36a489113c1f1bcfb0a
Okay, I let some stress tests run over the weekend with your latest version. Everything's alright... :+1:
I'm currently investigating the memory allocations of an application that uses your library. My usage scenario consists of sending a short message to a serial device, waiting for the response, and then doing that again. When talking to the serial device every 15-20 ms, I see a number of allocations coming from your library.
I noticed a large number of allocations (~11 KB/sec) coming from a single location that can easily be improved. Please see PR #57 for that.
Next, I see allocations coming from your use of TraceSource for tracing, which generates ~35 KB/sec in my scenario, even though tracing is disabled. This comes mainly from boxing arguments and from TraceSource itself. I would like to reduce allocations here, but don't quite know how to proceed.
Basically, we need to rewrite every TraceEvent(..) call into a guarded form, wrapped in an additional if (..) (see the sketch at the end of this post). This additional if (..) will ensure that we don't box arguments if it isn't really necessary.
Now, TraceSource itself doesn't support something like IsEnabled, so there is no way to determine whether tracing has been enabled or not. I would propose introducing another BooleanSwitch into the code, so that we can check that condition. But that would mean developers need to add another line to their app.config for enabling tracing (in addition to https://github.com/jcurl/SerialPortStream/wiki/Tracing). Something like the sketch below would be necessary.
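Roughly like this (all names here are placeholders, just to show the idea, not a final proposal):

// An additional switch, next to the existing TraceSource
// (assuming it gets added to the existing SerialTrace class):
internal static readonly System.Diagnostics.BooleanSwitch TraceEnabled =
    new System.Diagnostics.BooleanSwitch("SerialPortStreamTrace", "Enables SerialPortStream tracing");

// Every trace call site then becomes:
if (SerialTrace.TraceEnabled.Enabled) {
    SerialTrace.TraceSer.TraceEvent(System.Diagnostics.TraceEventType.Verbose, 0,
        "{0}: SerialThread: Overlapped ReadFile() error {1} bytes {2}", m_Name, win32Error, bytes);
}

// And app.config would need one more entry under <system.diagnostics>/<switches>, e.g.
//   <add name="SerialPortStreamTrace" value="true" />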
What do you think?