jcurl / RJCP.DLL.SerialPortStream

SerialPortStream is an independent implementation of System.IO.Ports.SerialPort and SerialStream for better reliability and maintainability. Default branch is 2.x and now has support for Mono with help of a C library.
Microsoft Public License
624 stars 197 forks source link

Large number of allocations resulting from tracing, even though it's disabled #58

Closed Amarok79 closed 6 years ago

Amarok79 commented 6 years ago

I currently investigate memory allocations of an application, which uses your library. My usage scenario consists of a short message sent to a serial device, then waiting for a response and then do that again. When talking to the serial device every 15-20 ms, I see a number of allocations coming from your library.

I noticed a large number of allocations (~ 11 KB/sec) coming from a single location that can be easily improved. Please, see PR #57 for that.

Next, I see allocations coming from your use of TraceSource for tracing, which generates ~35 KB/sec in my scenario, even though tracing is disabled. This comes mainly from boxing arguments and from TraceSource itself.

I would like to reduce allocations here, but don't know how to proceed.

Basically, we need to rewrite every

SerialTrace.TraceSer.TraceEvent(System.Diagnostics.TraceEventType.Verbose, 0,
    "{0}: SerialThread: ProcessWriteEvent: {1} bytes - Purged", m_Name, bytes);

to something like

if (IsTracingEnabled)
{
    SerialTrace.TraceSer.TraceEvent(System.Diagnostics.TraceEventType.Verbose, 0,
        "{0}: SerialThread: ProcessWriteEvent: {1} bytes - Purged", m_Name, bytes);
}

This additional if(..) will ensure that we don't box arguments if not really necessary.

Now, TraceSource itself doesn't support something like IsEnabled, thus, there is no way to determine whether tracing has been enabled or not.

I would propose to introduce another BooleanSwitch into code, so that we can check condition. But, that would mean developers need to add another line to their app.config for enabling tracing (in addition to https://github.com/jcurl/SerialPortStream/wiki/Tracing).

Something like this would be necessary.

<?xml version="1.0" encoding="utf-8" ?>
 <configuration>
   <system.diagnostics>
     <switches>
       <add name="SerialPortStreamTracingEnabled" value="true" />
     </switches>
   </system.diagnostics>
 </configuration>

What do you think?

jcurl commented 6 years ago

First, I'll look into your patchset. Honestly, I would have expected the GC to do better (especially for the tracing, boxing occurs often without even knowing, and I've only seen the Mono IDE make it easy to spot this).

I'm thinking of abstracting the code changes you made into a new class to keep readability in an already overly large function. But your code is sound (I couldn't find a better way, maybe the Splice would be nice, but not for .NET 4.0). I miss C...

jcurl commented 6 years ago

For the tracing, give me a moment to think. There's a ShouldTrace which I want to look into. I only did something like this prior for Enums, as Enum.ToString() is very expensive.

Are you sure that the use of the GC really is a problem?

Amarok79 commented 6 years ago

I wouldn't classify this as bug. I consider this more as an improvement, a kind of performance optimization. At least that was my intention.

These allocations aren't really a big thing on Windows on modern hardware, but it's a base-allocation-load in my scenario that I'd like to reduce/remove.

Maybe, some background. I have 3 serial devices beside other sensors I continuously communicate with. My app is a kind of hardware testing application and throughput is the main goal. Reducing allocations when achievable with a reasonable effort helps to reduce the base-background-load. More important as the app runs on low-cost x86 hardware.

Amarok79 commented 6 years ago

I'm thinking of abstracting the code changes you made into a new class to keep readability in an already overly large function. But your code is sound (I couldn't find a better way, maybe the Splice would be nice, but not for .NET 4.0). I miss C...

Yes, that might make sense. Maybe refactoring out into a nested type, since there wouldn't be another consumer of that helper class. If you want, I can change the PR accordingly.

Amarok79 commented 6 years ago

For the tracing, give me a moment to think. There's a ShouldTrace which I want to look into. I only did something like this prior for Enums, as Enum.ToString() is very expensive.

ShouldTrace is for implementers of custom TraceSource. That doesn't help.

jcurl commented 6 years ago

I'm trying to test. What tools are you using to check the performance? PerfMon isn't working for me at the moment, and I'll need more time to investigate that. I'd like to confirm any changes done work as expected.

Amarok79 commented 6 years ago

I'm using real memory profiler, like .NET Memory Profiler or dotMemory from Resharper. Those are commercial products, but there is also a memory profiler in VS2017 Community Edition. This might do the trick, but I'm not sure as I don't use it.

jcurl commented 6 years ago

I'm trying out dotMemory briefly. It looks like one of the biggest culprits is raising events, because a new EventArgs object is allocated every time. Do you have experience how this is avoided? I'm still trying to understand the actual impact.

image

I implemented a very small test program that just writes data to one serial port, and reads it from the other in a loop with 100ms in between. Here's the lifetime.

image

The unmanaged heap (grey) is growing and shrinking, but the GC0, GC1, GC2 and LOH look pretty stable. I do admit, I've never really done much in terms of memory profiling, just using my experience in C, and hoped that in release mode the GC knows to clean up without causing the GC to run

Amarok79 commented 6 years ago

Hey @jcurl, let me say, you already did a pretty nice job with this library in terms of allocations and performance. Many thanks for a robust serial library. I know that it's hard to write as I did something similar (closed source and not maintained anymore) years ago as there wasn't a SerialPort class available. Seems decades ago ;-)

Next, don't misunderstood my intentions. I don't argue for allocating zero objects. That's not what the GC is for. We should distinguish between allocations that serve some purpose and those that are pure overhead.

I would say, event args serve a purpose. That's the pattern used in .NET and it requires to allocate those event args. That's fine, because those objects are really small and live a really short time (mostly). Working around that would be possible by pooling those objects but that's really not worth the effort.

The allocations that I see coming from tracing, don't serve a purpose (except when you want to trace), but in production mode those are pure overhead without use. Therefore, I suggest to remove/reduce them, if the effort is small enough.


Now, when analyzing memory traffic of my scenario, I liked to use "Group by Types". I then see following: capture1

The number of WaitHandle[] allocations is coming from .NET FX itself (in fact, WaitHandle.WaitAny() itself creates another copy of the supplied array; see Reference Source). Note, I already use a version of SerialStreamPort that reuses/pools WaitHandle[] here. So we did our best to reduce allocations (in fact cut it by half), but can't fix .NET FX. Okay for me.

There are a few WaitHandle[] coming from SerialBuffer.WaitForRead(). But that's a minority. Not sure, whether we can improve that easily.

Next. capture2

The Object[] allocations are primarily coming from the params Object[] of your TraceEvent(..) invocations. With every TraceEvent(..) an Object[] is implicitly created for holding the arguments.

Then we have the boxing of all value-type arguments like SerialError, SerialPinChange, because those get stored into the Object[] and therefore need to be boxed.

Then the TraceEventCache objects used internally by TraceSource.

Those are all useless allocations that we can suppress by simply wrapping TraceSource.TraceEvent() into another if (..).

jcurl commented 6 years ago

Actually, I find it interesting to get to the roots of memory allocation and understanding better how to write high performance code from the start. I'm from the C/Assembly world and I'm very used to managing my own memory, hence initially thinking about memory allocations from the start.

Thanks for your link to the reference sources. Explains why one of my changes didn't work as I expected (but will integrate) to make some arrays global instead of allocating on the heap every time to throw them away. That was certainly easier to optimize.

Now, I'm already using a switch as shown. Now googling everyone says (and MS docs says) don't do it. Honestly, I think the benefit of doing this is huge. I wanted to really look at what's going on here.

// ERROR_OPERATION_ABORTED may be caused by CancelIo or PurgeComm
if (SerialTrace.TraceSer.Switch.ShouldTrace(System.Diagnostics.TraceEventType.Verbose)) {
    SerialTrace.TraceSer.TraceEvent(System.Diagnostics.TraceEventType.Verbose, 0,
        "{0}: SerialThread: Overlapped ReadFile() error {1} bytes {2}", m_Name, win32Error, bytes);
}

All this for a trace!

// [464 33 - 464 120]
      IL_0480: call         class [System/*23000003*/]System.Diagnostics.TraceSource/*01000032*/ RJCP.IO.Ports.SerialTrace/*02000012*/::get_TraceSer()/*060000C2*/
      IL_0485: callvirt     instance class [System/*23000003*/]System.Diagnostics.SourceSwitch/*01000074*/ [System/*23000003*/]System.Diagnostics.TraceSource/*01000032*/::get_Switch()/*0A0000F2*/
      IL_048a: ldc.i4.s     16 // 0x10
      IL_048c: callvirt     instance bool [System/*23000003*/]System.Diagnostics.SourceSwitch/*01000074*/::ShouldTrace(valuetype [System/*23000003*/]System.Diagnostics.TraceEventType/*01000060*/)/*0A0000F3*/
      IL_0491: stloc.s      V_41

      IL_0493: ldloc.s      V_41
      IL_0495: brfalse.s    IL_04cf

      // [464 121 - 464 122]
      IL_0497: nop          

      // [465 37 - 466 132]
      IL_0498: call         class [System/*23000003*/]System.Diagnostics.TraceSource/*01000032*/ RJCP.IO.Ports.SerialTrace/*02000012*/::get_TraceSer()/*060000C2*/
      IL_049d: ldc.i4.s     16 // 0x10
      IL_049f: ldc.i4.0     
      IL_04a0: ldstr        "{0}: SerialThread: Overlapped ReadFile() error {1} bytes {2}"
      IL_04a5: ldc.i4.3     
      IL_04a6: newarr       [mscorlib/*23000001*/]System.Object/*01000018*/
      IL_04ab: dup          
      IL_04ac: ldc.i4.0     
      IL_04ad: ldarg.0      // this
      IL_04ae: ldfld        string RJCP.IO.Ports.Native.Windows.CommOverlappedIo/*02000026*/::m_Name/*040000C2*/
      IL_04b3: stelem.ref   
      IL_04b4: dup          
      IL_04b5: ldc.i4.1     
      IL_04b6: ldloc.s      win32Error_V_38
      IL_04b8: box          [mscorlib/*23000001*/]System.Int32/*0100004C*/
      IL_04bd: stelem.ref   
      IL_04be: dup          
      IL_04bf: ldc.i4.2     
      IL_04c0: ldloc.s      bytes
      IL_04c2: box          [mscorlib/*23000001*/]System.UInt32/*01000073*/
      IL_04c7: stelem.ref   
      IL_04c8: callvirt     instance void [System/*23000003*/]System.Diagnostics.TraceSource/*01000032*/::TraceEvent(valuetype [System/*23000003*/]System.Diagnostics.TraceEventType/*01000060*/, int32, string, object[])/*0A000084*/
      IL_04cd: nop          

So we're allocating arrays and then boxing the integers into that array (just as you said). Kind of makes me want to puke on how much memory is being used. No wonder we need i7 processors with mega L1 caches. Well, I'm more productive now with C# than assembly... The price we pay.

I should be able to get something done to generalize tracing and still keep the code readable. Give me a couple of days.

jcurl commented 6 years ago

An implementation is on branch "bugfix/dotnet-163" with commit 08a51143eb5367ef78b5e1dd239b41630e47b7c3. The number of objects allocated should be sharply reduced when there's no tracing.

jcurl commented 6 years ago

Both the tracing problem, and the excessive allocation of WaitHandles has been implemented with 4999c40340c0945ac5d2b5313900580cc9a458b5 and 93eb323633bafbbe5922c7fdff20539d5fa432c6 respectively. I look forward to your testing.

Amarok79 commented 6 years ago

I created a local package with your newest changes and integrated it into my project. Here are some numbers for comparison. These represent the number of allocations and bytes allocated by your library for 1000 serial request/response cycles of my test application.

Before: 49000 allocated objects with 711000 bytes allocated
Now:    30228 allocated objects with 402000 bytes allocated

This is a great improvement. Fantastic!

Nevertheless, I still see some boxing of SerialError and SerialPinChange occurring. Both in RJCP.SerialPortStream!RJCP.IO.Ports.SerialPortStream.HandleEvent( object ).

10000 objects of each SerialError and SerialPinChange being allocated, each with 120 KB of allocation.

Seems as this is caused by Enum.HasFlag(..). It does a hidden boxing...

// else if (((Enum)(object)serialDataFlags).HasFlag((Enum)(object)SerialData.Eof))
// sequence point: (line 1898, col 28) to (line 1898, col 72) in d:\GitHub\SerialPortStream\code\SerialPortStream.cs
IL_0106: ldloc.0
IL_0107: box RJCP.IO.Ports.SerialData
IL_010c: ldc.i4.2
IL_010d: box RJCP.IO.Ports.SerialData
IL_0112: call instance bool [mscorlib]System.Enum::HasFlag(class [mscorlib]System.Enum)
// (no C# code)
IL_0117: brfalse.s IL_0126

In general, Enum.HasFlag() isn't recommended to use. It's known to be slow (https://stackoverflow.com/questions/7368652/what-is-it-that-makes-enum-hasflag-so-slow). Use a bit-wise comparison in hot paths instead.

Apart from that, a big improvement. Many thanks!

jcurl commented 6 years ago

Done. Commited in dbca7120ef8c1903fbb8d36a489113c1f1bcfb0a

Amarok79 commented 6 years ago

Okay, let some stress-tests run over the weekend with your latest version. Everything alright... :+1: