microsoft / ApplicationInsights-dotnet

MIT License

InMemoryChannel does not allow user to supply retry logic or to access the queued items/count to account for transient failures. #1123

Closed brdeyo closed 3 years ago

brdeyo commented 6 years ago

We are a team at Microsoft building services for the Windows ES and are using Application Insights to capture telemetry. For our current project, we are deploying OS images to build machines and rely on a WinPE image/environment to set up the physical machines for image deployment. In this WinPE environment we have access to neither ETW nor a reliable file system, so we are leveraging the InMemoryChannel to buffer and send telemetry events.

The issue is that the InMemoryChannel does not let us detect when an attempt to send a telemetry event fails, and it does not expose how many items are currently buffered/queued awaiting send. When Windows ES workloads generate a lot of network traffic, we fail to get telemetry off the machines during OS image deployment. While we do not expect a 100% success rate, we would like to be able to apply transient error handling (i.e. retry policies) when sending telemetry events. We would also like to know when events have not yet been sent, so that we can wait for all of them to be flushed before our image deployment process exits and the machine is rebooted.

Telemetry is critical to our ability to monitor and manage this service and is especially important for our customers who will be using the service to manage their reimaging process.

We believe that exposing the second constructor on the InMemoryChannel as public (or protected) as well as the TelemetryBuffer and InMemoryTransmitter classes will allow us to add custom retry and buffered item visibility logic using an Adapter class over the InMemoryChannel.

cijothomas commented 5 years ago

Found this unattended issue while looking at bug fixes in ServerTelemetryChannel.

Since you need retry mechanisms for transient failures but cannot use the file system for storage, I'd recommend using ServerTelemetryChannel itself, but with MaxTransmissionStorageCapacity set to 0. This means you'll get all the reliability of ServerTelemetryChannel minus the ability to store items to disk when the in-memory capacity is full. You can also set MaxTransmissionBufferCapacity to a high value so that more transmissions can be kept in memory, avoiding the need for disk storage at all.

Please let us know if this can work for you.

cijothomas commented 5 years ago

Just realized that, due to the ErrorHandling policies, the above suggestion won't work: upon encountering any network error, items are sent to DiskStorage and retried after 'n' seconds. If disk storage is not available, they'll be lost.

brdeyo commented 5 years ago

Cijo,

Thanks for the follow-up response. Our team works here at Microsoft on one of the engineering systems. We implemented an in-memory channel that uses a persistent in-memory queue to hold events until they are successfully sent. We took some of the naming conventions and patterns from the InMemoryChannel, TelemetryBuffer, etc. and revamped them for increased reliability. We additionally exposed the number of non-transmitted events in the buffer and a "flush-and-wait" feature that lets our applications make any number of retry attempts over a specified interval before exiting. This has been very effective for our applications and services.

For example: our networks get very, very busy in the evenings, and we were originally losing 10+% of our telemetry events to failed network calls. With the new channel implementation, we are losing less than 1%, and typically less than 0.1%, even on the busiest evenings. We rely heavily on telemetry to run our service, especially for monitoring and alerting.
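
The behaviour described above (items stay queued until a send succeeds, the pending count is visible, and a flush-and-wait call retries until a deadline) can be sketched roughly as follows. This is only an illustration in Python, not the actual ReliableInMemoryChannel code and not the SDK's C# API: `ReliableBuffer`, `pending_count`, `try_send_pending`, and `flush_and_wait` are hypothetical names, and `send` stands in for whatever performs the network call.

```python
import threading
import time
from collections import deque
from typing import Callable, Deque, List


class ReliableBuffer:
    """Hypothetical sketch: keep items in memory until the sender reports success."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        # `send` is an assumed callback: True on success, False on a transient failure.
        self._send = send
        self._items: Deque[str] = deque()
        self._lock = threading.Lock()

    @property
    def pending_count(self) -> int:
        """Number of items that have not been sent successfully yet."""
        with self._lock:
            return len(self._items)

    def enqueue(self, item: str) -> None:
        with self._lock:
            self._items.append(item)

    def try_send_pending(self) -> int:
        """Attempt each queued item once; failures stay queued. Returns the number sent."""
        with self._lock:
            batch = list(self._items)
            self._items.clear()
        failed: List[str] = []
        sent = 0
        for item in batch:
            try:
                ok = self._send(item)
            except Exception:
                ok = False  # treat any exception as a transient failure
            if ok:
                sent += 1
            else:
                failed.append(item)
        with self._lock:
            # Re-queue failures at the front, preserving their original order.
            self._items.extendleft(reversed(failed))
        return sent

    def flush_and_wait(self, timeout: float, retry_interval: float = 0.5) -> bool:
        """Retry until the buffer is empty or `timeout` seconds pass. True if emptied."""
        deadline = time.monotonic() + timeout
        while True:
            self.try_send_pending()
            if self.pending_count == 0:
                return True
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            time.sleep(min(retry_interval, remaining))
```

A process would call `flush_and_wait` right before exiting (for example, before a reboot) and use the return value or `pending_count` to decide whether anything was left unsent. A fixed retry interval is used here for brevity; a backoff policy could be substituted.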

We would like to submit the 'ReliableInMemoryChannel' we wrote back into the Application Insights codebase so that others across the company and the world can take advantage of it. Is this something that your team would support?

Thanks,
Bryan DeYoung
Senior Software Engineer, COSINE | Windows Engineering System



cijothomas commented 5 years ago

@longrunconsulting Thanks, Bryan, for sharing more context. We are always glad to accept contributions to the SDK! I'd like to know more about the specifics of what ReliableInMemoryChannel does.

Would it have solved your issue if ServerTelemetryChannel had a 'no-disk' option, so that on errors items are kept in memory and retried from memory? And if it exposed more internals, like pending items? I'd like to avoid having a third channel (it's hard to explain and typically causes a lot of customer confusion), and instead use the contribution to improve the existing ServerTelemetryChannel, as that is the default one for all customers.

cijothomas commented 5 years ago

You can mail me at cithomas@microsoft*.com (remove all the *s) and we can have a call or something to discuss more, if that's faster.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 300 days with no activity. Remove stale label or comment or this will be closed in 7 days.