microsoft / ApplicationInsights-dotnet

Bug in Application Insights SDK results in fake inflated bills #1750

Closed SidShetye closed 4 years ago

SidShetye commented 4 years ago

Steps to reproduce.

We cannot say exactly how the SDK got into this state; we simply noticed it in our pre-production environment. Note there is also a Microsoft Support Ticket # 120020121000224 open since Jan 31st but progress has been p.a.i.n.f.u.l. on it.

What is the expected behavior?

Billing that reflects actual consumption.

What is the actual behavior?

Billing is inflated because the SDK is artificially inflating usage. See details below.

Concerns

A couple of them, stacked, leading to something worth reporting back:

  1. Application Insights recently decided to capture all EventCounters by default and send them (arguably acceptable as a product evolution; a sketch of how to opt out follows below).
  2. There is a bug that causes the SDK to send a LOT of those (not acceptable).
  3. Customers still get billed for "excess" use of something because of a defect in the product (yikes! absolutely unacceptable).

In our case the increase to our bill was insignificant in the grand scheme of things because we had set up hard caps on daily billing/collection, but other customers who haven't done this would be in for a very rude shock. I wanted to raise this with the team because such billing behavior can give a product a bad reputation.
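For anyone who just wants this collection off, here is a minimal sketch of how to opt out of the default EventCounter collection (assuming the Microsoft.ApplicationInsights.AspNetCore package; the option name may vary slightly across 2.x versions, so verify against your SDK):

// Startup/Program sketch: disable the module that publishes EventCounters as customMetrics.
// Assumes Microsoft.ApplicationInsights.AspNetCore; check that EnableEventCounterCollectionModule
// exists in your SDK version before relying on it.
using Microsoft.ApplicationInsights.AspNetCore.Extensions;
using Microsoft.Extensions.DependencyInjection;

public static class TelemetrySetup
{
    public static void AddAppInsightsWithoutEventCounters(IServiceCollection services)
    {
        var options = new ApplicationInsightsServiceOptions
        {
            // Stops the EventCounter -> customMetrics flow entirely.
            EnableEventCounterCollectionModule = false
        };

        services.AddApplicationInsightsTelemetry(options);
    }
}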

Details

We noticed a sudden spike in our AppInsights metrics

[image]

Turns out this is from the customMetrics that application insights automatically collects (i.e. it's not our code that's causing this).

[image]

Particularly at this time

[image]

Those metrics are supposed to be independent of traffic and low volume, scaling only with the number of instances. That makes it even more strange, because our billing shows those custom metrics to be 800% greater than our scaled-out peak and 4000% greater than our scaled-down minimum (in terms of instance count).

Since our instances never scaled anywhere close to that (off by an order of magnitude at least), we suspect the issue lies either in the SDK sending the events out or in the service recording the events.

cijothomas commented 4 years ago

I was engaged on this ticket last week and shared my findings (about an upgrade happening around the same time the issue occurred, and how that can plausibly cause a temporary spike in instance count).

Regarding your concerns:

  1. Application Insights has, over the years, evolved to collect telemetry from more and more sources. EventCounter is one of them. (We are still considering whether this should be opt-in; a sketch of how to narrow collection today follows this list.)

  2. Fully agree that the SDK should not send more telemetry than it is supposed to. However, we have never been able to repro this behavior, nor has any other customer reported it. (That doesn't mean the SDK has no bug, just that we have no repro or report.) So I cannot confirm yet whether there is a bug in the product.

  3. Only today did I get any evidence that the spike in data is likely from a product bug rather than from a spike in instance count, after seeing that more event counters than expected were reported under a single instance name. This information was not available before, so I could not confirm the source of the issue. Even though we cannot conclusively say there is a bug, we'd recommend offering a billing refund.
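In the meantime, a minimal sketch of how collection can be narrowed to an explicit allow-list of counters today (assuming an ASP.NET Core host and the EventCounterCollectionModule shipped with this SDK; the counter names below are only examples):

// Sketch: replace the default EventCounter set with an explicit allow-list.
// Assumes Microsoft.ApplicationInsights.AspNetCore plus the Microsoft.ApplicationInsights.EventCounterCollector package.
using Microsoft.ApplicationInsights.Extensibility.EventCounterCollector;
using Microsoft.Extensions.DependencyInjection;

public static class EventCounterSetup
{
    public static void AddAppInsightsWithCounterAllowList(IServiceCollection services)
    {
        services.AddApplicationInsightsTelemetry();

        services.ConfigureTelemetryModule<EventCounterCollectionModule>((module, options) =>
        {
            module.Counters.Clear();   // drop whatever defaults were registered
            module.Counters.Add(new EventCounterCollectionRequest("System.Runtime", "cpu-usage"));
            module.Counters.Add(new EventCounterCollectionRequest("System.Runtime", "gen-2-gc-count"));
        });
    }
}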

I am trying to get a repro for this issue. If you can share any more details on how your upgrades are rolled out, it'd really help us get to a repro. From your case, it seems like the issue occurred only during an upgrade. I have had no luck getting a repro so far, so any specifics about the upgrade process in your environment would help.

SidShetye commented 4 years ago

Thanks @cijothomas. Our upgrades are fairly run-of-the-mill. That cluster is composed of dual-deployed App Services across Windows Server and Linux. On Linux, the published release is packaged as a container into Azure Container Registry, from where it is automatically deployed into Azure App Service for Linux containers. On Windows, the published files are deployed into a staging slot, which then gets swapped into production. Could a glitch in App Service's infrastructure/controller have temporarily caused this?

We're totally fine with no refund, because asking Azure billing for one (just $20-$30 extra, thanks to our calibrated daily caps) isn't worth the time I'd have to spend documenting everything. We're reporting it here purely as a matter of principle.

cijothomas commented 4 years ago

Thanks for sharing more details. Can you figure out from the role instance name whether the excess data came from the Windows or the Linux servers? That'll help narrow this down further.

I suspect the deployment from the container in ACR to App Service for Linux is a possible place where things can go wrong. (I am not familiar with this model, so I'll have to try it out to see if I can reproduce at least some spike in evtc.)

SidShetye commented 4 years ago

Sure, it appears it was from the Linux containers, which always have an ephemeral hostname that is a hash. The Windows hosts are stable and identifiable by their RDxxxxxxxx hostnames.

[image]

SidShetye commented 4 years ago

Also, in terms of which SDK version was contributing to that:

customMetrics
| where timestamp > datetime(2020-1-30T02:00:00Z) and timestamp < datetime(2020-1-30T06:00:00Z)
| where sdkVersion startswith "evtc:"
| summarize count() by sdkVersion, bin(timestamp, 1d)
1/30/2020, 12:00:00.000 AM | evtc:2.12.0-21496 | 11,707
1/30/2020, 12:00:00.000 AM | evtc:2.12.1-36699 | 583,310  <====

cijothomas commented 4 years ago

So far we know:

  1. The huge spike occurred on Linux, during an update of the AI SDK.
  2. It was the new SDK that reported the huge counter volumes.
  3. Between 2.12 and 2.12.1 there are no EventCounter changes; 2.12.1 was a hotfix for an unrelated issue.
  4. The issue self-mitigated after a few hours.

ank3it commented 4 years ago

@SidShetye : We are processing your refund but wanted some clarity on the exact dates when you experienced a spike in your bill. Do you mind replying back to me with those details at anksr@microsoft.com?

-Ankit S, Product Manager, Azure

SidShetye commented 4 years ago

@ank3it I appreciate the gesture but, IIRC, the final dollar value was just an extra $30 or so towards the end of Jan/start of Feb. While our otherwise generous daily cap protected our bill, we did lose medium-level operational data to this low-priority, AI-generated junk flood as it burnt through our caps.

But the biggest issue was that the defect caused our compute instances to misbehave, which required several days of (far more $$$) engineering time to investigate. So we're reporting it here to have it fixed in the product rather than to get a refund.

TimothyMothra commented 4 years ago

we need a repro to investigate this case

SidShetye commented 4 years ago

we need a repro to investigate this case

@TimothyMothra @ank3it @cijothomas I can see how that helps, but we have already freely donated tons of time helping the vendor (you) fix your software defect. See:

Note there is also a Microsoft Support Ticket # 120020121000224 open since Jan 31st but progress has been p.a.i.n.f.u.l. on it.

Asking us for any additional information (= time, = $$$) to improve your product quality is quite unreasonable at this point. If you wish to improve your product, you are free to put in the effort by all means. If you wish to ignore and close this ticket, that's fine too; it is your product after all. For our part, we've moved on to other APM products, and even Prometheus looks quite promising.

ank3it commented 4 years ago

Hey Sid,

We apologize for any distress this bug might have caused you and your engineering team. As a product team we are always looking for feedback from our customers, so that we can make the necessary improvements to the product.

Your support in gathering debugging data is highly valuable to us and we sincerely thank you for all your efforts. For this, we've gone ahead and processed a 50% credit to your monthly usage.

Unfortunately, we've not been able to replicate this issue, nor have we seen any other instances of it being reported to us. For these reasons, we'll now be closing this issue.

Regards, Ankit

joshsten commented 8 months ago

We noticed some similar behavior appear spontaneously on September 27th of this year in one of our environments. We are in the process of implementing a custom sampling override for the counters. There are two specific instances sending orders of magnitude more counter telemetry from the EventCounterCollectionSource, even after a restart. Hard caps protected us in our case too. We have extra counters enabled, but it appears to be multiple GB of data for a couple of instances over a few hours. Most if not all of the counters appear to scale up by the same amounts. This is on the 2.21 family of AI packages. Some environments seem fine. These are all Linux-based Docker containers running on Linux Kubernetes nodes.

Our usage graphs, characterization, and patterns look very close to the original poster's. Only ours did not resolve on its own. We are implementing custom sampling as a telemetry filter to bypass the issue ourselves (a sketch of the idea is below).
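For reference, a minimal sketch of the kind of telemetry filter we mean (EventCounterFilterProcessor is a made-up name, not our exact production code; the "evtc:" check leans on the sdkVersion prefix visible in the Kusto output earlier in this thread, so verify that the prefix is already populated at the processor stage in your SDK version):

// Sketch: ITelemetryProcessor that drops EventCounter-generated metrics before they are sent.
using Microsoft.ApplicationInsights.Channel;          // ITelemetry
using Microsoft.ApplicationInsights.DataContracts;    // MetricTelemetry
using Microsoft.ApplicationInsights.Extensibility;    // ITelemetryProcessor, GetInternalContext()
using Microsoft.ApplicationInsights.Extensibility.Implementation;

public class EventCounterFilterProcessor : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;

    public EventCounterFilterProcessor(ITelemetryProcessor next) => _next = next;

    public void Process(ITelemetry item)
    {
        if (item is MetricTelemetry metric)
        {
            // EventCounter metrics are stamped with an "evtc:" sdkVersion
            // (see the customMetrics query results above).
            var sdkVersion = metric.Context.GetInternalContext().SdkVersion;
            if (sdkVersion != null && sdkVersion.StartsWith("evtc:"))
            {
                return; // swallow the counter metric instead of sending it
            }
        }

        _next.Process(item); // pass everything else down the pipeline
    }
}

// Registration in an ASP.NET Core host (after AddApplicationInsightsTelemetry):
// services.AddApplicationInsightsTelemetryProcessor<EventCounterFilterProcessor>();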

Example image attached. This case shows triplicated values at exactly the same times for the same counters. Even if one tried to claim that was multiple instances, it seems highly unlikely that so many would line up so perfectly, down to milliseconds or nanoseconds, on both timestamps and values. Perhaps it's multiple data points over a minute sent at the same time? Even then, there are clearly some crazy counter metric numbers going on, because our charts look like the OP's. We run apps in Kubernetes. It appears only two of our environments are affected. There appears to be something environmental or contextual that triggers it, even with new deploys.

While I saw some commentary claiming this involved PerformanceCounters rather than EventCounters, the spikes I see are happening in AppMetrics in Log Analytics, and I was able to confirm the queries and volumes for the added EventCounters. PerformanceCounters appear to be okay.

Similar Report on SO: https://stackoverflow.com/questions/57881155/why-am-i-seeing-a-large-spike-in-data-ingestion-from-performance-counter-data

[image]