open-telemetry / opentelemetry-swift


`SynchronousMetricStorage` thread issue #532

Closed jhoongo closed 2 months ago

jhoongo commented 4 months ago

Issue Details

I'm using the opentelemetry-swift SDK to emit (stable) metrics with the HTTP exporter, and I'm getting occasional crashes with the following exception:

Hardware Model:     iPhone12,5
Role:               Foreground
OS Version:         iOS 16.5.1
Exception Type:     EXC_BAD_ACCESS 
Exception Subtype:  KERN_INVALID_ADDRESS

EXC_BAD_ACCESS: Attempted to dereference garbage pointer 0x8000000000000010.

0  OpenTelemetrySdk +0x77270      specialized Dictionary.subscript.getter (<compiler-generated>)
1  OpenTelemetrySdk +0x77a54      SynchronousMetricStorage.recordDouble(value:attributes:) (SynchronousMetricStorage.swift:107:30)
2  OpenTelemetrySdk +0x77e98      protocol witness for WritableMetricStorage.recordDouble(value:attributes:) in conformance SynchronousMetricStorage (<compiler-generated>)
3  OpenTelemetrySdk +0x5b24c      DoubleHistogramMeterSdk.record(value:attributes:) (DoubleHistogramMeterSdk.swift:27:17)
4  OpenTelemetrySdk +0x5b358      protocol witness for DoubleHistogram.record(value:attributes:) in conformance DoubleHistogramMeterSdk (<compiler-generated>)
5  MyApp +0x13d40c   closure #1 in OtlpClient.emitDoubleHistogram(name:value:attributes:) (OtlpClient.swift:94:14)
6  MyApp +0x141838   thunk for @callee_guaranteed () -> () (<compiler-generated>)
7  MyApp +0x26728    thunk for @escaping @callee_guaranteed () -> () (<compiler-generated>)
8  libdispatch.dylib +0x3ea8      __dispatch_client_callout
9  libdispatch.dylib +0x13424     __dispatch_lane_barrier_sync_invoke_and_complete
10 MyApp +0x1405a4   OtlpClient.emitDoubleHistogram(name:value:attributes:) (OtlpClient.swift:92:9)

Hardware Model:     iPhone14,5
Role:               Foreground
OS Version:         iOS 17.2.1
Exception Type:     EXC_BAD_ACCESS 
Exception Subtype:  KERN_INVALID_ADDRESS

EXC_BAD_ACCESS: Attempted to dereference garbage pointer 0x8000000000000028.

0  OpenTelemetrySdk +0x7d0dc       specialized __RawDictionaryStorage.find<A>(_:) (<compiler-generated>)
1  OpenTelemetrySdk +0x77584       SynchronousMetricStorage.collect(resource:scope:startEpochNanos:epochNanos:) (SynchronousMetricStorage.swift:75:27)
2  OpenTelemetrySdk +0x77e1c       protocol witness for MetricStorage.collect(resource:scope:startEpochNanos:epochNanos:) in conformance SynchronousMetricStorage (<compiler-generated>)
3  OpenTelemetrySdk +0x74650       StableMeterSharedState.collectAll(registeredReader:meterProviderSharedState:epochNanos:) (StableMeterSharedState.swift:97:42)
4  OpenTelemetrySdk +0x6ddf8       StableMeterSdk.collectAll(registerReader:epochNanos:) (StableMeterSdk.swift:49:26)
5  OpenTelemetrySdk +0x6c890       StableMeterProviderSdk.LeasedMetricProducer.collectAllMetrics() (StableMeterProviderSdk.swift:118:49)
6  OpenTelemetrySdk +0x6c9b0       protocol witness for MetricProducer.collectAllMetrics() in conformance StableMeterProviderSdk.LeasedMetricProducer (<compiler-generated>)
7  OpenTelemetrySdk +0x611e8       StablePeriodicMetricReaderSdk.doRun() (StablePeriodicMetricReaderSdk.swift:60:40)
8  OpenTelemetrySdk +0x613c8       protocol witness for StableMetricReader.forceFlush() in conformance StablePeriodicMetricReaderSdk (<compiler-generated>)
9  OpenTelemetrySdk +0x6c6e0       closure #1 in StableMeterProviderSdk.forceFlush() (StableMeterProviderSdk.swift:91:37)

Basically, OtlpClient is a class that instruments metric emission. We are not emitting a lot, roughly 10~20 metrics (Counter/Histogram) at a given time, with a 10-second export interval. This crash accounts for 0.05% of sessions out of more than 10M, which is small but still consistent, and it can increase depending on how you use the SDK.
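
For context, emitDoubleHistogram is a thin wrapper around the SDK call. A simplified sketch is below; the class, queue, and property names are ours/illustrative, and the only SDK API used is `DoubleHistogram.record(value:attributes:)` from the traces above:

```swift
import Dispatch
import OpenTelemetryApi

// Simplified sketch of our wrapper, not SDK API. The real OtlpClient also sets
// up the meter provider, exporter, and counters.
final class OtlpClient {
    private let queue = DispatchQueue(label: "com.example.otlp-client")
    private let histograms: [String: DoubleHistogram]  // instruments built once at startup

    init(histograms: [String: DoubleHistogram]) {
        self.histograms = histograms
    }

    func emitDoubleHistogram(name: String, value: Double, attributes: [String: AttributeValue]) {
        queue.sync {
            // The crashes above happen below this frame, inside the SDK's
            // SynchronousMetricStorage.recordDouble(value:attributes:).
            self.histograms[name]?.record(value: value, attributes: attributes)
        }
    }
}
```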

It looks like the crash happens when SynchronousMetricStorage is collecting metrics while a new metric is being recorded.

Based on a look at the code, I suspect that recording and collecting/exporting happen at the same time, which results in the crash.

Can we safeguard it with a Lock held as a SynchronousMetricStorage property around collect/recordLong/recordDouble so that they don't interrupt each other? Any thoughts or suggestions?

jhoongo commented 4 months ago

I was thinking about something like this; I'm wondering if this would work:

```diff
diff --git a/Sources/OpenTelemetrySdk/Metrics/Stable/State/SynchronousMetricStorage.swift b/Sources/OpenTelemetrySdk/Metrics/Stable/State/SynchronousMetricStorage.swift
index 17f487a..1c6b484 100644
--- a/Sources/OpenTelemetrySdk/Metrics/Stable/State/SynchronousMetricStorage.swift
+++ b/Sources/OpenTelemetrySdk/Metrics/Stable/State/SynchronousMetricStorage.swift
@@ -20,6 +20,7 @@ public class SynchronousMetricStorage: SynchronousMetricStorageProtocol {
     var aggregatorHandles = [[String: AttributeValue]: AggregatorHandle]()
     let attributeProcessor: AttributeProcessor
     var aggregatorHandlePool = [AggregatorHandle]()
+    let storageLock = Lock()

     static func empty() -> SynchronousMetricStorageProtocol {
         return EmptyMetricStorage.instance
@@ -67,20 +68,30 @@ public class SynchronousMetricStorage: SynchronousMetricStorageProtocol {
     }

     public func collect(resource: Resource, scope: InstrumentationScopeInfo, startEpochNanos: UInt64, epochNanos: UInt64) -> StableMetricData {
+        storageLock.lock()
+        defer {
+            storageLock.unlock()
+        }
+
         let reset = aggregatorTemporality == .delta
         let start = reset ? registeredReader.lastCollectedEpochNanos : startEpochNanos

         var points = [PointData]()

+        let keys = aggregatorHandles.keys
         aggregatorHandles.forEach { key, value in
             let point = value.aggregateThenMaybeReset(startEpochNano: start, endEpochNano: epochNanos, attributes: key, reset: reset)
             if reset {
-                aggregatorHandles.removeValue(forKey: key)
                 aggregatorHandlePool.append(value)
             }
             points.append(point)
         }
-        
+        if reset {
+            for removeKey in keys {
+                aggregatorHandles.removeValue(forKey: removeKey)
+            }
+        }
+
         if points.isEmpty {
             return StableMetricData.empty
         }
@@ -92,6 +103,11 @@ public class SynchronousMetricStorage: SynchronousMetricStorageProtocol {
     }

     public func recordLong(value: Int, attributes: [String: OpenTelemetryApi.AttributeValue]) {
+        storageLock.lock()
+        defer {
+            storageLock.unlock()
+        }
+
         do {
             let handle = try getAggregatorHandle(attributes: attributes)
             handle.recordLong(value: value, attributes: attributes)
@@ -103,6 +119,11 @@ public class SynchronousMetricStorage: SynchronousMetricStorageProtocol {
     }

     public func recordDouble(value: Double, attributes: [String: OpenTelemetryApi.AttributeValue]) {
+        storageLock.lock()
+        defer {
+            storageLock.unlock()
+        }
+
         do {
             let handle = try getAggregatorHandle(attributes: attributes)
             handle.recordDouble(value: value, attributes: attributes)
```

nachoBonafonte commented 4 months ago

Hi James, do you have some sample code that reproduces this issue? It would be really useful for understanding why it is crashing.

jhoongo commented 4 months ago

@nachoBonafonte It wasn't really reproducible, as it was crashing in only 0.05% of the total sessions we had (more than 10M). The issue seems to be resolved by applying the patch pasted above; I'm not seeing any crashes with the change so far.

I think the crash is mainly caused by a race when SynchronousMetricStorage is adding a metric and exporting at the same time: the export path removes the collected aggregator handles, while the record path, on the same or a different thread, first looks up an aggregatorHandle that may already have been removed by the export.
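
A standalone sketch of the kind of race I mean is below. It is not SDK code, just two queues hammering a plain Dictionary the way collect and record do; running it under Thread Sanitizer flags the access, and without the sanitizer it can eventually crash inside the Dictionary internals the same way the traces above do:

```swift
import Foundation

// Standalone illustration of the suspected race, not SDK code: one queue removes
// dictionary entries (like collect() does on a delta reset) while another queue
// looks entries up and inserts them (like recordDouble() does).
var aggregatorHandles = ["latency": 0.0, "payload_size": 0.0]

let exportQueue = DispatchQueue(label: "export")   // stands in for the periodic collect
let recordQueue = DispatchQueue(label: "record")   // stands in for the instrument hot path

exportQueue.async {
    for _ in 0..<1_000_000 {
        aggregatorHandles.removeValue(forKey: "latency")
        aggregatorHandles["latency"] = 0.0
    }
}
recordQueue.async {
    for i in 0..<1_000_000 {
        _ = aggregatorHandles["latency"]          // lookup of the handle
        aggregatorHandles["latency"] = Double(i)  // record
    }
}

Thread.sleep(forTimeInterval: 3)  // keep the script alive long enough to hit the race
```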

Let me know what you think about the change.

nachoBonafonte commented 4 months ago

Are you calling forceFlush manually while still recording metrics, or is it being called by the framework at the end? If it is called at the end by the framework, then we should fix the framework to stop recording before calling forceFlush. If not, you should be more careful about calling it, as it should not be called without a real reason.

My concern about the change you propose is that it would hurt performance on all code paths.

jhoongo commented 4 months ago

> Are you calling forceFlush manually while still recording metrics

Yes, in a few limited cases I do call forceFlush manually, but I observed this thread issue even before I started using forceFlush. In addition, I'm managing a queue to perform metric emission/forceFlush, so it shouldn't be the case that we emit and forceFlush at the same time.
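
Roughly, the serialisation on our side looks like this (a simplified sketch; the class and queue names are ours, and StableMeterProviderSdk.forceFlush() is taken from the stack traces above):

```swift
import Foundation
import OpenTelemetrySdk

// Simplified sketch of how we serialise our own calls: every emit and every
// manual forceFlush goes through this one serial queue, so our calls never
// overlap with each other (the SDK's periodic reader still runs on its own).
final class TelemetryScheduler {
    private let queue = DispatchQueue(label: "com.example.telemetry")
    private let meterProvider: StableMeterProviderSdk

    init(meterProvider: StableMeterProviderSdk) {
        self.meterProvider = meterProvider
    }

    func flushNow() {
        queue.sync {
            _ = self.meterProvider.forceFlush()   // works whether forceFlush returns Void or a result
        }
    }
}
```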

It became more apparent and frequent with more metric emissions and occasional forceFlush calls. So are you saying that I shouldn't call forceFlush manually?

nachoBonafonte commented 4 months ago

Force flush is meant to be called when you finish recording, so that you don't have to wait for the next export cycle for the metrics to be exported.
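
In other words, something like the pattern below: record on the hot path, and flush once when recording is done. This is only a sketch; the types are the ones that appear in your stack traces, and the shutdown hook is just an example:

```swift
import OpenTelemetryApi
import OpenTelemetrySdk

// Sketch of the intended usage: record freely while the app runs, and call
// forceFlush only when you are done recording, instead of on every emission.
func finishRecording(histogram: DoubleHistogram, provider: StableMeterProviderSdk) {
    histogram.record(value: 12.5, attributes: ["phase": .string("shutdown")])
    _ = provider.forceFlush()   // drain the points recorded so far
}
```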

We should probably protect force flush better and prevent recording from happening while flushing, but it would be great to validate whether your issue is limited to that case or shows up in others as well.

Could you run your code with Thread Sanitizer and check where a concurrent access happens? Also try removing the force flushes and check whether the warnings stop.

jhoongo commented 4 months ago

@nachoBonafonte It's a bit difficult to test this deeply with Thread Sanitizer, as we built an internal library wrapping the OTel SDK that is consumed by multiple apps.

I'm 99% sure that the crash still happens even without calling forceFlush frequently, and looking at the code, the periodic exporter uses the same method as forceFlush to export metrics.

Looking at the Java implementation of SynchronousMetricStorage (link), it looks like they do implement it in a way that keeps the collect/record operations synchronized. I understand that the Swift SDK is still experimental, but I believe it needs a similar approach to coordinate record and collect operations. Perhaps not exactly the diff posted in this comment, but a more proper implementation along the same lines?
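
For illustration, here is the kind of guard I have in mind, reduced to a tiny standalone type (the name, the NSLock choice, and the API shape are mine, not a concrete proposal for the SDK):

```swift
import Foundation

// Illustrative only: a handle map that makes "get or create a handle" and
// "snapshot and optionally reset" atomic with respect to each other, so a
// collect pass never iterates a map that a record call is mutating.
final class GuardedHandleMap<Key: Hashable, Handle> {
    private var handles = [Key: Handle]()
    private let lock = NSLock()

    func handle(for key: Key, makeHandle: () -> Handle) -> Handle {
        lock.lock(); defer { lock.unlock() }
        if let existing = handles[key] { return existing }
        let created = makeHandle()
        handles[key] = created
        return created
    }

    func drain(reset: Bool) -> [Key: Handle] {
        lock.lock(); defer { lock.unlock() }
        let snapshot = handles        // value copy; safe to iterate outside the lock
        if reset { handles.removeAll() }
        return snapshot
    }
}
```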

Any thoughts?

nachoBonafonte commented 3 months ago

Could you try the changes in the linked PR? I think they should fix your crashes.