microsoft / cpp_client_telemetry

1DS C++ SDK
Apache License 2.0
85 stars 48 forks source link

ILogConfiguration operator[] causes a crash in multiple instances. #706

Closed agavipul closed 3 years ago

agavipul commented 3 years ago

Describe your environment. Crash is observed in Teams iOS Clients in SDK version 3.4.295. Crash is in operator[] method of ILogConfiguration

Steps to reproduce. The issue is when flushAndTeardown is called, it sets LogConfiguration instance to nullptr. Now, in parallel, if operator[] method of LogConfiguration is called on current LogConfiguration on another thread, it causes a crash.

UPD [MG]: discussed below that this hypothesis does not make much sense, since LogConfiguration instance is never nullptr. No reproducible steps given to investigate this from C++ code perspective. Please make sure the code does not modify the ILogConfiguration map after SDK initialization and before the LogManager is torn down.

What is the expected behavior? operator[] method should return correct value or null(if not present) without causing any crash.

What is the actual behavior? operator[] method tries to access a nullptr object and causes a NullPointerException and crashes.

Additional context.

Please find the stack trace for multiple crashes

Screenshot 2020-10-22 at 9 35 20 PM

Screen Shot 2020-10-25 at 5 30 38 PM Screen Shot 2020-10-25 at 5 32 53 PM

maxgolov commented 3 years ago

Now, in parallel, if operator[] method of LogConfiguration is called on current LogConfiguration on another thread..

Could you please clarify the flow: if you are accessing ILogConfiguration object in your code, i.e. if it's your code doing it in parallel without sync-waiting for FlushAndTeardown completion? This is going to be invalid usage of SDK.

The issue is when flushAndTeardown is called, it sets LogConfiguration instance to nullptr.

Some of these statements do not make sense to me from C++ SDK perspective. C++ core code does not set the LogConfiguration to nullptr, i.e. it does not destroy / invalidate the ILogConfiguration object per se. We actually have tests that repeatedly reuse the same ILogConfiguration object.

If this is an issue in the Obj-C projection for Mobile only (not applicable to core C++ code), then my team does not provide support for that part. Your best approach is to send in a PR to fix it.

maxgolov commented 3 years ago

Default ILogConfiguration in C++ core - singleton - LogManagerBase.hpp template, is a global static object: https://github.com/microsoft/cpp_client_telemetry/blob/1efb5dac6b8696a05e2e28cd0726b30c26552598/lib/include/public/LogManagerBase.hpp#L194

When you say it sets LogConfiguration instance to nullptr - could you please clarify on this, what instance you mean exactly? If it's the projection layer in Obj-C that supplies a custom reference to configuration, then it is the outside code that has to ensure that the object that you pass by reference is not being concurrently modified during Initialize and FlushAndTeardown API calls; and that it remains valid while the method are still running, i.e. you cannot pass a ref to temporary, for example, and run code async that may execute when the temporary is gone.

maxgolov commented 3 years ago

@agavipul - I'm assigning this to you, since you added this code and introduced the +(nullable ILogger *)initializeLogManager:(nonnull NSString *)tenantToken withConfig:(nullable NSDictionary *)config method recently.

Several issues here with the implementation of the method:

auto& config = LogManager::GetLogConfiguration()

Then copy over your Obj-C layer dictionary into that config. This ensures that your settings are stored permanently in config ( inside the LogManager::currentConfig ), and that that outlives a single concrete LogManager instance. You do not need a temporary for that. You'd assign your newly passed settings straight to config. That should resolve the issues that you are seeing.

agavipul commented 3 years ago

Hey @maxgolov, I temporaily allocate the config on stack but when passed to static ILogger* Initialize(const std::string& tenantToken, ILogConfiguration& configuration) method of LogManagerBase, it actually makes changes to persistent static object only. So I don't think this is the issue here.

Also, can you check the trace and see what might be the possible reason for exception here?

Also to reference to the changes I made, this issue exist before this PR itself as pointed by OneDrive folks.

maxgolov commented 3 years ago

@agavipul - please provide a repro for common C++ core code. Temporary in Obj-C does not make practical sense and looks highly suspicious. My team does not offer support for Obj-C code. You may discuss this with @kindbe or @eduardo-camacho .

The issue is when flushAndTeardown is called, it sets LogConfiguration instance to nullptr.

Please explain / elaborate on this one. As I mentioned, this statement does not make any sense from C++ core code perspective.

maxgolov commented 3 years ago

We do stress-test this re-Initialization scenario in C++ functional tests here: https://github.com/microsoft/cpp_client_telemetry/blob/1efb5dac6b8696a05e2e28cd0726b30c26552598/tests/functests/APITest.cpp#L992

This is not the only unit / functional test where we test re-Initialization.

agavipul commented 3 years ago

But @maxgolov , let aside my hypothesis of nullptr of ILogConfiguration, if we look at the stack trace of crash, it does come from SDK accessing operator[] method of LogConfiguration internally. If you look, it is always crashing when it tries to get access to "maxBlobSize" property of the LogConfiguration. Below are the screenshots of places from which crash is emerging. It tries to fetch maxBlobSize and crashes.

Screenshot 2020-10-26 at 10 16 24 AM Screenshot 2020-10-26 at 10 16 14 AM
agavipul commented 3 years ago

And Teams doesn't update/tinker with this property of SDK. If there was some issue with the temporary stack allocation of LogConfig, then crashes should happen while accessing other props as well.

maxgolov commented 3 years ago

let aside my hypothesis of nullptr of ILogConfiguration

Yes, let's totally forget about it - because that is not the right hypothesis.

maxgolov commented 3 years ago

How I would suggest you to debug it in Obj-C wrapper (which my team does not support) :

  1. Avoid the temporary here:
        ILogConfiguration logManagerConfig;
  1. Avoid this assignment here:

        auto& defaultConfig = LogManager::GetLogConfiguration();
        logManagerConfig = defaultConfig;
  2. Check what you get in jsonString here:

https://github.com/microsoft/cpp_client_telemetry/blob/1efb5dac6b8696a05e2e28cd0726b30c26552598/wrappers/obj-c/ODWLogManager.mm#L90

And when you assign (below), does your JSON configuration contain the required tpm section? How does it work if you accept a dictionary, but then use this as a deep-nested JSON object? SDK requires a deep-nested JSON object for some parameters.

You can find example of valid JSON config here: https://github.com/microsoft/cpp_client_telemetry/blob/master/examples/cpp/EventSender/direct.config esp. the tpm section.

                NSString *jsonString = [[NSString alloc] initWithData:jsonData encoding:NSUTF8StringEncoding];
                logManagerConfig = MAT::FromJSON([jsonString UTF8String]);

This looks like the Obj-C wrapper code may be permanently damaging / destroying the tpm settings map, or not re-creating the necessary maxBlobSize attribute? Please verify this. In general, you can avoid the unnecessary copies back and forth, if you operate on the original reference to ILogConfiguration , reference to that config instance that statically lives in the singleton.

  1. Make sure you are not concurrently modifying the ILogConfiguration while the FlushAndTeardown is running. Currently the map is not thread-safe for writes / modifications. You cannot modify it concurrently with another thread still using it. You set it up before Initialize, you wait for FlushAndTeardown, then you may tweak it after FlushAndTeardown and before you Initialize again for the next LogManager session.

C++ core repro would be nice. But I think the problem is unique to Obj-C wrapper at this point. Just to reiterate on my team support model mentioned in README.md - Only C and C++ API surface is supported - any other language wrappers are provided as-is, and not supported... .. I think the issue you are facing is due to series of unfortunate events in the Obj-C wrapper. Please debug it on your end. OR - if you have a C++ repro, we can dissect that!

agavipul commented 3 years ago

@maxgolov I did some debugging on my end. Following are the things I maintained across my debugging: I avoided using re-init of logger i.e. only initialised the logger once and did not use flushAndTeardown. I made sure to keep using the default LogConfiguration.

Still I observed the crash

I put the debug point on the m_config object to check whether it contains the keys as expected or not. Its data is correct. Also, it works fine for multiple different calls. Then somehow, its value is updated and it starts to fail. We are not updating anything on our end.

Screenshot 2020-10-26 at 2 21 44 PM
maxgolov commented 3 years ago

Please provide a trivial reproducible example in C++.

Just to be clear - my team (Azure Platform) does not support Obj-C projection nor Mobile platforms. We also don't assist debugging integration problems in products. Please add a trivial reproducible example in C++ Unit or Functional test that illustrates the problem (no Swift, no Obj-C please).

If your app worked with a previous version of SDK, please bisect to identify the last one that worked, then debug top to bottom. Useful link on debug strategies: https://topologyeyewear.github.io/engineering-blog/2018/03/07/heap_corruption/

maxgolov commented 3 years ago

@kindbe @eduardo-camacho @mkoscumb to help with Obj-C layer.

mkoscumb commented 3 years ago

Discussed this with Vipul on a call yesterday. The issue isn't anywhere in the SDK (C++ or Obj-C wrappers) rather it's in the caller having a background thread running as the globals are being torn down.

maxgolov commented 3 years ago

@mkoscumb @agavipul - this appears like invalid use of SDK, not calling into FlushAndTeardown ? Can this issue be closed?

mkoscumb commented 3 years ago

Closing as we've got a resolution at this time.