Closed agavipul closed 3 years ago
Now, in parallel, if operator[] method of LogConfiguration is called on current LogConfiguration on another thread..
Could you please clarify the flow: if you are accessing ILogConfiguration
object in your code, i.e. if it's your code doing it in parallel without sync-waiting for FlushAndTeardown
completion? This is going to be invalid usage of SDK.
The issue is when flushAndTeardown is called, it sets LogConfiguration instance to nullptr.
Some of these statements do not make sense to me from C++ SDK perspective. C++ core code does not set the LogConfiguration
to nullptr, i.e. it does not destroy / invalidate the ILogConfiguration
object per se. We actually have tests that repeatedly reuse the same ILogConfiguration
object.
If this is an issue in the Obj-C projection for Mobile only (not applicable to core C++ code), then my team does not provide support for that part. Your best approach is to send in a PR to fix it.
Default ILogConfiguration
in C++ core - singleton - LogManagerBase.hpp
template, is a global static object:
https://github.com/microsoft/cpp_client_telemetry/blob/1efb5dac6b8696a05e2e28cd0726b30c26552598/lib/include/public/LogManagerBase.hpp#L194
When you say it sets LogConfiguration instance to nullptr
- could you please clarify on this, what instance
you mean exactly? If it's the projection layer in Obj-C that supplies a custom reference to configuration, then it is the outside code that has to ensure that the object that you pass by reference is not being concurrently modified during Initialize
and FlushAndTeardown
API calls; and that it remains valid while the method are still running, i.e. you cannot pass a ref to temporary, for example, and run code async that may execute when the temporary is gone.
@agavipul - I'm assigning this to you, since you added this code and introduced the +(nullable ILogger *)initializeLogManager:(nonnull NSString *)tenantToken withConfig:(nullable NSDictionary *)config
method recently.
Several issues here with the implementation of the method:
ILogConfiguration logManagerConfig;
- appears to be temporary, allocated on stack? This is simply not right.
logManagerConfig = defaultConfig
- not sure what is the purpose of this assignment. You should actually get a reference to persistent static object (that is not dying with an instance of LogManager
) as follows:
auto& config = LogManager::GetLogConfiguration()
Then copy over your Obj-C layer dictionary into that config
. This ensures that your settings are stored permanently in config
( inside the LogManager::currentConfig
), and that that outlives a single concrete LogManager
instance. You do not need a temporary for that. You'd assign your newly passed settings straight to config
. That should resolve the issues that you are seeing.
Hey @maxgolov, I temporaily allocate the config on stack but when passed to
static ILogger* Initialize(const std::string& tenantToken, ILogConfiguration& configuration)
method of LogManagerBase, it actually makes changes to persistent static object only. So I don't think this is the issue here.
Also, can you check the trace and see what might be the possible reason for exception here?
Also to reference to the changes I made, this issue exist before this PR itself as pointed by OneDrive folks.
@agavipul - please provide a repro for common C++ core code. Temporary in Obj-C does not make practical sense and looks highly suspicious. My team does not offer support for Obj-C code. You may discuss this with @kindbe or @eduardo-camacho .
The issue is when flushAndTeardown is called, it sets LogConfiguration instance to nullptr.
Please explain / elaborate on this one. As I mentioned, this statement does not make any sense from C++ core code perspective.
We do stress-test this re-Initialization scenario in C++ functional tests here: https://github.com/microsoft/cpp_client_telemetry/blob/1efb5dac6b8696a05e2e28cd0726b30c26552598/tests/functests/APITest.cpp#L992
This is not the only unit / functional test where we test re-Initialization.
But @maxgolov , let aside my hypothesis of nullptr of ILogConfiguration, if we look at the stack trace of crash, it does come from SDK accessing operator[] method of LogConfiguration internally. If you look, it is always crashing when it tries to get access to "maxBlobSize" property of the LogConfiguration. Below are the screenshots of places from which crash is emerging. It tries to fetch maxBlobSize and crashes.
And Teams doesn't update/tinker with this property of SDK. If there was some issue with the temporary stack allocation of LogConfig, then crashes should happen while accessing other props as well.
let aside my hypothesis of nullptr of ILogConfiguration
Yes, let's totally forget about it - because that is not the right hypothesis.
How I would suggest you to debug it in Obj-C wrapper (which my team does not support) :
ILogConfiguration logManagerConfig;
Avoid this assignment here:
auto& defaultConfig = LogManager::GetLogConfiguration();
logManagerConfig = defaultConfig;
Check what you get in jsonString
here:
And when you assign (below), does your JSON configuration contain the required tpm
section? How does it work if you accept a dictionary, but then use this as a deep-nested JSON object? SDK requires a deep-nested JSON object for some parameters.
You can find example of valid JSON config here: https://github.com/microsoft/cpp_client_telemetry/blob/master/examples/cpp/EventSender/direct.config esp. the tpm
section.
NSString *jsonString = [[NSString alloc] initWithData:jsonData encoding:NSUTF8StringEncoding];
logManagerConfig = MAT::FromJSON([jsonString UTF8String]);
This looks like the Obj-C wrapper code may be permanently damaging / destroying the tpm
settings map, or not re-creating the necessary maxBlobSize
attribute? Please verify this. In general, you can avoid the unnecessary copies back and forth, if you operate on the original reference to ILogConfiguration
, reference to that config instance that statically lives in the singleton.
ILogConfiguration
while the FlushAndTeardown
is running. Currently the map is not thread-safe for writes / modifications. You cannot modify it concurrently with another thread still using it. You set it up before Initialize
, you wait for FlushAndTeardown
, then you may tweak it after FlushAndTeardown
and before you Initialize
again for the next LogManager session.C++ core repro would be nice. But I think the problem is unique to Obj-C wrapper at this point. Just to reiterate on my team support model mentioned in README.md
- Only C and C++ API surface is supported - any other language wrappers are provided as-is, and not supported... .. I think the issue you are facing is due to series of unfortunate events in the Obj-C wrapper. Please debug it on your end. OR - if you have a C++ repro, we can dissect that!
@maxgolov I did some debugging on my end. Following are the things I maintained across my debugging: I avoided using re-init of logger i.e. only initialised the logger once and did not use flushAndTeardown. I made sure to keep using the default LogConfiguration.
Still I observed the crash
I put the debug point on the m_config object to check whether it contains the keys as expected or not. Its data is correct. Also, it works fine for multiple different calls. Then somehow, its value is updated and it starts to fail. We are not updating anything on our end.
Please provide a trivial reproducible example in C++.
Just to be clear - my team (Azure Platform) does not support Obj-C projection nor Mobile platforms. We also don't assist debugging integration problems in products. Please add a trivial reproducible example in C++ Unit or Functional test that illustrates the problem (no Swift, no Obj-C please).
If your app worked with a previous version of SDK, please bisect to identify the last one that worked, then debug top to bottom. Useful link on debug strategies: https://topologyeyewear.github.io/engineering-blog/2018/03/07/heap_corruption/
@kindbe @eduardo-camacho @mkoscumb to help with Obj-C layer.
Discussed this with Vipul on a call yesterday. The issue isn't anywhere in the SDK (C++ or Obj-C wrappers) rather it's in the caller having a background thread running as the globals are being torn down.
@mkoscumb @agavipul - this appears like invalid use of SDK, not calling into FlushAndTeardown
? Can this issue be closed?
Closing as we've got a resolution at this time.
Describe your environment. Crash is observed in Teams iOS Clients in SDK version 3.4.295. Crash is in operator[] method of ILogConfiguration
Steps to reproduce.
The issue is when flushAndTeardown is called, it sets LogConfiguration instance to nullptr. Now, in parallel, if operator[] method of LogConfiguration is called on current LogConfiguration on another thread, it causes a crash.UPD [MG]: discussed below that this hypothesis does not make much sense, since LogConfiguration instance is never nullptr. No reproducible steps given to investigate this from C++ code perspective. Please make sure the code does not modify the ILogConfiguration map after SDK initialization and before the LogManager is torn down.
What is the expected behavior? operator[] method should return correct value or null(if not present) without causing any crash.
What is the actual behavior? operator[] method tries to access a nullptr object and causes a NullPointerException and crashes.
Additional context.
Please find the stack trace for multiple crashes