Closed wmatw closed 4 years ago
Hi Mathieu, krabsetw runs entirely in a single thread. Do you have a repro that demonstrates this race condition? I’m not sure how it could be encountered, given that schema_locator is only expected to be used in the ETW callback, which is single-threaded per trace session.
Multiple trace sessions don’t share the schema locator and must build individual caches.
Thanks,
Zac
I think the code is not working as you expected. See schema_locator.hpp, line 137:
private:
static schema_locator singleton_;
If you are creating multiple sessions on different threads (a classic setup when you use both kernel and user providers; see the example kernel_and_user_trace_001.cpp), every ETW callback invoked on a different thread will access the same singleton.
If we can remove the dangerous (and, I think, unnecessary) reset function from this class, we can safely change the storage class of the cache to `static thread_local`. In that case we can remove the ugly singleton and have a truly lock-free cache. As a bonus, the memory will be deallocated when the thread is gone, so there is no need for the reset function anyway.
If you support my idea I can make a PR ...
Mathieu
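For illustration, the `static thread_local` idea above could look roughly like the following sketch. It is not the krabsetw code: `schema_key`, `schema_blob`, `local_cache`, `get_schema` and `entries_seen_by_worker` are hypothetical stand-ins, and the "schema" is just a string built from the key.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <unordered_map>

// Hypothetical stand-ins; the real key and value types in krabsetw differ.
using schema_key = int;
using schema_blob = std::string;

// One cache per thread. No other thread can reach it, so no lock is needed,
// and it is destroyed automatically when its thread exits.
std::unordered_map<schema_key, schema_blob>& local_cache()
{
    thread_local std::unordered_map<schema_key, schema_blob> cache;
    return cache;
}

// Look up a schema, building and caching it on first use.
const schema_blob& get_schema(schema_key key)
{
    auto& cache = local_cache();
    auto it = cache.find(key);
    if (it == cache.end()) {
        // Placeholder for the real (expensive) schema lookup.
        it = cache.emplace(key, "schema-" + std::to_string(key)).first;
    }
    assert(it != cache.end());
    return it->second;
}

// Performs two lookups on a second thread and reports how many entries that
// thread's cache held. The calling thread's cache is not affected.
std::size_t entries_seen_by_worker()
{
    std::size_t worker_entries = 0;
    std::thread worker([&worker_entries] {
        get_schema(1);
        get_schema(2);
        worker_entries = local_cache().size();
    });
    worker.join();
    return worker_entries;
}
```

The trade-off is the one discussed in this thread: each thread builds its own copy of the cache, so memory use grows with the number of consuming threads, but lookups never contend on a lock.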
Ah, I see now. I had thought that cache was `thread_local`.
Removing `reset` would be an API-breaking change that I'm not sure we should make. We should leave it as is for the time being.
That said, adding thread_local should not be a problem given that I'd estimate the vast majority of consumers are not listening to the same ETW provider on multiple threads.
@swannman - What do you think? If we add `thread_local` and fix up the `reset` function, we can remove the locking in this class. From the consumer side, the behavior doesn't change. In a corner case, it introduces some increased memory usage.
That will improve thread-local performance, and since most consumers won't be listening on multiple threads to the same provider, I don't think there was ever any meaningful lock contention to begin with.
I support this change.
I’m okay with removing `reset()` if that’s what we would have done if we had spotted this originally. If we no longer need it but we want to preserve API compatibility, then we could make it a no-op and update the code comments to that effect.
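A minimal sketch of the "keep `reset()` as a no-op" option, assuming a per-thread cache. The class name `schema_cache_sketch` and its members are hypothetical and only illustrate the shape of the change, not the actual krabsetw API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical class, used only to illustrate the idea.
class schema_cache_sketch {
public:
    // Returns the cached value for key, creating it on first use.
    static const std::string& get(int key)
    {
        auto& cache = local();
        auto it = cache.find(key);
        if (it == cache.end()) {
            // Placeholder for the real schema lookup.
            it = cache.emplace(key, "schema-" + std::to_string(key)).first;
        }
        assert(it != cache.end());
        return it->second;
    }

    // Number of entries cached on the calling thread.
    static std::size_t size() { return local().size(); }

    // Retained for source compatibility only. Each thread's cache is
    // destroyed when that thread exits, so this intentionally does nothing.
    static void reset() {}

private:
    static std::unordered_map<int, std::string>& local()
    {
        thread_local std::unordered_map<int, std::string> cache;
        return cache;
    }
};
```

Existing callers of `reset()` keep compiling, and calling it leaves the cache untouched.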
I have a project which uses your awesome .NET wrapper. I just refactored it to use ~15 threads each running a UserTrace for a different provider - a couple of which are quite high volume.
And I started getting System.AccessViolationException exceptions. The stack trace points to krabs::schema_locator::get_event_schema(_EVENT_RECORD& record) so I'm guessing that I've hit this issue?
@jdu2600 are you using the same krabsetw instance to consume both the kernel debug trace provider and user-mode ETW providers?
Hi Matt,
I'm not using the kernel provider - at least not directly.
I originally had 1 UserTrace() with 15 Provider() and everything worked smoothly - but I was dropping events due to the high volume.
So I tried having 15 UserTrace() each with its own thread in my process and with ~1 Provider() each. I am subscribing to providers such as Microsoft-Windows-Kernel-Process, but as far as I know this isn't the "NT Kernel Logger" kernel provider but some kind of convenience (...that I'm using without fully understanding).
Just bumped into this one again after enabling a high volume kernel provider. :-( Two threads this time - 1 UserTrace() and 1 KernelTrace()
System.AccessViolationException
HResult=0x80004003
Message=Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Source=Microsoft.O365.Security.Native.ETW
StackTrace:
   at std._Hash ... in c:\program files (x86)\microsoft visual studio\2017\enterprise\vc\tools\msvc\14.16.27023\include\list:line 72
   at std._Hash ... in c:\program files (x86)\microsoft visual studio\2017\enterprise\vc\tools\msvc\14.16.27023\include\xhash:line 301
   at std.unordered_map ... in c:\program files (x86)\microsoft visual studio\2017\enterprise\vc\tools\msvc\14.16.27023\include\tuple:line 236
   at krabs.schema_locator.get_event_schema(schema_locator , _EVENT_RECORD record) in d:\a\1\s\krabs\krabs\schema_locator.hpp:line 166
   at Microsoft.O365.Security.ETW.KernelProvider.EventNotification(_EVENT_RECORD* A_0) in d:\a\1\s\krabs\krabs\schema.hpp:line 203
Ack, let's dig into this once we've merged your outstanding PRs.
@wmatw were you still interested in creating a PR for your suggested change?
I had a quick look into this last night - and noticed a potential hiccup on the managed side.
error C2483: 'krabs::schema_locator::singleton_': object with destructor or non-aggregate type cannot be declared with thread storage duration in managed code
> @wmatw were you still interested in creating a PR for your suggested change?
I am really sorry but I don't have a time frame to contribute.
Resolved by #96
Hello,
In the file schema_locator.hpp, line 145, about the following function:
The access to the unordered_map is not protected by a synchronization mechanism. At `auto& buffer = cache[key];` an insertion can occur, which can trigger a rehash of the container and invalidate all iterators. If another thread is reading the container at that exact moment, the behavior is undefined. I think that using a "singleton double-checked locking pattern" for insertion is not enough; full locking is required to completely eliminate the threat.
Mathieu
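As a rough illustration of what "full locking" means here, the sketch below holds one mutex for both the lookup and the insertion. It is not the krabsetw implementation: `locked_schema_cache` and `hammer_cache` are hypothetical names, and the cached value is a placeholder string.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical cache: every read and write of the map happens under mutex_.
class locked_schema_cache {
public:
    // Returns a copy so callers do not depend on the map's contents after
    // the lock is released (for example if entries were later erased).
    std::string get(int key)
    {
        std::lock_guard<std::mutex> guard(mutex_);
        auto& value = cache_[key];  // may insert and rehash; safe under the lock
        if (value.empty()) {
            // Placeholder for the real schema lookup.
            value = "schema-" + std::to_string(key);
        }
        assert(!value.empty());
        return value;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> guard(mutex_);
        return cache_.size();
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<int, std::string> cache_;
};

// Fills one shared cache from several threads using disjoint key ranges and
// returns the final number of entries.
std::size_t hammer_cache(int threads, int keys_per_thread)
{
    locked_schema_cache cache;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&cache, t, keys_per_thread] {
            for (int k = 0; k < keys_per_thread; ++k) {
                cache.get(t * keys_per_thread + k);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return cache.size();
}
```

Taking the lock on the read path is the point: a double-checked pattern that reads the map without the lock can still race with an insertion that rehashes it.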