Possible synchronization issue in schema_locator

wmatw commented 5 years ago

Hello, in the file schema_locator.hpp line 145 about the following function:

inline const PTRACE_EVENT_INFO schema_locator::get_event_schema(const EVENT_RECORD& record)
    {
        // check the cache
        auto key = schema_key(record);
        auto& buffer = cache_[key];

        // return if there's a cache hit
        if(buffer) return (PTRACE_EVENT_INFO)(&buffer[0]);

        auto temp = get_event_schema_from_tdh(record);

        // multiple threads may end up trying to save their
        // temp objects to the cache so we need to lock
        // so only the first entry should be cached.

        if(!buffer)
        {
            scope_lock l(sync_);
            if (!buffer) buffer.swap(temp);
        }

        return (PTRACE_EVENT_INFO)(&buffer[0]);
    }

The access to the unorderedmap is not protected by synchronization mechanism at `auto& buffer = cache[key];` an insertion can occur and that can result in a rehashing of the container, then all iterators will be invalidated. If another thread is reading the container at the same exact moment, the behavior will be undefined. I think that using a "singleton double checked locking pattern" for insertion is not enough, a full locking is required to completely eliminate the threat... Mathieu

zacbrown commented 5 years ago

Hi Mathieu, ? krabsetw runs entirely in a single thread. Do you have a repro that demonstrates this race condition? I’m not sure how it would be possible to encounter given that schema_locator is only expected to be used in the ETW callback which is single threaded per trace session.

Multiple trace sessions don’t share the schema locator and must build individual caches.

Thanks,

Zac

wmatw commented 5 years ago

I think the code is not working as you expected: See schema_locator.hpp line 137

private:
        static schema_locator singleton_;

if you are creating multiple sessions on different threads (classic when you are using both kernel and user providers by example - see example: kernel_and_user_trace_001.cpp ) every ETW callback called on a different thread will access the same singleton. If we can remove the dangerous (and unnecessary, I think) reset function from this class, we can safely change the storage class of the cache to be static thread_local, in that case, we can remove the ugly singleton, have a real lock free cache and BONUS the OS will deallocate the memory when the thread is gone (no need to have the reset function anyway). If you support my idea I can make a PR ... Mathieu

zacbrown commented 5 years ago

Ah I see now. I had thought that cache was thread_local.

Removing reset would be an API breaking change that I'm not sure we should do. We should leave it as is for the time being.

That said, adding thread_local should not be a problem given that I'd estimate the vast majority of consumers are not listening to the same ETW provider on multiple threads.

@swannman - What do you think? If we add thread_local and fix up the reset function, we can remove the locking in this class. From consumer side, this behavior doesn't change. It introduces, in a corner case, some increased memory usage.

That will improve thread local performance and since most consumers won't be listening on multiple threads to the same provider, there was never any meaningful lock contention to begin with I don't think?

swannman commented 5 years ago

I support this change.

I’m okay with removing reset() if that’s what we would have done if we had spotted this originally. If we no longer need it but we want to preserve API compatibility then we could make it a no-op and update the code comments to that effect.

jdu2600 commented 5 years ago

I have a project which uses your awesome .NET wrapper. I just refactored it to use ~15 threads each running a UserTrace for a different provider - a couple of which are quite high volume.

And I started getting System.AccessViolationException exceptions. The stack trace points to krabs::schema_locator::get_event_schema(_EVENT_RECORD& record) so I'm guessing that I've hit this issue?

swannman commented 5 years ago

@jdu2600 are you using the same krabsetw instance to consume both the kernel debug trace provider and user-mode ETW providers?

jdu2600 commented 5 years ago

Hi Matt,

I'm not using the kernel provider - at least not directly.

I originally had 1 UserTrace() with 15 Provider() and everything worked smoothly - but I was dropping events due to the high volume.

So I tried having 15 UserTrace() each with its own thread in my process and with ~1 Provider() each. I am subscribing to providers such as Microsoft-Windows-Kernel-Process, but as far as I know this isn't the "NT Kernel Logger" kernel provider but some kind of convenience (...that I'm using without fully understanding).

John

jdu2600 commented 4 years ago

Just bumped into this one again after enabling a high volume kernel provider. :-( Two threads this time - 1 UserTrace() and 1 KernelTrace()

System.AccessViolationException HResult=0x80004003 Message=Attempted to read or write protected memory. This is often an indication that other memory is corrupt. Source=Microsoft.O365.Security.Native.ETW StackTrace: at std._Hash ... in c:\program files (x86)\microsoft visual studio\2017\enterprise\vc\tools\msvc\14.16.27023\include\list:line 72 at std._Hash ... in c:\program files (x86)\microsoft visual studio\2017\enterprise\vc\tools\msvc\14.16.27023\include\xhash:line 301
at std.unordered_map ... in c:\program files (x86)\microsoft visual studio\2017\enterprise\vc\tools\msvc\14.16.27023\include\tuple:line 236 at krabs.schema_locator.get_event_schema(schema_locator , _EVENT_RECORD record) in d:\a\1\s\krabs\krabs\schema_locator.hpp:line 166 at Microsoft.O365.Security.ETW.KernelProvider.EventNotification(_EVENT_RECORD* A_0) in d:\a\1\s\krabs\krabs\schema.hpp:line 203

swannman commented 4 years ago

Ack, let's dig into this once we've merged your outstanding PRs.

swannman commented 4 years ago

@wmatw were you still interested in creating a PR for your suggested change?

jdu2600 commented 4 years ago

I had a quick look into this last night - and noticed a potential hiccup on the managed side.

error C2483: 'krabs::schemalocator::singleton': object with destructor or non-aggregate type cannot be declared with thread storage duration in managed code

wmatw commented 4 years ago

@wmatw were you still interested in creating a PR for your suggested change?

I am really sorry but I don't have a time frame to contribute.

swannman commented 4 years ago

Resolved by #96

microsoft / krabsetw

Possible synchronization issue in schema_locator #61