mandiant / STrace

A DTrace on Windows Reimplementation
MIT License
328 stars 41 forks source link

Add ETW event publishing #15

Closed kirbyUK closed 1 year ago

kirbyUK commented 1 year ago

Initial code to add ETW logging for plugins, as described in #3 here.

Here's a screenshot showing some ETW events logged from a plugin.

etw-log-screenshot

Design considerations

I have tried to keep the interface as close as possible to the existing DTrace ETW one, but doing so has caused a couple of design issues I'd like to talk over.

The first is that I implemented the variadic function using C++ templates, but they're not really suitable at all. Since the APIs object is global, and made in the driver, we can only define one log function with all the specific parameters, and all plugins have to call that. That is, if you wanted different parameters, you'd have to recompile and reinstall the driver. I think for this we're going to have to use C-style variadic functions, but that will diverge from the linked example at least slightly because we'll need some kind of number-of-fields parameter. I don't really see any way around it unless you can think of something else, however, or unless we stray much more from the existing DTrace style (e.g. use classes?).

Secondly, the code spins up the event - including all of the memory allocations needed, fires the event, then tears it all down again. It would definitely be more performant to not do this, but I'm not too sure what we'd cache providers on, or where. I haven't benchmarked how much of a difference it would make, it might not really be an issue.

As it stands at the moment, I have a recreation of the Microsoft example as a plugin. Give it a go and let me know what you think.

google-cla[bot] commented 1 year ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

stevemk14ebr commented 1 year ago

This is simply put, excellent. Thank you so much for continuing your original PR and completing this!

I agree with you that templates are not suitable for this, we can probably port this to a printf style interface without too much trouble, at least at the interface boundary that Dlls call this must be done - we perhaps can keep the templates for child calls. As far as the caching, we could make a global vector that contains a structure that holds all of the fields needed to register a new provider. The user may be required to do a one time call to 'setupEtwTrace' or something like that with the necessary details which would dispatch to the driver where all the ETW structures and allocations occur, then the results are saved to said structure, and the structure finally saved into the global vector. Users could register multiple providers this way and refer to them later if necessary inside of the ETW trace call. All plugin Dll would therefore just need 'setupEtwTrace' and 'EtwTrace' calls to be exposed.

I should have some time to review this in depth within the next few weeks. Please sign the CLA when possible :)

stevemk14ebr commented 1 year ago

Thanks for the continued work! Let me know when you're ready for a second review 😄

kirbyUK commented 1 year ago

I've got the two big changes - the C-style variadic functions and the caching - down now. I've still got a little bit more to do, but feel free to start having a look whenever you can if you'd like :)

I've had to use a few patterns I'm not a huge fan of for the caching, namely constructing objects in initialise functions because I can't throw exceptions, and using a destruct function because I can't use any destructors. This looks like it's just standard restrictions for a kernel driver if I'm not mistaken? I think given the circumstances they're fine, let me know if you have any thoughts.

stevemk14ebr commented 1 year ago

Yea you can't use destructors for global class objects because destructors of this type rely on the atexit registration performed inside of a C++ static initializer list which gets invoked by the runtime at process startup (for usermode binaries). We don't have a CRT so this would break. You can use destructors for stack or heap allocated object still though, as the destructor call is inserted inline by the compiler on object deletion/scope lifetime end.

I believe you can however use constructors still. You're correct exceptions are a no-no but you could initialize each pointer to nullptr, and if an allocation fails just ignore that failure. Each member function that uses a pointer that is set in the constructor would first branch to exit if any required pointer is null, basically defer constructor errors until later. I don't like this pattern at all though, you'll see I mostly use the initializer member function pattern myself.

EDIT: I've explicitly called out some changes I think we can do regarding destructors in my review

stevemk14ebr commented 1 year ago

I took a look at all the destructor stuff. I think I've got it in a working state, can you try this and let me know if it works for you?

git apply destructors.patch

destructors.patch

If this is ok and works for you then I think we're ready to merge ;)

EDIT: How did you get traceview to decode the message field? I get a decode error when I select Auto. I setup my trace to use the GUID d7827ef0-cc9e-4b7c-a322-be5280ff3622

image

EDIT 2: Oh weird ok this breaks decoding too I see what you're saying. Hmm, maybe we just ignore it until I can figure this out

stevemk14ebr commented 1 year ago

This works so I'm merging the bulk of it now. We can open additional PRs for adjustments (such as maybe destructors) later.

stevemk14ebr commented 1 year ago

Really nice work on this one @kirbyUK ! Future work includes switching out the textual logger system entirely for this ETW provider mechanism and I'll probably eventually write some sort of frontend usermode UI to consume this data nicely. We still need to figure out ETW event consumption parsing iirc yes?

fyi: https://twitter.com/stevemk14ebr/status/1692552935421386818?s=20

stevemk14ebr commented 1 year ago

I figured out the destructor problem. When EtwTrace called push_back to put the new EtwProvider into the vector the temporary was being destructed. This called EtwUnregister which broke tracing. There was a minor bug in FindProvider as well that caused the compiler to erase the call entirely as it thought it was a no-op so this provider was constantly re-made and destroyed. It's a miracle it worked ;)

kirbyUK commented 1 year ago

Hi, sorry for the radio silence on this - I was away for the weekend with no mobile signal!

Thanks a lot for taking a look at the destructors, sorry about the initial bug. I've been hit by the problem of returning temporaries quite a few times in the past weeks, I should have spotted it! 😅

You're right in that we still need to figure out event decoding. I popped tdh.dll up in IDA once and when MS say it's a userspace library they really mean it, there's barely any kernel calls there and it's quite complicated! Now that I know about actually sending events I have a better idea on how they work though, so that could provide a decent start.

Thanks again!

stevemk14ebr commented 1 year ago

Looks like TDH.dll has symbols though ;)

You can pull the PDB using https://twitter.com/stevemk14ebr/status/1694390773435805899?s=20