tenstorrent / tt-umd

User-Mode Driver for Tenstorrent hardware
Apache License 2.0
9 stars 5 forks source link

Add options to enable compilation with sanitizers #43

Closed patrickroberts closed 2 months ago

patrickroberts commented 2 months ago

Ticket

https://github.com/tenstorrent/tt-metal/issues/11453

Problem description

We want the ability to compile C++ with sanitizers to verify code's safety

What's changed

This pull request adds options for MemorySanitizer, ThreadSanitizer, and UndefinedBehaviorSanitizer to CMake and ensures that they're mutually exclusive during configuration.

joelsmithTT commented 2 months ago

Looks reasonable to me. Did you find anything good?

patrickroberts commented 2 months ago

Looks reasonable to me. Did you find anything good?

Not yet. I'd found a bug the other week with ASAN but something about my setup prevented me from reproducing it more recently, even though the code it identified still appears to be present in the codebase.

https://github.com/tenstorrent/tt-umd/blob/e7819efe6ab1057c31594526b4b059b823cc9cf1/device/tt_silicon_driver.cpp#L3615-L3616

This call to reserve at one point had been flagged by ASAN because on the next line the data pointer was passed as an output parameter into read_device_memory to be populated before it's valid to write to those addresses. When I run the tests, the control flow still passes through these lines even though I've been unable to reproduce the ASAN error, but these are the sorts of issues the sanitizers are supposed to catch. When I could reproduce it, one possible fix was to change reserve to resize (and yes I'm aware that unnecessarily initializes the data that's going to be overwritten) in order for the pointer to be a valid output address. A more appropriate but involved fix might be to continue to use reserve, but pass a reference to the vector, or a std::back_insert_iterator to read_device_memory to avoid the unnecessary initialization overhead of resize.

I saw this particular bug as a relatively minor issue, but have not yet had time to explore potential errors flagged by the other sanitizers, which is why I'm not currently trying to run these configurations on any post-commit workflows.