google / sanitizers

AddressSanitizer, ThreadSanitizer, MemorySanitizer
Other
11.37k stars 1.02k forks source link

False positive when calling a function using dlsym #911

Open rnburn opened 6 years ago

rnburn commented 6 years ago

Code like this

extern "C" {
int __attribute((weak))
        OpenTracingMakeTracerFactory(const char* opentracing_version,
                                             const void** error_category,
                                             void** tracer_factory);
}  // extern "C"

...

  int (*make_tracer_factory)(const char*, const void**, void**) =
      reinterpret_cast<decltype(OpenTracingMakeTracerFactory)*>(
          dlsym(handle, "OpenTracingMakeTracerFactory"));

...

  const void* error_category = nullptr;
  void* tracer_factory = nullptr;
  const auto rcode = make_tracer_factory("1.0.0", &error_category,
                                         &tracer_factory);

gives the error


main.cpp:36:22: runtime error: call to function (unknown) through pointer to incorrect function type 'int (*)(const char *, const void **, void **)'
(liba.so+0x980): note: (unknown) defined here

even though it's the standard way of invoking a function from dlopen.

I put together a more complete example here.

kcc commented 6 years ago

Off the top of my head I don't know if this is an expected/desired behavior of ubsan or not. We don't use -fsanitize=function widely. To unblock yourself you can use https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html#disabling-instrumentation-with-attribute-no-sanitize-undefined

rnburn commented 6 years ago

Undefined behavior sanitizer claims not to produce false positives, yet this is a legitimate use of casts.

When I asked about this on LLVM's IRC channel, they recommended putting in a report for it.

kcc commented 6 years ago

I bet they actually meant to report at bugs.llvm.org (not sure if its easy to register a new account ) Anyway, @zygoloid, WDYT about -fsanitize=function barking on the pointer returned by dlsym?

rnburn commented 6 years ago

I can put it there (I already have an account), but I didn't see any category for sanitizer bugs... and isn't this project meant to track these issues with sanitizers?

kcc commented 6 years ago

Yea, we kind of use both trackers interchangeably. This tracker is not-llvm-specific, e.g. gcc issues are also discussed here.

morehouse commented 6 years ago

@zygoloid: Ping.

zygoloid commented 6 years ago

Sorry I missed this before. I would imagine the problem is that use of RTLD_LOCAL is causing the type_infos for the function types to not match across DSOs. This is likely not just an fsan problem; I'd expect (eg) a function pointer thrown from the DSO to not be caught outside it either.

Maybe Clang needs to check whether a type_info object's type involves any user-defined types and force the use of string comparison for equality if not. Yuck.

We should discuss this on the clang lists, maybe @rjmccall has a better idea.

rjmccall commented 6 years ago

Just to state Richard's basic point explicitly: there's nothing wrong with this use of dlsym.

I think there are four key points in the analysis of this problem:

  1. dlopen is arguably wrong to interpret RTLD_LOCAL as permitting it to avoid resolving weak symbols in the DSO to entities already in the program. (The Darwin loader seems to do this resolution in my testing, FWIW.) If dlopen did this resolution correctly, this bug would be limited to calls between late-loaded DSOs.

  2. If we take the behavior of dlopen as a given, RTLD_LOCAL can break C++ semantics in a number of different ways, and I don't think it's the ABI's responsibility to try to compensate for that. There is certainly no way to compensate for RTLD_LOCAL's impact on static variables in inline functions or static data members of class templates.

  3. I don't think there's an ABI-compatible way to compensate for this anyway. There is no bit we can set in the standard ABI that will force the use of string comparison for type_info equality, and there is no space to add such a bit that wouldn't break existing compiled code that manipulates type_info objects. If you're considering breaking ABI to try to solve this, you should break ABI to adopt a Darwin-ARM64-like approach that does not rely on vague linkage across DSOs.

  4. UBSan can fix this bug without requiring any ABI changes by simply not using RTTI anymore for the type matching. Just make a string literal with the mangled function type in it or something, and you will magically reduce the binary-size and load-time impact of this feature while avoiding a long chain of headaches.

pitrou commented 5 years ago

This does not seem to happen only with dlsym, but even with std::function (or at least some uses thereof?).

Flamefire commented 5 years ago

I ran into the same issue, but without the weak attribute: Simply an extern "C" function in a shared library later loaded with dlopen(..., RTLD_LOCAL). In this case it is a simple function of type unsigned int (*)().

Note that this is extremely common for plugin systems where every plugin must provide some common functions which are then dlopen'ed later. Obviously RTLD_GLOBAL must not be used or symbols from different plugins would clash.

ezyang commented 4 years ago

We at the PyTorch project also ran into this. More context at https://github.com/pytorch/pytorch/pull/28536/ I'm planning to work around it by LD_PRELOAD ing the library in question into Python at startup when I want to run under UBSAN (I have to do this already anyway for the ASAN library, so it's no big deal.)

chrstphrchvz commented 11 months ago

4. UBSan can fix this bug without requiring any ABI changes by simply not using RTTI anymore for the type matching.

Has this effectively been accomplished in LLVM Clang 17 by https://github.com/llvm/llvm-project/commit/46f366494f3c ?