rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
96.66k stars 12.49k forks source link

Support whole program devirtualization #68262

Open tmandry opened 4 years ago

tmandry commented 4 years ago

When doing LTO, LLVM supports a devirtualization pass that converts some indirect calls to static, inlineable calls. This can unlock a lot of optimizations when dynamic dispatch is used frequently in a program.

I don't know how hard this would be to support in Rust. Clang has a flag -fwhole-program-vtables which we would need to emulate. From there it seems like it might "just work". I'm opening this issue to start collecting more information.

I also found #11915 which mentions this, but not sure if any of the information there is still accurate.

sanxiyn commented 4 years ago

In LLVM terms, this means emitting type metadata and using llvm.type.test.

flip1995 commented 3 years ago

I'd be interested to look into this issue. Since I'm not that familiar with this part of the compiler (but want to get more familiar with it) I first wanted to ask what would be involved in fixing this.

AFAICT to fix this, a flag would have to be added to the compiler and if that flag is set, (some) types would have to be marked with the !vcall_visibility metadata when lowering to LLVM IR. We should be able to query the visibility of items with the TyCtxt::visibility query which we would then use to convert to the corresponding !vcall_visibility like this:

rustc_middle::ty::Visibility !vcall_visibility
Public 0 (Public)
Restricted 2 (Translation Unit)
Invisible 1 (Linkage Unit)

@tmandry could you give me some pointers where to start with this issue? Do you have an estimate how much work would be involved here?

bjorn3 commented 3 years ago
rustc_middle::ty::Visibility !vcall_visibility
[...] [...]
Restricted 2 (Translation Unit)
Invisible 1 (Linkage Unit)

This is not correct. A trait marked as invisible or restricted can still used by a foreign codegen unit or a foreign shared library. For example if it is used by a public generic or #[inline] function. It can also be implemented for downstream types if there is a blanket implementation.

flip1995 commented 3 years ago

Ah right of course, so it has to depend on the visibility of the trait and its implementors. So if the trait is public, or if there are any generics involved or blanket impls exist, than it should be marked as !vcall_visibility=0 (pub). But if the trait is not public and only specific implementations exist and the types of those specific impls are restricted, translation unit or linkage unit could be chosen.

Aaron1011 commented 2 years ago

It looks like we're already getting some form of this for free. The following code:

trait Foo {
    fn my_method(&self) {}
}

impl Foo for bool {
    fn my_method(&self) {
        std::process::abort();
    }
}

#[inline(never)]
pub fn use_it(val: &dyn Foo) {
    val.my_method();
}

pub fn main() {
    use_it(&true);
}

results in the following optimized LLVM IR for use_it:

; playground::use_it
; Function Attrs: noinline noreturn nonlazybind uwtable
define internal fastcc void @_ZN10playground6use_it17h9abb7f0a088ea9f2E() unnamed_addr #4 {
start:
; call std::process::abort
  tail call void @_ZN3std7process5abort17h887d510ab1f66ceaE() #10
  unreachable
}
bjorn3 commented 2 years ago

This only happens when use_it isn't public. Notice how the llvm ir for use_it doesn't have any arguments. I think LLVM propagates the argument into use_it as it is constant, sees a load of a function pointer from an immutable vtable, replaces the load with a constant representing the actual function pointer and then turns the indirect call to the function pointer into a regular direct call.

hudson-ayers commented 2 years ago

Using the code you shared, I still see a vtable in the final binary: https://rust.godbolt.org/z/qzb6o9MKG . Were you using a different version of Rust, or different set of compilation flags to achieve this?

flip1995 commented 2 years ago

FYI, I have a semi-working version here: https://github.com/flip1995/rust/tree/pk-vfe

It seems to still remove a bit too many functions or possibly does something wrong when converting the type.checked.load intrinsics back to normal loads. I'm currently investigating this.


If the above example is compiled as a crate-type=lib, then the my_method doesn't get removed. Godbolt compiles everything as a lib AFAIK. I guess as a binary, LLVM just get's rid of the whole vtable of the crate, since it can determine all the types and can get rid of dynamic dispatch and with that can remove the vtable of Foo entirely. (A short investigation revealed that this is most likely done by the GlobalOptPass in LLVM:

Optimize globals that never have their address taken.

Aaron1011 commented 2 years ago

@hudson-ayers I didn't scroll down far enough in the playground output. While the function is no longer making a call through the vtable, the vtable itself is still present in the binary, as you noticed.

hudson-ayers commented 2 years ago

Is it correct this was closed by #96285 ? While dead virtual function elimination is awesome, it is not the same as whole program devirtualization, which can replace actually-called virtual functions with static calls. Supporting the first is not equivalent to supporting the second. And if -Zvirtual-function-elimination actually enables both, then it seems it should have a name that indicates as such

flip1995 commented 2 years ago

This shouldn't have been closed, because it isn't quite finished yet. IIRC (it has been some time since I actively worked on this) it only enables VFE, not WPD. Though WPD is enabled by default (for LTO at least) and VFE is only an extension of the WPD optimization.

flip1995 commented 2 years ago

For what is missing, see https://github.com/flip1995/rust/blob/195f2082002c9db456e0fde8c1d5db79929ae293/src/doc/unstable-book/src/compiler-flags/virtual-function-elimination.md and soon the unstable book.

Ragarnoy commented 4 months ago

Is this dead in the water then?

flip1995 commented 4 months ago

I don't think anyone is working on this right now. But in theory it should be possible to enable this optimization in rustc. What's missing is an analysis, that correctly determines the visibility of the vtable in the documented example. This is easier said then done though.