rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.33k stars 12.72k forks source link

Optimize copies of large enums #54360

Open Manishearth opened 6 years ago

Manishearth commented 6 years ago

For types like SmallVec<[T; 1000]>, or in general an enum where the variants have a huge difference in size, we should probably try to optimize the copies better.

Basically, for enums with some large-enough difference between variant sizes, we should use a branch when codegenning copies/moves.

I'm not sure how common this pattern is, but it's worth looking into!

cc @rust-lang/wg-codegen

leonardo-m commented 6 years ago

Discussed here: https://old.reddit.com/r/programming/comments/9gzppu/falling_in_love_with_rust/e68tlfw/?st=jm8z60p1

eddyb commented 6 years ago

I have two theories/ideas for heuristics (without investigating anything first):

  1. a branch would cost you less than what memcpy would take to copy one chunk, of whatever best-case-scenario size it can support - and that best-case might be related to platform SIMD sizes
  2. branching is worth it, if it saves you from accessing more data cache lines - then, you want to take into account common cache line sizes (usually 64 bytes IIUC)

cc @rkruppe

eternaleye commented 6 years ago

One way of tackling this might be:

  1. For each enum type, emit an rodata array mapping discriminants to variant sizes
  2. For any enum that is "sufficiently heterogeneous" in size, emit memcpys of the result of a lookup in the relevant array

Since the lookup happening is deterministic, and the indices (and thus arrays) are likely to be small for hugely-heterogeneous enums, the arrays are both 1.) easily prefetched by the CPU and 2.) likely to fit in a cacheline. As a result, most cases may avoid having any observable latency penalty unless the memory bus is saturated.

EDIT: Also, this pollutes the D$ but not the I$ or the branch predictor, and the D$ is going to be prodded by the memcpy anyway.

steveklabnik commented 4 years ago

Triage: I'm not aware of any changes here

oli-obk commented 4 years ago

This should probably be written as a MIR optimization. The mentioned array can be created as a const directly in the mir (mentioned via Rvalue::Use(Operand::Const(...))). If we add a query that takes a Ty<'tcx> and emits said constant, we'll even get deduplication for free.