rust-lang / rust


Bad codegen for `non-copy-derived` struct with all `Copy` derived fields #128081

Closed CrazyboyQCD closed 2 weeks ago

CrazyboyQCD commented 2 months ago

Godbolt link. As you can see in the asm output, even with opt-level = 3, if we don't derive Copy on a struct whose fields are all Copy, the derived clone() generates many more mov instructions, and for a large struct it is not lowered to a memcpy.
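A reduced sketch of the pattern being reported. The type names (`CloneOnly`, `CloneAndCopy`) and the field layout are hypothetical and not taken from the Godbolt link; they only illustrate two otherwise identical structs that differ in whether `Copy` is derived.

```rust
// Hypothetical reduction; names and fields are illustrative only.

/// Every field is `Copy`, but the struct itself only derives `Clone`,
/// so `clone()` is generated field by field.
#[derive(Clone, Debug, PartialEq)]
pub struct CloneOnly {
    pub a: u32,
    pub b: u32,
    pub c: u8,
    pub d: [u8; 64],
}

/// Same layout, but `Copy` is derived as well. For a non-generic type like
/// this one, the derived `clone()` can simply be a bitwise copy (`*self`).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct CloneAndCopy {
    pub a: u32,
    pub b: u32,
    pub c: u8,
    pub d: [u8; 64],
}

/// Goes through the field-by-field `clone`.
pub fn clone_only(x: &CloneOnly) -> CloneOnly {
    x.clone()
}

/// Goes through the `Copy`-based `clone`.
pub fn clone_and_copy(x: &CloneAndCopy) -> CloneAndCopy {
    x.clone()
}
```

Both functions return the same value; the report is about how differently the two get compiled, which may vary with the compiler version.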

tgross35 commented 2 months ago

There was some brief discussion on Zulip about changing Clone usage to Copy as a MIR opt in cases where they are known to be the same. Don't remember details here and could very well be misremembering, but @scottmcm I think you might have been the one to bring it up?

scottmcm commented 2 months ago

For primitives we already change .clone() calls to copies (#94276) exactly so that the ones in derived Clone implementations become copies instead. As such, if you check MIR you'll see https://godbolt.org/z/T131P5YP1

        StorageLive(_45);
        _45 = ((*_1).43: u8);
        StorageLive(_46);
        _46 = ((*_1).44: u8);
        StorageLive(_47);
        _47 = ((*_1).45: u8);
        StorageLive(_48);
        _48 = ((*_1).46: u8);
        StorageLive(_49);
        _49 = ((*_1).47: u8);
        StorageLive(_50);
        _50 = ((*_1).48: u8);
        StorageLive(_51);
        _51 = ((*_1).49: u8);
        StorageLive(_52);
        _52 = ((*_1).50: u8);
        StorageLive(_53);
        _53 = ((*_1).51: u8);
        StorageLive(_54);
        _54 = ((*_1).52: [Dav1dSequenceHeaderOperatingParameterInfo; 32]);
        _0 = Dav1dSequenceHeader { profile: move _2, max_width: move _3, max_height: move _4, layout: move _5, pri: move _6, trc: move _7, mtrx: move _8, chr: move _9, hbd: move _10, color_range: move _11, num_operating_points: move _12, operating_points: move _13, still_picture: move _14, reduced_still_picture_header: move _15, timing_info_present: move _16, num_units_in_tick: move _17, time_scale: move _18, equal_picture_interval: move _19, num_ticks_per_picture: move _20, decoder_model_info_present: move _21, encoder_decoder_buffer_delay_length: move _22, num_units_in_decoding_tick: move _23, buffer_removal_delay_length: move _24, frame_presentation_delay_length: move _25, display_model_info_present: move _26, width_n_bits: move _27, height_n_bits: move _28, frame_id_numbers_present: move _29, delta_frame_id_n_bits: move _30, frame_id_n_bits: move _31, sb128: move _32, filter_intra: move _33, intra_edge_filter: move _34, inter_intra: move _35, masked_compound: move _36, warped_motion: move _37, dual_filter: move _38, order_hint: move _39, jnt_comp: move _40, ref_frame_mvs: move _41, screen_content_tools: move _42, force_integer_mv: move _43, order_hint_n_bits: move _44, super_res: move _45, cdef: move _46, restoration: move _47, ss_hor: move _48, ss_ver: move _49, monochrome: move _50, color_description_present: move _51, separate_uv_delta_q: move _52, film_grain_present: move _53, operating_parameter_info: move _54 };

which is copying the fields, then moving them into the aggregate.
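In source terms, the MIR above corresponds roughly to a hand-written `Clone` like the following sketch. The struct is a hypothetical three-field stand-in (the field names come from the MIR, the types are assumed), not the real `Dav1dSequenceHeader`.

```rust
// Hypothetical, shortened stand-in for the struct in the MIR above.
#[derive(Debug, PartialEq)]
pub struct Header {
    pub profile: u8,
    pub max_width: i32,
    pub max_height: i32,
}

impl Clone for Header {
    fn clone(&self) -> Self {
        // Roughly what `#[derive(Clone)]` produces for a non-`Copy` struct:
        // clone each field on its own, then build the aggregate from the
        // temporaries. For primitive fields each `clone` is already a plain
        // copy at the MIR level, but the per-field structure remains.
        Header {
            profile: Clone::clone(&self.profile),
            max_width: Clone::clone(&self.max_width),
            max_height: Clone::clone(&self.max_height),
        }
    }
}
```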

https://rust-lang.github.io/rfcs/1521-copy-clone-semantics.html lets the standard library call Copy instead of Clone on things, but if -- like here -- the whole type isn't Copy that can't apply.
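A sketch of what that guarantee permits, using two hypothetical helper functions (`dup_clone`, `dup_copy`) that are not standard library APIs: with only `T: Clone` every element has to go through `clone()`, while `T: Copy` makes one bulk bitwise copy valid.

```rust
/// With only `T: Clone`, each element must be cloned individually.
pub fn dup_clone<T: Clone>(src: &[T]) -> Vec<T> {
    src.iter().map(|x| x.clone()).collect()
}

/// With `T: Copy`, `clone()` is required to behave like a bitwise copy,
/// so the whole slice can be duplicated in one go.
pub fn dup_copy<T: Copy>(src: &[T]) -> Vec<T> {
    let mut out: Vec<T> = Vec::with_capacity(src.len());
    // SAFETY: `out` has capacity for `src.len()` elements, the source and
    // destination buffers do not overlap, and `T: Copy` means a bitwise
    // copy yields valid, independent values.
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), out.as_mut_ptr(), src.len());
        out.set_len(src.len());
    }
    out
}
```

A struct that is only `Clone` can be passed to `dup_clone` but not to `dup_copy`, even when all of its fields are `Copy`, which is the limitation described above.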

My instinct is that this should be filed to LLVM, because it's much better positioned to look at all the loads and stores we give it https://godbolt.org/z/4W9PP8nTW and coalesce them into something smaller.

And types should be marked Copy where possible because that allows RFC1521 to skip clones in lots of places in the standard library. Is there a reason that this one wasn't?

tgross35 commented 2 months ago

Could we do something like run the #[derive(Copy)] check (i.e. see if all members are Copy) on everything Clone, and then make it get the same clone -> copy transformation if Copy could apply?

There are some good reasons not to use Copy even when it would be allowed: once a type is Copy, adding a new private non-Copy field is an API-breaking change. And often it's not great to have the implicit duplication in your code (e.g. I think it's pretty common to turn off automatic #[derive(Copy)] in bindgen).
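A small sketch of the API-breakage concern, with a hypothetical library type `Settings` and a hypothetical downstream function `use_twice`:

```rust
// Hypothetical library type that opted into `Copy`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Settings {
    pub width: u32,
    pub height: u32,
    // Adding a private field such as `name: String` later would force the
    // library to drop `Copy`, and `use_twice` below would then fail to
    // compile with a use-after-move error.
}

/// Hypothetical downstream code that relies on implicit copies.
pub fn use_twice(s: Settings) -> (Settings, Settings) {
    let first = s; // implicit copy; `s` stays usable
    let second = s; // only compiles because `Settings: Copy`
    (first, second)
}
```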

(That being said, it does seem like minimizing and opening an LLVM issue would be good since there is something it's not seeing through)

kkysen commented 2 months ago

> And types should be marked Copy where possible because that allows RFC1521 to skip clones in lots of places in the standard library. Is there a reason that this one wasn't?

We didn't because the types are fairly large, so we want to avoid accidental copies. I was expecting a .clone() that does exactly what a copy would do to be optimized the same way.

We may switch to #[derive(Copy)] now anyway, because the optimization is much better and perf is very important for us, but we'd definitely prefer not to, since these types aren't meant to be copied implicitly.

It does seem like there should be a better way for std to detect this than Copy, one that decouples "bitwise-copyable" from "implicitly copied on use".
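Until something like that exists, one possible workaround is a hand-written `Clone` that does the bitwise copy explicitly while the type stays non-`Copy`. This is only a sketch and was not proposed in the thread; `BigPlainData` and its fields are hypothetical, and whether it actually compiles down to a `memcpy` depends on the compiler version.

```rust
// Hypothetical large plain-data type that deliberately does not derive `Copy`.
#[derive(Debug, PartialEq)]
pub struct BigPlainData {
    pub id: u32,
    pub flags: u8,
    pub payload: [u8; 256],
}

impl Clone for BigPlainData {
    fn clone(&self) -> Self {
        // SAFETY: every field is `Copy` and the type has no `Drop` impl, so
        // a bitwise duplicate is a valid, independently owned value.
        // This stops being sound if a non-`Copy` field is ever added.
        unsafe { std::ptr::read(self as *const Self) }
    }
}
```

The trade-off is an `unsafe` block whose justification has to be re-checked every time a field is added, which the derived `Clone` never requires.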

> My instinct is that this should be filed to LLVM, because it's much better positioned to look at all the loads and stores we give it https://godbolt.org/z/4W9PP8nTW and coalesce them into something smaller.

@CrazyboyQCD, I think it'd be good to file this against LLVM, too. I would think LLVM should be able to optimize this without help from rustc.

CrazyboyQCD commented 2 months ago

@scottmcm, would you mind doing this for LLVM? I'm not quite sure how to describe this clearly.

tgross35 commented 2 months ago

Can you minimize the code example as much as possible? Remove fields, manually inline function calls, delete irrelevant code, etc., as long as the issue still shows up.

If you do that, you can more or less just post the LLVM IR with Copy and the IR without Copy to an issue, as long as you link the original godbolt. You can view the IR by clicking "add new" and then "LLVM IR" in the assembly tab. The goal is to show a missed optimization, i.e. "code A should be equivalent to code B but LLVM can't see it".

It's better yet if you can get something that reproduces with llc. I don't have a great process for this, but usually I copy the LLVM IR from the Rust output to an llc godbolt (I just use llvm.godbolt.org and set the input language to "LLVM IR") and try to delete more stuff there. Note that you might need to manually demangle the function names so it actually compiles.

Scott is definitely far more in the know here than I am and can probably give some better suggestions, but if you can minimize it a bit then that's a great start :)

tgross35 commented 2 months ago

Sorry about that, GitHub decided to click a button for me.

CrazyboyQCD commented 2 months ago

@tgross35, just minimized the examples and pasted them.

DianQK commented 2 months ago

I want to see if I can use #94276 to complete this in codegen. :) Having LLVM recognize this pattern might take up a lot of compile time.

@rustbot claim

DianQK commented 2 months ago

Found a "future" regression: when some values are modified afterwards, clone is better than copy (https://godbolt.org/z/xn1sKbs64). LLVM has more optimizations for stores than for memcpy.
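One shape this comparison might take, as a sketch only: the types and functions below are hypothetical and the linked Godbolt may look different. The idea is that a field-by-field clone followed by overwriting some fields gives LLVM individual stores to work with, whereas the `Copy` version starts from a single whole-struct copy.

```rust
// Hypothetical types; not taken from the linked Godbolt.
#[derive(Clone, Debug, PartialEq)]
pub struct ParamsClone {
    pub a: u64,
    pub b: u64,
    pub c: u64,
    pub d: u64,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ParamsCopy {
    pub a: u64,
    pub b: u64,
    pub c: u64,
    pub d: u64,
}

/// Field-by-field clone, then overwrite two fields.
pub fn clone_then_modify(src: &ParamsClone) -> ParamsClone {
    let mut out = src.clone();
    out.a = 0;
    out.d = 1;
    out
}

/// Whole-struct copy, then overwrite the same two fields.
pub fn copy_then_modify(src: &ParamsCopy) -> ParamsCopy {
    let mut out = *src;
    out.a = 0;
    out.d = 1;
    out
}
```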

@rustbot label +A-LLVM

theemathas commented 3 weeks ago

Another example of the issue: Godbolt link