vezenovm commented 2 months ago

Description

Problem*

Resolves

Part of general effort to improve mem2reg.

Summary*

We sometimes have situations such as the following:

            b3():
              v9 = load v0
              v10 = eq v9, Field 2
              constrain v9 == Field 2
              v11 = load v2
              v12 = load v2
              v13 = eq v12, Field 2
              constrain v11 == Field 2

v2 does not have a known value, thus we do not remove the load. The mem2reg pass is acting as expected here. However, without a store or call to the reference between v11 = load v2 and v12 = load v2 we should be able to safely remove v12 = load v2 and map v12 -> v11.

This PR adds this logic as part of the initial mem2reg pass. We have a new last_loads map as part of a Block. This is currently cleared after analyzing block and is meant to only be per block. Unifying these last loads across blocks and the accurate predecessors can come in a follow-up. This is an initial proof of concept to show the optimizations validity.

Given an instruction we act as following:

Load

Check if we have a last load from the current load address. If we do, remove the current current and map its result to the previous load result.
Add a last load for the address.
Store
- Remove the last load for the address
  Call
- Remove the last load for any reference arguments

I have also added two unit tests to mem2reg.rs

Additional Context

Documentation*

Check one:

[X] No documentation needed.
[ ] Documentation included in this PR.
[ ] [For Experimental Features] Documentation to be submitted in a separate PR.

PR Checklist*

[X] I have tested the changes locally.
[X] I have formatted the changes with Prettier and/or cargo fmt on default settings.

github-actions[bot] commented 2 months ago

Changes to Brillig bytecode sizes

Generated at commit: f637f27cdf7bde3c5e23dbaea72299b285aefe39, compared to commit: 619c5451b152d62e01d3c4c1da7e13ff6502f915

🧾 Summary (10% most significant diffs)

Program	Brillig opcodes (+/-)	%
brillig_rc_regression_6123	+19 ❌	+11.24%
sha256	+63 ❌	+3.39%
sha256_var_witness_const_regression	+43 ❌	+3.38%
sha256_var_padding_regression	+168 ❌	+3.34%
keccak256	+52 ❌	+3.04%
brillig_keccak	+52 ❌	+3.04%
6_array	-14 ✅	-3.55%
array_to_slice	-51 ✅	-5.62%

Full diff report 👇

| Program | Brillig opcodes (+/-) | % | |:-|-:|-:| | **brillig_rc_regression_6123** | 188 (+19) | **+11.24%** | | **sha256** | 1,924 (+63) | **+3.39%** | | **sha256_var_witness_const_regression** | 1,317 (+43) | **+3.38%** | | **sha256_var_padding_regression** | 5,197 (+168) | **+3.34%** | | **keccak256** | 1,765 (+52) | **+3.04%** | | **brillig_keccak** | 1,765 (+52) | **+3.04%** | | **conditional_regression_short_circuit** | 1,361 (+40) | **+3.03%** | | **sha256_var_size_regression** | 1,961 (+55) | **+2.89%** | | **fold_call_witness_condition** | 107 (+3) | **+2.88%** | | **reference_only_used_as_alias** | 262 (+7) | **+2.75%** | | **brillig_sha256** | 762 (+20) | **+2.70%** | | **conditional_1** | 1,283 (+32) | **+2.56%** | | **array_dynamic_blackbox_input** | 1,179 (+29) | **+2.52%** | | **brillig_constant_reference_regression** | 122 (+3) | **+2.52%** | | **regression_4449** | 821 (+20) | **+2.50%** | | **7_function** | 581 (+14) | **+2.47%** | | **brillig_cow_assign** | 125 (+3) | **+2.46%** | | **sha256_regression** | 7,059 (+156) | **+2.26%** | | **6** | 1,788 (+38) | **+2.17%** | | **array_dynamic_nested_blackbox_input** | 951 (+20) | **+2.15%** | | **ecdsa_secp256k1** | 975 (+20) | **+2.09%** | | **brillig_cow_regression** | 2,381 (+45) | **+1.93%** | | **brillig_oracle** | 331 (+6) | **+1.85%** | | **regression** | 699 (+9) | **+1.30%** | | **fold_numeric_generic_poseidon** | 707 (+9) | **+1.29%** | | **no_predicates_numeric_generic_poseidon** | 707 (+9) | **+1.29%** | | **higher_order_functions** | 728 (+9) | **+1.25%** | | **bigint** | 2,065 (+25) | **+1.23%** | | **regression_struct_array_conditional** | 512 (+6) | **+1.19%** | | **aes128_encrypt** | 517 (+6) | **+1.17%** | | **fold_2_to_17** | 529 (+6) | **+1.15%** | | **hashmap** | 26,668 (+296) | **+1.12%** | | **wildcard_type** | 275 (+3) | **+1.10%** | | **bench_2_to_17** | 296 (+3) | **+1.02%** | | **poseidon2** | 302 (+3) | **+1.00%** | | **fold_complex_outputs** | 657 (+6) | **+0.92%** | | **array_sort** | 343 (+3) | **+0.88%** | | **databus_two_calldata** | 252 (+2) | **+0.80%** | | **nested_array_dynamic** | 2,318 (+16) | **+0.70%** | | **schnorr** | 1,504 (+6) | **+0.40%** | | **uhashmap** | 23,584 (+69) | **+0.29%** | | **pedersen_hash** | 391 (+1) | **+0.26%** | | **sha2_byte** | 3,156 (+7) | **+0.22%** | | **brillig_pedersen** | 553 (+1) | **+0.18%** | | **pedersen_check** | 553 (+1) | **+0.18%** | | **simple_shield** | 655 (+1) | **+0.15%** | | **poseidon_bn254_hash_width_3** | 5,493 (-6) | **-0.11%** | | **poseidon_bn254_hash** | 5,493 (-6) | **-0.11%** | | **nested_array_in_slice** | 1,197 (-3) | **-0.25%** | | **regression_5252** | 5,133 (-13) | **-0.25%** | | **eddsa** | 11,145 (-34) | **-0.30%** | | **poseidonsponge_x5_254** | 4,305 (-14) | **-0.32%** | | **generics** | 187 (-1) | **-0.53%** | | **tuple_inputs** | 338 (-3) | **-0.88%** | | **array_if_cond_simple** | 99 (-1) | **-1.00%** | | **to_le_bytes** | 88 (-1) | **-1.12%** | | **u128** | 3,072 (-35) | **-1.13%** | | **slices** | 2,001 (-25) | **-1.23%** | | **hash_to_field** | 115 (-2) | **-1.71%** | | **brillig_hash_to_field** | 115 (-2) | **-1.71%** | | **slice_dynamic_index** | 2,528 (-47) | **-1.83%** | | **references** | 260 (-7) | **-2.62%** | | **6_array** | 380 (-14) | **-3.55%** | | **array_to_slice** | 856 (-51) | **-5.62%** |

vezenovm commented 2 months ago

reference_only_used_as_alias | +7 ❌ | +2.81%

Hmm this is surprising. I'm guessing that removing some of these loads is perhaps reducing the amount of trivial stores we can remove, but I'm not sure.

vezenovm commented 2 months ago

reference_only_used_as_alias | +7 ❌ | +2.81%

Hmm this is surprising. I'm guessing that removing some of these loads is perhaps reducing the amount of trivial stores we can remove, but I'm not sure.

The main difference between the SSA on master and this PR looks to be the inc_rc instructions remaining in place. Before mem2reg we have this pattern:

    v61 = load v52
    inc_rc v61
    inc_rc [u8 0, u8 0, u8 0, u8 0, u8 0]
    inc_rc [u8 0, u8 0, u8 0, u8 0, u8 0]
    v64 = load v51
    inc_rc v64
    v65 = load v52
    inc_rc v65
    v66 = load v52
    inc_rc v66
    v67 = load v52
    v68 = load v53
    v70 = lt v68, u32 4
    constrain v70 == u1 1 '"push out of bounds"'
    v72 = load v52
    v73 = load v53
    v74 = load v52
    v75 = load v53
    v76 = mul v75, u32 4
    v77 = array_set v72, index v76, value Field 0

The repeat loads are removed in this PR, but those follow-up inc_rc instructions remain. I think this can be handled in a follow-up though so I am marking this PR ready for review again.

jfecher commented 1 month ago

I tried this with the aliasing test in https://github.com/noir-lang/noir/issues/6120 and compared to master there's another regression. The output with this PR is:

acir(inline) fn main f0 {
  b0(v3: u1):
    jmpif v3 then: b1, else: b2
  b1():
    v11 = allocate
    jmp b3(v3, v3)
  b3(v6: &mut Field, v7: &mut Field):
    v12 = load v6
    store Field 2 at v7
    constrain v12 == Field 0
    constrain v12 == Field 2
    return 
  b2():
    v10 = allocate
    jmp b3(v3, v3)
}

The results of both constrains are replaced with v12 even though they should be different values due to the store Field 2 at v7 where v7 aliases v6. Originally the == FIeld 2 was done on a load that came after the store, which should then also hold 2.

jfecher commented 1 month ago

Original example from #6120 had a typo. After fixing the typo it works on master but is still seeing a regression in this PR. Here's the final output:

acir(inline) fn main f0 {
  b0(v3: u1):
    jmpif v3 then: b1, else: b2
  b1():
    v11 = allocate
    store Field 0 at v11
    jmp b3(v11, v11)
  b3(v6: &mut Field, v7: &mut Field):
    v12 = load v6
    store Field 2 at v7
    constrain v12 == Field 0
    constrain v12 == Field 2
    return 
  b2():
    v10 = allocate
    store Field 1 at v10
    jmp b3(v10, v10)
}

I've fixed the typo in the issue so that this can be reproduced

vezenovm commented 1 month ago

I tried this with the aliasing test in #6120 and compared to master there's another regression

I have switched keep_repeat_loads_with_alias_store to the test in the issue, but I added an extra parameter to b3 to also check that we are accounting for parameters with more just one other alias.

vezenovm commented 1 month ago

brillig_rc_regression_6123 +19 ❌ +10.16% fold_2_to_17 +42 ❌ +6.89% bench_2_to_17 +21 ❌ +6.60% fold_numeric_generic_poseidon +51 ❌ +6.57% no_predicates_numeric_generic_poseidon +51 ❌ +6.57% poseidon2 +21 ❌ +6.54%

Looks like we are getting various regressions now after the RC correctness fix. Going to table this PR for now, and look at further optimizing RC instruction removals .

noir-lang / noir

feat(perf): Track last loads per block in mem2reg and remove them if possible #6088

Description

Problem*

Summary*

Load

Store

Call

Additional Context

Documentation*

PR Checklist*

Changes to Brillig bytecode sizes

🧾 Summary (10% most significant diffs)