noir-lang / noir

Noir is a domain-specific language for zero-knowledge proofs
https://noir-lang.org
Apache License 2.0

Conditional RAM writes scale nonlinearly in loops #5027

Open sirasistant opened 3 months ago

sirasistant commented 3 months ago

Aim

Using conditional ram writes:

global SIZE = 500;

fn main(mut fields: [Field; SIZE], enables: [bool; SIZE], indices: [u64; SIZE]) -> pub [Field; SIZE] {
    for i in 0..SIZE {
        if enables[i] {
            fields[indices[i]] = i as Field;
        }
    }

    fields
}

Expected Behavior

I expected conditional RAM writes to scale with the number of iterations in the same way as unconditional ones:

global SIZE = 500;

fn main(mut fields: [Field; SIZE], indices: [u64; SIZE]) -> pub [Field; SIZE] {
    for i in 0..SIZE {
        fields[indices[i]] = i as Field;
    }

    fields
}

Bug

The first example scales nonlinearly: with SIZE = 30 it's 10k constraints (333 constraints/iteration) and with SIZE = 500 it's 1.65M constraints (3300 constraints/iteration).

The second example (unconditional) scales sublinearly (possibly because of fixed costs): with SIZE = 100 it's 4.6k constraints and with SIZE = 500 it's 12k constraints.

Also, it seems like even without the nonlinear scaling, conditional RAM writes are quite expensive (10x the cost of an unconditional one in the best case).
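For intuition on that per-write cost: the SSA dumps later in this thread show the conditional write lowered to a read-modify-write with a multiplexer, i.e. `enable * new + (1 - enable) * old`. A minimal Python model of that select (a sketch, not Noir and not the compiler's actual lowering):

```python
# Illustrative model of a predicated array write: read the target slot,
# then select between the new and old value with a mux. The extra dynamic
# read and the mul/add are overhead an unconditional write doesn't pay.
def conditional_write(fields, enable, index, value):
    """Write `value` at `index` only when `enable` is 1, via a mux."""
    old = fields[index]  # extra dynamic read of the pre-write value
    fields[index] = enable * value + (1 - enable) * old
    return fields

fields = [10, 20, 30]
conditional_write(fields, 1, 1, 99)  # enabled: slot 1 becomes 99
conditional_write(fields, 0, 2, 77)  # disabled: slot 2 keeps 30
```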


TomAFrench commented 3 months ago

SSA for the expensive program with SIZE = 2

After Array Set Optimizations:
acir(inline) fn main f0 {
  b0(v0: [Field; 2], v1: [u1; 2]):
    v92 = array_get v1, index u64 0
    enable_side_effects v92
    v93 = array_get v0, index u64 0
    enable_side_effects u1 1
    v95 = cast v92 as Field
    v97 = mul v95, v93
    v98 = cast v92 as u64
    v100 = array_get v1, index u64 1
    enable_side_effects v100
    v101 = array_get v0, index u64 1
    v102 = lt v98, u64 2
    v103 = mul v102, v100
    constrain v103 == v100 'push out of bounds'
    v104 = array_set [v97, Field 0], index v98, value v101
    v105 = add v98, u64 1
    v106 = not v100
    enable_side_effects v100
    v107 = array_get v104, index v98
    v108 = array_get [v97, Field 0], index v98
    v109 = cast v100 as Field
    v110 = cast v106 as Field
    v111 = mul v109, v107
    v112 = mul v110, v108
    v113 = add v111, v112
    v114 = array_set mut v104, index v98, value v113
    enable_side_effects u1 1
    v115 = cast v100 as u64
    v116 = cast v106 as u64
    v117 = mul v115, v105
    v118 = mul v116, v98
    v119 = add v117, v118
    constrain v119 == u64 0
    return v114
}

One thing that jumps out is the check that we're not pushing past the capacity of the BoundedVec. We have enough information here that we should be able to rule that out at compile time.

TomAFrench commented 3 months ago

https://github.com/noir-lang/noir/blob/9c6de4b25d318c6d211361dd62a112a9d2432c56/noir_stdlib/src/collections/bounded_vec.nr#L24

We should be able to just remove this assertion (or throw the error in an unconstrained block if we want to maintain the error message). The attempt to write past the end of the array would trigger the ACVM to fail.

TomAFrench commented 3 months ago

This ends up being just a small contribution (~4,000 constraints).

guipublic commented 3 months ago

Did you really compare the number of constraints or just the number of acir opcodes? Because 25 constraints for writing 500 times into an array of 500 elements does not seem right. I would expect ~2000 instead.

sirasistant commented 3 months ago

@guipublic

I was referring to 25 constraints/item, will update first message!

This function:

global SIZE = 500;

fn main(fields: [Field; SIZE], to_keep: [bool; SIZE]) -> pub [Field; SIZE] {
    let mut bounded_vec: BoundedVec<Field, SIZE> = BoundedVec::new();

    for i in 0..SIZE {
        if to_keep[i] {
            bounded_vec.push(fields[i]);
        }
    }

    assert_eq(bounded_vec.len(), 0);
    bounded_vec.storage
}

is 1.6M gates

This one:

global SIZE = 500;

fn main(mut fields: [Field; SIZE], indices: [u64; SIZE]) -> pub [Field; SIZE] {
    for i in 0..SIZE {
        fields[indices[i]] = i as Field;
    }

    fields
}

is 12k gates

jfecher commented 3 months ago

One thing that jumps out is the check that we're not pushing past the capacity of the BoundedVec. We have enough information here that we should be able to rule that out at compile time.

This looks like it'd be difficult to rule out at compile time. We know that we only push at most once per loop iteration and only loop SIZE times, but the compiler just sees that each push is conditional on to_keep and can't keep an exact count. If we wanted to remove this condition we'd need much more meticulous tracking of maximum bounds, probably tracking each individual case of the value (e.g. len can be 0 or 1 here) and then verifying all of those cases are < 2.
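One cheaper alternative to tracking every case would be a plain upper-bound analysis: count at most one push per conditional push site per iteration and compare the bound to the capacity. A hypothetical Python sketch of that idea (names are illustrative, not compiler internals):

```python
# Hypothetical upper-bound analysis for a BoundedVec inside a loop:
# instead of an exact length, track the maximum length it could reach.
def max_len_after_loop(iterations, pushes_per_iteration=1):
    """Upper bound on len() when every push is behind a condition."""
    return iterations * pushes_per_iteration

SIZE = 500
CAPACITY = 500  # the BoundedVec's compile-time capacity

# Each of the SIZE iterations pushes at most once, so len <= SIZE.
# The runtime capacity check could be elided iff this bound fits.
assert max_len_after_loop(SIZE) <= CAPACITY
```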

sirasistant commented 3 months ago

The length check is not the main contributor though; there seems to be some weird scaling with loop sizes going on. Constraint counts per iteration become much more similar if we decrease the number of iterations to 20. In that case, the conditional version costs 320 constraints/write (a 10x reduction per iteration just from scaling down the number of iterations).

sirasistant commented 3 months ago

To avoid any boundedvec related overhead, tried this code:

global SIZE = 500;

fn main(mut fields: [Field; SIZE], enables: [bool; SIZE], indices: [u64; SIZE]) -> pub [Field; SIZE] {
    for i in 0..SIZE {
        if enables[i] {
            fields[indices[i]] = i as Field;
        }
    }

    fields
}

And it also scales nonlinearly. With size=500 it's 1.65M constraints (3300 constraints/iteration) and with size=30 it's 10k constraints (333 constraints/iteration)

sirasistant commented 3 months ago

Updating the original issue

jfecher commented 3 months ago

And it also scales nonlinearly. With size=500 it's 1.65M constraints (3300 constraints/iteration) and with size=30 it's 10k constraints (333 constraints/iteration)

Sounds related to https://github.com/noir-lang/noir/issues/4629 then.

Testing your most recent example in the comment I get:

SIZE | Acir Opcodes | Change
2    |     57       |
3    |     88       |   31
4    |     121      |   33
5    |     156      |   35

So it is increasing by 2 per each increase in SIZE.
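A constant second difference of 2 means the opcode count fits a quadratic in SIZE. A quick Python check of that fit (the coefficients below are solved from the table itself, not taken from the compiler):

```python
# Measured ACIR opcode counts from the table above.
measured = {2: 57, 3: 88, 4: 121, 5: 156}

def fitted(size, b=26, c=1):
    """Quadratic fit size^2 + b*size + c; b, c solved from the first rows."""
    return size * size + b * size + c

# The fit matches every measured row exactly.
for size, opcodes in measured.items():
    assert fitted(size) == opcodes

# First differences grow by a constant 2, confirming the quadratic.
first_diffs = [measured[n + 1] - measured[n] for n in (2, 3, 4)]
second_diffs = [first_diffs[i + 1] - first_diffs[i] for i in range(2)]
assert second_diffs == [2, 2]
```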

sirasistant commented 3 months ago

Is there a way to use the as_witness intrinsic to fix this nonlinear scaling, or is it not fixable that way? I haven't looked much into the actual constraints being generated for predicated RAM writes yet.

vezenovm commented 1 month ago

For this program where the SIZE = 2:

global SIZE = 2;

fn main(mut fields: [Field; SIZE], enables: [bool; SIZE], indices: [u64; SIZE]) -> pub [Field; SIZE] {
    for i in 0..SIZE {
        if enables[i] {
            fields[indices[i]] = i as Field;
        }
    }

    fields
}

we get the following SSA:

acir(inline) fn main f0 {
  b0(v0: [Field; 2], v1: [u1; 2], v2: [u64; 2]):
    inc_rc v0
    v69 = array_get v1, index u32 0
    enable_side_effects v69
    v70 = array_get v2, index u32 0
    v71 = cast v70 as u32
    v72 = array_set v0, index v71, value Field 0
    v74 = not v69
    v75 = array_get v72, index v71
    v76 = array_get v0, index v71
    v77 = cast v69 as Field
    v78 = cast v74 as Field
    v79 = mul v77, v75
    v80 = mul v78, v76
    v81 = add v79, v80
    v82 = array_set mut v72, index v71, value v81
    enable_side_effects u1 1
    v83 = array_get v1, index u32 1
    enable_side_effects v83
    v84 = array_get v2, index u32 1
    v85 = cast v84 as u32
    v86 = array_set v82, index v85, value Field 1
    v88 = not v83
    v89 = array_get v86, index v85
    v90 = array_get v82, index v85
    v91 = cast v83 as Field
    v92 = cast v88 as Field
    v93 = mul v91, v89
    v94 = mul v92, v90
    v95 = add v93, v94
    v96 = array_set mut v86, index v85, value v95
    enable_side_effects u1 1
    dec_rc v0
    return v96
}

We have two places where we are not optimally setting up the array gets/sets.

    v72 = array_set v0, index v71, value Field 0
    v74 = not v69
    v75 = array_get v72, index v71
    v76 = array_get v0, index v71

and

    v86 = array_set v82, index v85, value Field 1
    v88 = not v83
    v89 = array_get v86, index v85
    v90 = array_get v82, index v85

For both v76 = array_get v0, index v71 and v90 = array_get v82, index v85, we can perform the array_get before the array_set. Right now we end up writing to and copying the dynamic array when we do not need to do so. For example, we could instead read from v82 before the array_set and then have a mut v82 in the array_set. This would save us having to copy the dynamic array.
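Modeling SSA arrays as Python lists, the reordering described above can be sketched as follows (an illustration of the idea, not compiler code):

```python
# Current order: the non-mut array_set must copy the array because the
# old version is still read afterwards.
def write_then_read(arr, idx, val):
    new_arr = arr.copy()  # non-mut array_set copies the dynamic array
    new_arr[idx] = val
    old = arr[idx]        # dynamic read of the pre-write array
    return new_arr, old

# Proposed order: read the old value first, then set mutably (no copy).
def read_then_write(arr, idx, val):
    old = arr[idx]        # same value, but read before the write
    arr[idx] = val        # mut array_set: in place, no copy needed
    return arr, old

a1, old1 = write_then_read([5, 6, 7], 1, 42)
a2, old2 = read_then_write([5, 6, 7], 1, 42)
assert (a1, old1) == (a2, old2)  # same result, one array copy fewer
```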

This is in line with the conclusion from this comment above:

So it is increasing by 2 per each increase in SIZE.

As we increase SIZE by one we have two extra dynamic reads to copy over the array from the previous SIZE.

jfecher commented 1 month ago

As we increase SIZE by one we have two extra dynamic reads to copy over the array from the previous SIZE.

That sounds like it would just be a change in velocity of 2 extra constraints, not an acceleration of 2 extra constraints. If there were a constant 2 extra constraints per each SIZE increase we'd see no change in acceleration / nonlinear scaling.

(pardon my velocity/acceleration analogy)

vezenovm commented 1 month ago

That sounds like it would just be a change in velocity of 2 extra constraints, not an acceleration of 2 extra constraints. If there were a constant 2 extra constraints per each SIZE increase we'd see no change in acceleration / nonlinear scaling.

For clarity's sake: I am using ACIR gates as my metric, as I find it a bit easier to reason about for ACIR gen than backend constraint count.

If we increase SIZE by one we have two extra dynamic reads to copy over the array from the previous SIZE, but we are also doing those extra reads an extra time. E.g. if we made SIZE = 6, the pattern in the SSA above would repeat for each of the six iterations. So we would be copying and initializing a new array of size 6 six times, rather than copying and initializing a new size-2 array twice. This is what makes the scaling quadratic rather than linear.
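A toy Python count of that copy work (a model of the argument above, not compiler output) shows the quadratic growth directly:

```python
# Each conditional write copies the whole array, so SIZE iterations over
# a SIZE-element array cost on the order of SIZE^2 element reads.
def copy_reads(size):
    reads = 0
    for _ in range(size):  # one conditional write per iteration...
        reads += size      # ...each copying all `size` elements
    return reads

assert copy_reads(2) == 4
assert copy_reads(6) == 36
# Growing SIZE 10x grows the copy work 100x: quadratic, not linear.
assert copy_reads(500) // copy_reads(50) == 100
```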

jfecher commented 1 month ago

Thanks for the explanation, I misunderstood before.

I wonder if https://github.com/noir-lang/noir/pull/5449 would help (or fix :crossed_fingers:) this. With that I'd expect:

    v72 = array_set v0, index v71, value Field 0
    v74 = not v69
    v75 = array_get v72, index v71
    v76 = array_get v0, index v71

To be turned into

    v72 = array_set v0, index v71, value Field 0
    v74 = not v69
    // v75 optimized to constant Field 0
    v76 = array_get v0, index v71

and

    v86 = array_set v82, index v85, value Field 1
    v88 = not v83
    v89 = array_get v86, index v85
    v90 = array_get v82, index v85

to be optimized to:

    v86 = array_set v82, index v85, value Field 1
    v88 = not v83
    // v89 optimized to constant Field 1
    v90 = array_get v82, index v85
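The folding predicted here is the rewrite `array_get(array_set(a, i, v), i) → v`: a read of an array_set's result at the same index is just the stored constant. Modeled on plain Python lists (illustrative only, not the SSA implementation):

```python
# Value-semantics models of the two SSA instructions.
def array_set(arr, idx, val):
    out = arr.copy()  # non-mut array_set yields a new array
    out[idx] = val
    return out

def array_get(arr, idx):
    return arr[idx]

v0 = [3, 4, 5]
v72 = array_set(v0, 1, 0)      # v72 = array_set v0, index 1, value Field 0
assert array_get(v72, 1) == 0  # v75 folds to the constant Field 0
assert array_get(v0, 1) == 4   # v76 still reads the pre-write array
```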
TomAFrench commented 1 month ago

I was having a look at whether #5449 would help on this as well. I get the SSA

After Array Set Optimizations:
acir(inline) fn main f0 {
  b0(v0: [Field; 2], v1: [u1; 2], v2: [u64; 2]):
    inc_rc v0
    v59 = array_get v1, index u32 0
    enable_side_effects v59
    v60 = array_get v2, index u32 0
    v61 = cast v60 as u32
    v62 = array_set v0, index v61, value Field 0
    v64 = not v59
    v65 = array_get v0, index v61
    v67 = cast v64 as Field
    v68 = mul v67, v65
    v69 = array_set mut v62, index v61, value v68
    enable_side_effects u1 1
    v70 = array_get v1, index u32 1
    enable_side_effects v70
    v71 = array_get v2, index u32 1
    v72 = cast v71 as u32
    v73 = array_set v69, index v72, value Field 1
    v75 = not v70
    v76 = array_get v69, index v72
    v77 = cast v70 as Field
    v78 = cast v75 as Field
    v79 = mul v78, v76
    v80 = add v77, v79
    v81 = array_set mut v73, index v72, value v80
    enable_side_effects u1 1
    dec_rc v0
    return v81
}

which lines up with your predictions.

jfecher commented 1 month ago

Testing Maxime's example with that PR on different sizes I get fewer constraints:

size | acir opcodes
2    | 43
3    | 67
4    | 93

But it is still accelerating by 2 each time. (For reference, on master the opcode counts for these sizes are 51, 79, and 109 respectively.)

vezenovm commented 1 month ago

From Tom's new SSA above we get fewer constraints, but we still have this block:

v73 = array_set v69, index v72, value Field 1
v75 = not v70
v76 = array_get v69, index v72

The issue is that this array_get should be able to happen before this array_set. Then v69 can be a mut v69.

jfecher commented 1 month ago

v73 = array_set v69, index v72, value Field 1
v75 = not v70
v76 = array_get v69, index v72

Odd, I'd have expected v76 to be optimized out by Tom's PR. I'd also think we don't want the array_get to be before the array_set though since that'd change the result.

vezenovm commented 1 month ago

I'd also think we don't want the array_get to be before the array_set though since that'd change the result.

v69 is immutable no?

jfecher commented 1 month ago

v69 is immutable no?

Right, my bad. I was staring at that snippet for a bit too long :sweat_smile:

TomAFrench commented 1 month ago

Odd, I'd have expected v76 to be optimized out by Tom's PR.

That would require the read to be performed on v73 rather than v69

jfecher commented 1 month ago

That would require the read to be performed on v73 rather than v69

Right, I forgot how array_set works

TomAFrench commented 1 month ago

We should be able to reorder these array reads/writes; my only concern is that this could change the error message emitted in some cases, e.g. moving an out-of-bounds read earlier in the program, which makes execution appear to fail at a different point than it does today.