scroll-tech / halo2


Multi-input mv-lookup. #49

Closed: spherel closed this 1 year ago

spherel commented 1 year ago

These are the papers related to this lookup argument.

Velaciela commented 1 year ago

Benchmarking on an AWS g5.12xlarge machine (48 vCPUs, 192 GB RAM), scroll-prover (v0.5.10), super circuit integration test (test_chunk_prove_verify).

Note: these results already include the parallel batch invert optimization.

[mv-lookup]

Start: phase2
End: phase2 ....................................................................66.325s
Start: phase3
End: phase3 ....................................................................275.776s

phase4 time total: 2881.969s, peak memory: 119GB, for each part:

··Start: coeff_to_extended_part
··End: coeff_to_extended_part ..................................................109.360s
··Start: custom_gates_time
··End: custom_gates_time .......................................................77.608s
··Start: permutations_time
··End: permutations_time .......................................................25.542s
··Start: lookup_time
····Start: inputs_inv_sum_time
····End: inputs_inv_sum_time ...................................................100.766s
····Start: lookups_iter_time (cosetfftpart and constraints calculation)
····End: lookups_iter_time .....................................................41.214s
··End: lookup_time .............................................................**141.981s**

[lookup]

Start: phase2
End: phase2 ....................................................................367.301s
Start: phase3
End: phase3 ....................................................................306.957s

phase4 time total: 2740.878s, peak memory: 130GB, for each part:

··Start: coeff_to_extended_part
··End: coeff_to_extended_part ..................................................117.650s
··Start: custom_gates_time
··End: custom_gates_time .......................................................77.640s
··Start: permutations_time
··End: permutations_time .......................................................25.552s
··Start: lookup_time (cosetfftpart and constraints calculation)
··End: lookup_time .............................................................**120.937s**
spherel commented 1 year ago

> (quoting Velaciela's benchmark results above)

Phase2 computes `lookup.prepare`, which prepares the compressed input and table expressions and counts the occurrences of table entries in the input. Phase3 computes the running product of the permutation argument and the grand sum of the lookup argument. Phase4 prepares the quotient polynomial for the whole constraint system.
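For readers new to the flow, here is a minimal sketch of the counting step that phase2 performs conceptually. This is not the PR's actual code; it uses `u64` values in place of field elements and a plain hash map in place of the prover's parallel machinery.

```rust
use std::collections::HashMap;

/// Sketch of the multiplicity computation: m[i] counts how many compressed
/// input cells equal the i-th compressed table entry.
fn compute_multiplicities(compressed_inputs: &[u64], compressed_table: &[u64]) -> Vec<u64> {
    // Map each table value to one of its row indices.
    let index_of: HashMap<u64, usize> = compressed_table
        .iter()
        .enumerate()
        .map(|(i, &v)| (v, i))
        .collect();

    let mut m = vec![0u64; compressed_table.len()];
    for &value in compressed_inputs {
        // In the real prover a missing entry means the lookup is unsatisfied;
        // here we simply panic for brevity.
        let i = *index_of.get(&value).expect("input value not in table");
        m[i] += 1;
    }
    m
}

fn main() {
    let table = vec![0u64, 1, 2, 3];
    let inputs = vec![1u64, 1, 3, 0, 1];
    assert_eq!(compute_multiplicities(&inputs, &table), vec![1, 3, 0, 1]);
}
```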

spherel commented 1 year ago

| Argument | MSM_COUNTER (size 1048576) | FFT_COUNTER (size 1048576) | FFT_COUNTER (size 8388608) |
| --- | --- | --- | --- |
| mv-lookup | 1054 | 12431 | 1 |
| lookup | 1652 | 17813 | 1 |
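For context (simple arithmetic on the counters above): mv-lookup performs about 1 − 1054/1652 ≈ 36% fewer size-2^20 MSMs and about 1 − 12431/17813 ≈ 30% fewer size-2^20 FFTs than the original lookup, with the single size-2^23 FFT unchanged.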

huwenqing0606 commented 1 year ago

A brief doc for this PR is available at https://www.overleaf.com/read/vphfbrztgsmp; any comments/corrections would be greatly appreciated!

kunxian-xia commented 1 year ago

> (quoting Velaciela's benchmark results above)

@Velaciela can you update the benchmark using the latest circuit (v0.9.3)?

naure commented 1 year ago

A summary of what happens for K lookups into the same table:

* Split K into batches K1, K2, etc., such that each batch respects the max constraint degree.

Then for each batch:

* Create a multiplicity column m.
* Create an accumulator.
* The constraint works on the difference between steps of the accumulator.

No helper columns are needed, but the batch size is limited by the max degree.

Here is an optimisation:

* The first batch K1 works the same as above.
* But following batches do not need another instance of the entire argument.

Instead, for each following batch:

* Create a helper column containing inverses: `helper = 1 / (denom_0 * denom_1 * …)`
* A constraint verifies the divisions of this batch using the helper column.
* The terms are added to the single accumulator; that does not increase its degree. The multiplicities are also added into the single m.

This way the multiplicity m and the accumulator are amortized across all lookups. With, for example, 4 lookups per helper column, that is a cost of 0.25 columns per lookup + 1 shared multiplicity.

The accumulator counts as the helper of the first batch here. There is also the option to sacrifice this for simplicity, that is, to use the same logic for all batches without the exception for the first batch. In that case, the cost would be 0.25 columns per lookup + 2.
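A hedged sketch, in my own notation (β is the lookup challenge; this is not necessarily the exact formulation used in the PR), of how the shared accumulator and a per-batch helper column fit together:

```latex
% Log-derivative (mv-lookup) identity over the domain H for K inputs into table t,
% with a single multiplicity column m:
\sum_{x \in H} \sum_{i=1}^{K} \frac{1}{\beta + f_i(x)}
  \;=\; \sum_{x \in H} \frac{m(x)}{\beta + t(x)}.

% For a follow-up batch f_1, \dots, f_B, one helper column h is constrained by
h(x) \cdot \prod_{i=1}^{B} \bigl(\beta + f_i(x)\bigr) \;=\; 1,

% so that the batch's contribution to the single accumulator,
\sum_{i=1}^{B} \frac{1}{\beta + f_i(x)}
  \;=\; h(x) \cdot \sum_{i=1}^{B} \prod_{k \neq i} \bigl(\beta + f_k(x)\bigr),

% stays low-degree without introducing a new multiplicity column or accumulator.
```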

(cc @spherel)

kunxian-xia commented 1 year ago

Note that this PR still has compilation errors. Even so, I will merge this PR into the develop-rc branch first, as I don't have access to geometry's repo. Then I will make the necessary changes there. cc @spherel

AlekseiVambol commented 11 months ago

> (quoting huwenqing0606's doc link and naure's batching summary above)

Do I correctly understand that you propose to use one common constraint enforcing [image], where the sum is replaced by the sum of polynomials A_i(x) found using interpolation, each of which corresponds to a separate batch and is constrained by something like [image], where K and the f-polynomials are batch-specific?

AlekseiVambol commented 11 months ago

> a brief doc for this PR https://www.overleaf.com/read/vphfbrztgsmp, any comments/corrections would be greatly appreciated!

Good idea and nice paper draft! Two things can be done to make it even better:

1. It would be beneficial for people who are not too familiar with the transformation of abstract polynomial protocols into concrete ones by means of polynomial commitments to describe the motivation for transforming (4) into (5) by multiplying both sides by the same polynomial. If I am not mistaken, I would propose describing this in the following way:

"Some identity A(x) = B(x) over some domain, which in an abstract polynomial protocol would be checked by a trusted party, in the case of a concrete protocol based on the polynomial commitment is checked as follows:

2. The typo on page 3 should be fixed: "1" should appear instead of "0" [image].
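For reference, and purely as an assumption about where the truncated explanation above was heading (not the author's wording, and without access to the doc's equations (4) and (5)), the standard way such an identity check becomes concrete is:

```latex
% Checking A(x) = B(x) for all x in a domain H with a polynomial commitment scheme:
A(x) = B(x) \ \ \forall x \in H
  \;\Longleftrightarrow\;
  A(X) - B(X) = q(X) \cdot Z_H(X) \ \text{for some polynomial } q,

% where Z_H is the vanishing polynomial of H. The prover commits to q; the verifier
% samples a random \zeta and checks that the opened values satisfy
A(\zeta) - B(\zeta) = q(\zeta) \cdot Z_H(\zeta),

% which is sound with high probability by the Schwartz--Zippel lemma. Clearing
% denominators (multiplying both sides by the same polynomial) is what turns a
% rational identity into a polynomial one that this machinery can handle.
```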