Closed · spherel closed this 11 months ago
benchmarking on an AWS g5.12xlarge machine (48 vCPUs, 192 GB RAM)
scroll-prover (v0.5.10)
super circuit integration test (test_chunk_prove_verify)
note: the results include the parallel batch invert optimization
[mv-lookup]
Start: phase2
End: phase2 ....................................................................66.325s
Start: phase3
End: phase3 ....................................................................275.776s
phase4 time total: 2881.969s, peak memory: 119GB, for each part:
··Start: coeff_to_extended_part
··End: coeff_to_extended_part ..................................................109.360s
··Start: custom_gates_time
··End: custom_gates_time .......................................................77.608s
··Start: permutations_time
··End: permutations_time .......................................................25.542s
··Start: lookup_time
····Start: inputs_inv_sum_time
····End: inputs_inv_sum_time ...................................................100.766s
····Start: lookups_iter_time (cosetfftpart and constraints calculation)
····End: lookups_iter_time .....................................................41.214s
··End: lookup_time .............................................................**141.981s**
[lookup]
Start: phase2
End: phase2 ....................................................................367.301s
Start: phase3
End: phase3 ....................................................................306.957s
phase4 time total: 2740.878s, peak memory: 130GB, for each part:
··Start: coeff_to_extended_part
··End: coeff_to_extended_part ..................................................117.650s
··Start: custom_gates_time
··End: custom_gates_time .......................................................77.640s
··Start: permutations_time
··End: permutations_time .......................................................25.552s
··Start: lookup_time (cosetfftpart and constraints calculation)
··End: lookup_time .............................................................**120.937s**
Phase2 computes `lookup.prepare`, including preparing the compressed input and table expressions and counting the occurrences of table entries in the input.
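For intuition, here is a minimal, self-contained sketch of the multiplicity counting step; it uses plain integers instead of field elements, and `count_multiplicities` is an illustrative name, not the actual halo2/scroll-prover function:

```rust
use std::collections::HashMap;

/// Illustrative sketch: for each table row, count how many input cells equal
/// that row's (compressed) value. This is the multiplicity column m that the
/// mv-lookup argument commits to in phase 2.
fn count_multiplicities(compressed_input: &[u64], compressed_table: &[u64]) -> Vec<u64> {
    // Map each table value to the index of its row.
    let index: HashMap<u64, usize> = compressed_table
        .iter()
        .enumerate()
        .map(|(i, &v)| (v, i))
        .collect();

    // m[i] = number of input cells whose value equals compressed_table[i].
    let mut m = vec![0u64; compressed_table.len()];
    for v in compressed_input {
        if let Some(&i) = index.get(v) {
            m[i] += 1;
        }
        // A real prover would treat an input value missing from the table as an error.
    }
    m
}

fn main() {
    let table = [1, 2, 3];
    let input = [2, 2, 3, 3];
    assert_eq!(count_multiplicities(&input, &table), vec![0, 2, 2]);
}
```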
Phase3 computes the running product of the permutation argument and the grand sum of the lookup argument.
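Concretely, the grand sum of the mv-lookup (logUp-style) argument enforces an identity of roughly this shape, where the f_i are the compressed inputs, t the compressed table, m the multiplicities prepared in phase 2, β a challenge, and H the evaluation domain (a generic sketch of the standard formulation, not necessarily the exact notation used in this PR):

```latex
% The per-row inverse sums of the inputs must balance the table term
% weighted by the multiplicity column m:
\[
  \sum_{x \in H} \left( \sum_{i=1}^{K} \frac{1}{f_i(x) + \beta}
      - \frac{m(x)}{t(x) + \beta} \right) = 0 .
\]
% This is enforced with a running-sum ("grand sum") column \phi:
\[
  \phi(\omega x) - \phi(x)
    = \sum_{i=1}^{K} \frac{1}{f_i(x) + \beta} - \frac{m(x)}{t(x) + \beta},
  \qquad \phi \text{ starting and ending at } 0 .
\]
```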
Phase4 prepares the quotient polynomial for the whole constraint system.
MVLookup MSM_COUNTER: {1048576: 1054} FFT_COUNTER: {1048576: 12431, 8388608: 1}
Lookup MSM_COUNTER: {1048576: 1652} FFT_COUNTER: {1048576: 17813, 8388608: 1}
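Reading those counters side by side (plain arithmetic on the numbers above), mv-lookup saves roughly a third of the size-2^20 operations:

```latex
% MSMs of size 2^20:  1652 - 1054 = 598 fewer
% FFTs of size 2^20:  17813 - 12431 = 5382 fewer
\[
  \frac{1652 - 1054}{1652} \approx 0.36, \qquad
  \frac{17813 - 12431}{17813} \approx 0.30 .
\]
```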
A brief doc for this PR: https://www.overleaf.com/read/vphfbrztgsmp. Any comments/corrections would be greatly appreciated!
@Velaciela can you update the benchmark using the latest circuit (v0.9.3)?
Note that this PR still has compilation errors. Even so, I will merge this PR into the develop-rc branch first, as I don't have access to geometry's repo. Then I will make the necessary changes there.
cc @spherel
A summary of what happens for K lookups into the same table:
* Split K into batches K1, K2, etc, such that each batch respects the max constraint degree.
Then for each batch:
* Create a multiplicity column m.
* Create an accumulator.
* The constraint works on the difference between steps of the accumulator.
No helper columns are needed, but the batch size is limited by max degree.
Here is an optimisation:
* The first batch K1 works the same as above.
* But following batches do not need another instance of the entire argument.
Instead, for each following batch:
* Create a helper column containing inverses: `helper = 1 / (denom_0 * denom_1 * …)`
* A constraint verifies the divisions of this batch using the helper column.
* The terms are added to the single accumulator. That does not increase its degree. The multiplicities are also added into the single m.
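As a sketch of what a following batch might look like, write d_j(x) = f_j(x) + β for the B denominators of that batch (illustrative notation, not taken verbatim from this thread or the PR):

```latex
% The helper column stores the inverse of the product of the batch's denominators:
\[
  h(x) \cdot \prod_{j=1}^{B} d_j(x) = 1 .
\]
% With h in hand, the batch's sum of inverses becomes an expression with no new
% inverses, which can be folded into the single shared accumulator:
\[
  \sum_{j=1}^{B} \frac{1}{d_j(x)}
    = h(x) \cdot \sum_{j=1}^{B} \prod_{k \neq j} d_k(x) .
\]
% The degree of the first constraint grows with B, which is why the batch size
% is bounded by the maximum constraint degree.
```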
This way the multiplicity m and the accumulator are amortized across all lookups. With, for example, 4 lookups per helper column, that's a cost of 0.25 columns per lookup + 1 shared multiplicity.
The accumulator counts as the helper of the first batch here. There is also the option to sacrifice this for simplicity; that is, use the same logic for all batches without the exception for the first batch. In that case, that would be 0.25 per lookup + 2.
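A small worked example of that accounting (numbers picked purely for illustration): with 16 lookups split into batches of 4, the optimized layout needs

```latex
\[
  \underbrace{1}_{\text{accumulator}}
  + \underbrace{1}_{\text{shared } m}
  + \underbrace{16/4 - 1}_{\text{helper columns}}
  = 5 \text{ columns},
\]
% versus 4 batches x (1 multiplicity + 1 accumulator) = 8 columns when each
% batch carries its own full instance of the argument.
```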
(cc @spherel)
Do I correctly understand that you propose to use one common constraint enforcing the overall sum, where the sum is replaced by the sum of the polynomials A_i(x) found using interpolation, each of which corresponds to a separate batch and is constrained by something like this, where K and the f-polynomials are batch-specific?
Good idea and nice paper draft! Two things can be done to make it even better:
"Some identity A(x) = B(x) over some domain, which in an abstract polynomial protocol would be checked by a trusted party, in the case of a concrete protocol based on the polynomial commitment is checked as follows:
These are the papers related to this lookup argument.