blake2f precompile - Githubissues

Checklist:

[ ] Can do multiple blake2f hash calculations in a single circuit
[ ] For each of those blake2f hash calculations, can do any number of rounds
[ ] Easy to change the layout (number of columns) so different work loads can be handled
[ ] Possible to do 5M rounds in a circuit with height <= 2**25
[ ] Doc explaining how the circuit works on a high level
[ ] Challenge API integration
[ ] There's a blake2f table struct similar to this one for keccak: https://github.com/privacy-scaling-explorations/zkevm-circuits/blob/bfbedfd1f579fdcee75679773e526b18eefcf10a/zkevm-circuits/src/table.rs#L865. This it the API that will be used by external circuits to lookup blake2f hashes in e.g. for keccak other circuits simple lookup the hash of some input data like this: (1, input_rlc, num_bytes, hash). The format for blake2 will probably be a bit different because of how the blake2f precompile works, perhaps something like this
- is_enabled column: Set to 1 only for rows that contain valid hash for the data in the column below, otherwise needs to be set to 0 (so no garbage data can be lookup up). In other words is_enabled is set to 1 once per blake2f once all data has been verified and the hash is calculated.
- num_rounds column: the number of rounds that produced the blake2f hash
- state_rlc column: the rlc of the state vector data
- message_rlc column: the rlc of all the input bytes used in the hash
- offset_counter_0',offset_counter_1' columns: the offset counters
- f column: final block indicator
- lo_hash column : the lowest 128 bits of the blake2 hash
- hi_hash column : the highest 128 bits of the blake2 hash

Description:

Implementation probably similar to sha256, so a good idea to first look at that implementation and use it as a base: https://github.com/privacy-scaling-explorations/zkevm-circuits/pull/756 Presentation: https://docs.google.com/presentation/d/1eHVzLPRIEziCJ_oDuJG0yZ-5klNj_daqioC0UaGDaiw/edit#slide=id.p

Some thoughts on a possible approach.

Data layout

State

For the layout of the state relative offsets can be used to store the 4 state inputs used in G. There's 2 ways the data is stored, we just have to alternate between the two for each round. See below what relative offsets are necessary. There's also the choice to store the necessary data in a single row or a single column, though the row approach seems to make more sense to me for now.

#### The data stored in a row:

0, 4,  8, 12
1, 5,  9, 13
2, 6, 10, 14
3, 7, 11, 15

     ^
     |

0, 5, 10, 15
1, 6, 11, 12
2, 7,  8, 13
3, 4,  9, 14

---------------

-4, -3, -2, -1
-4, -3, -2, -5
-4, -3, -6, -5
-4, -8, -6, -5

===============

0, 5, 10, 15
1, 6, 11, 12
2, 7,  8, 13
3, 4,  9, 14

     ^
     |

0, 4,  8, 12
1, 5,  9, 13
2, 6, 10, 14
3, 7, 11, 15

-------------

-4, -1, -2, -3
-4, -5, -2, -3
-4, -5, -6, -3
-4, -5, -6, -7

#### The data stored in a column:

 0,  1,  2,  3
 4,  5,  6,  7
 8,  9, 10, 11
12, 13, 14, 15

     ^
     |

 0,  1,  2,  3
 5,  6,  7,  4
10, 11,  8,  9
15, 12, 13, 14

---------------

( 0,-4), ( 0,-4), ( 0,-4), ( 0,-4)
( 1,-4), ( 1,-4), ( 1,-4), (1,-4)
( 2,-4), ( 2,-4), (2,-4), (2,-4)
( 3,-4), ( 3,-4), (3,-4), (3,-4)

=============================

 0,  1,  2,  3
 5,  6,  7,  4
10, 11,  8,  9
15, 12, 13, 14

     ^
     |

 0,  1,  2,  3
 4,  5,  6,  7
 8,  9, 10, 11
12, 13, 14, 15

---------------

( 0,-4), ( 0,-4), ( 0,-4), ( 0,-4)
( 3,-4), ( 3,-4), ( 3,-4), ( 3,-4)
( 2,-4), ( 2,-4), ( 2,-4), ( 2,-4)
( 1,-4), ( 1,-4), ( 1,-4), ( 1,-4)

Which rotation to use is decided by a selector in a fixed column (because it's always the same for every round). Alternatively we can also use the lookup approach explained below for m to load in the correct state word on each row, but I think that may be bit less efficient (though could be more flexible, but that's not that important if the number of columns is reasonable).

Message

Two words are required per G and are stored on the row. Which word to use is decided by Sigma and differs per G and per round. The amount of rounds is not fixed.

In the first round we just load in the data, 2 words per row. In the other rounds we have to make sure that this permutated data is the same as the data in the first row. If the number of round was fixed this would be easy, we could just use copy constraints to hardcode the Sigma matrix into the circuit. But the amount of rounds is an input so this approach won't work. Instead we can make use of lookups. In the first round the message values are loaded into a lookup table as (hash_idx, m_idx, value) (note that it's easier to just generate this lookup table on the fly, so this data is not stored in a separate set of columns).. This will allow us to lookup the correct m value in all other rounds. We also need to lookup the correct index into m for the current round we're in. This we can do by using a fixed table column where we load in SIGMA for all possible rounds. Then to get the correct m_idx for a specific round for a specific row we first lookup (round_idx, row_idx, m_idx_a, m_idx_b) in this SIGMA table.

So we have to keep track of some identifiers while hashing:

hash_idx: each time a new hash is started we increment this value (advice column)
round_idx: incremented for each round (advice column)
row_idx: which G call we currently have to do (fixed column)

G

v[a] := (v[a] + v[b] + x) mod 2**w
v[d] := (v[d] ^ v[a]) >>> R1

v[c] := (v[c] + v[d])     mod 2**w
v[b] := (v[b] ^ v[c]) >>> R2

v[a] := (v[a] + v[b] + y) mod 2**w
v[d] := (v[d] ^ v[a]) >>> R3

v[c] := (v[c] + v[d])     mod 2**w
v[b] := (v[b] ^ v[c]) >>> R4

With these calculations I actually think it makes more sense to use the lookup approach and not a bit approach. There's also this extra intermediate result that's used after the mod operations that makes it not ideal. There's also not a lot of reuse that can happen by using bits because the values cannot be reused a lot more when we use bits. The rotation is also mostly byte based + only a single rotation needed so the splitting into parts can easily be done efficiently. So I would just split up the 64-bit words in parts that make the most sense and minimizes the amount of lookups and columns needed to store those parts.

Layout

Main layout would look something like this I think:

| hash_idx | round_idx | row_idx | v_a | v_b | v_c | v_d | x | y | misc helper columns....

Gas considerations

A round only costs 1 gas, which is pretty insane. The circuit may need to be flexible enough to change the layout so that only a single round/row is used. I don't think this will need that many columns, but this is the absolute worst case scenario so shouldn't be the only layout supported.

Notes

v[a] := (v[a] + v[b] + x) mod 2**w

v[a]  64 bits:
chunks of 8 bits -> 8 parts + 1overflow part 

~ 20 chunks

(hash_idx, m_idx, value)

if (round == 0) {
meta.lookup_any(name, |meta| {
    [
         (0.expr(), hash_idx.expr()),
         (0.expr(), m_idx.expr()),
         (0.expr(), value.expr()),
    ]
 });
} else {
meta.lookup_any(name, |meta| {
    [
         (hash_idx.expr(), 0.expr()),
         (m_idx.expr(), 0.expr()),
         (value.expr(), 0.expr()),
    ]
 });
}

meta.lookup_any(name, |meta| {
    [
         (hash_idx.expr(), (round_idx == 0) * hash_idx.expr()),
         (m_idx.expr(), (round_idx == 0) * m_idx.expr()),
         (value.expr(), (round_idx == 0) * value.expr()),
    ]
 });

meta.lookup_any(name, |meta| {
    [
         (hash_idx.expr(), q_enable() * hash_idx.expr()),
         (0,                    q_enable() * round_idx.expr()),
         (m_idx.expr(), q_enable() * m_idx.expr()),
         (value.expr(), q_enable() * value.expr()),
    ]
 });

taikoxyz / zkevm-circuits

blake2f precompile #51