Open dane-git opened 3 weeks ago
I calculate the Blake3-256 SAID to be 'EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ'
which matches the output of KERIpy.
Where did you get the EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk
value?
And what is the algorithm you are using? Blake3-256, SHA3-256, or something else?
The change you suggested doesn't make sense because you just omitted the return
statement.
sorry. yes the change needed may be:
return toBytes(code + encodeBase64Url(Buffer.from(combined).slice(ps)))
Expected said from cesr-specification: https://trustoverip.github.io/tswg-cesr-specification/#self-addressing-identifier-said
getting that when running this in python. not using any keripy code, just trying to understand so if incorrect... oops.
import json
import blake3
import base64
def dict_to_said_str(data_dict):
# Convert the dictionary to a JSON string without extra spaces after commas and colons
json_str = json.dumps(data_dict, separators=(',', ':'))
return json_str
def get_blake3_256_digest(data_dict):
# Convert the dictionary to a JSON string and then to bytes
# json_str = json.dumps(data_dict)#, sort_keys=True) # Sort keys for consistent hashing. not in keri. is insert order
json_str = dict_to_said_str(data_dict)
data_bytes = json_str.encode('utf-8')
# Compute the BLAKE3-256 digest
hash_obj = blake3.blake3(data_bytes)
digest = hash_obj.hexdigest()
return digest
def hex_to_base64(hex_str, is_keri=False, prefix_code='E'):
# Decode the hex string to bytes
byte_data = bytes.fromhex(hex_str)
# Encode the bytes to base64
base64_str = base64.b64encode(byte_data).decode('utf-8')
if is_keri:
base64_str = base64_str.replace('+','-').replace('/','_').replace('=', '')
base64_str = prefix_code +base64_str
return base64_str
# Example usage
data = {
'd': '############################################',
'attr1': 'value1',
'attr2': 'value2',
'attr3': 'value3',
}
# data = {
# 'd': '',
# 'attr1': 'value1',
# 'attr2': 'value2',
# 'attr3': 'value3',
# }
f = {
"said": "############################################",
"first": "Sue",
"last": "Smith",
"role": "Founder"
}
# digest = get_blake3_256_digest(my_dict)
print(len(f['said']))
digest = get_blake3_256_digest(f)
print(f"BLAKE3-256 Digest: {digest}")
print(hex_to_base64(digest, True))
my_said = hex_to_base64(digest, True)
spec_said = "EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk"
print(my_said)
print(spec_said)
print(my_said == spec_said)
First, the SAID EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk
in the spec is incorrect as it is missing the step of "fully qualifying" the raw bytes of the digest, meaning pre-pending the correct number of zero bytes prior to performing the Base64URLSafe encoding, and replacing the zero bytes with the correct derivation code from the master code table. This performs what is called aligning the encoded data on 24 bit boundaries. We need to submit a PR to the spec to correct the digest to be 'EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ'
.
Second, your code is missing same thing, aligning the bytes of your raw digest bytes on 24 bit boundaries prior to performing a Base64URLSafe encoding. This alignment on 24-bit boundaries with prepended zero bytes is a CESR-specific encoding feature. I will walk you through the math to get to 24-bit boundaries.
Third, there is no need to convert the digest to and from hex. That code is entirely unnecessary and you can remove it.
Now for the explanation of the root cause of your incorrect digest. This is a long reply because it's been too long since I read the spec and I needed a refresher, so you get to be the beneficiary of my need to re-read. This comment is the start of a blog post to explain concepts related to CESR-encoded SAIDs.
The root of of the difference in your digest from a valid digest comes from something you may have missed while reading the spec, the deeper meaning of the following language in section section 12.6.1:
... Encode the computed digest with CESR [CESR] to create the final derived and encoded SAID of the same total length as the dummy string and the copied embedded SAID. ...
When the spec says to "Encode the computed digest with CESR" then it means that in order to get a CESR-spec compliant digest that you must encode the raw bytes of your digest in the same way that CESR does. CESR uses 24-bit alignment for all of its encodings.
Note: in this answer when I say Base64 I always mean the Base64URLSafe version of that encoding.
What doest that mean and why is it used? When the spec says "create the final derived and encoded" SAID then what do derived and encoded mean?
Encoded means that the correct number of pad bytes have been prepended to the raw bytes of the cryptographic primitive prior to encoding as Base64URLSafe characters. These pad bytes result in one or more "A" pad characters representing zero bits being prepended to the front (left of) the resulting Base64URLSafe encoding of the combination of the pad bytes and the raw cryptographic primitive.
This creates a non-final string of encoded Base64URLSafe characters that looks like this:
AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
Note the "A" character at the beginning of the string.
Derived means that the pad "A" characters are replaced with the correct Base64URLSafe derivation code string characters from the appropriate entry in the master code table.
This results in a final string of encoded, derived characters that looks like this:
AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
↓
EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
Note the "A" was replaced with the "E" entry as shown in the following entry from the master code table.
Code | Description | Code Length | Count Length | Total Length |
---|---|---|---|---|
E | Blake3-256 Digest | 1 | 44 |
What it means is that when you get the raw bytes of your digest from your selected hashing library, in this case Blake3 256, then you must prepend blank (zero) bytes on the front of that raw output in order to align the resulting size of the prepended byte array on a 24 bit boundary. The "A" character represents these zero bytes in the Base64 URLSafe encoding as the number zero corresponds to the first character "A" in the zero-indexed Base64URLSafe character array.
Why a 24 bit boundary? This is the "composability" property of CESR, Composable Event Streaming Representation. CESR's support for simple "round-trippable, lossless conversion between Text and Binary domain representations" requires alignment on 24 bit boundaries, which is really means that there is no overlap in Base64URLSafe characters for two primitives that are next to each other in a stream.
Aligning on 24 bit boundaries allows incremental and efficient convertibility and pipelining for parsers processing the raw bytes of cryptographic primitives. This alignment allows all at once (en-masse) conversion of a CESR stream back and forth between the Base64 text domain and the binary domain. It also enables pipelining (spread out to multiple CPUs) the load of parsing CESR streams and simpler parser implementations because the parser can easily count the bytes of a padded primitive and strip it out of the stream incrementally. Without clean 24 bit boundaries then the stream parser would have to wait until the entire stream is received in order to start parsing and detect boundaries between primitives which means writing an incremental parser would be more complicated and less performant. Separating primitives with 24 bit boundaries ensures there are no overlapping bytes because there are no primitives with overlapping bytes due to always aligning on 24 bit boundaries.
Using zero-padded 24 bit boundaries provides a clean, complete separation between Base64-encoded cryptographic primitives. By clean I mean that no Base64 character encodes data for more than one underlying cryptographic primitive, like a digest or other attachment (primitive).
If you have primitives that break the 24 bit boundary and overlap Base64 bytes then there's no clean way to convert them en-masse without parsing the entire stream because you don't know where one cryptographic primitive begins and the next one ends without parsing the entire stream as Base64, converting it to raw bytes, and then counting bytes per primitive prior to instantiating classes from raw primitives. Sharing Base64 characters would require a complete parsing of the byte stream in order to process any primitive, which is not efficient and is not easily or simply round-trippable.
24 is the number of bits to use as the bit boundary for alignment because 24 is the least common multiple of 6 and 8. Why 6 and 8? 6 is the number of bits that can be encoded in a Base64 character. 8 is the number of bits that can be stored in one byte. One of the design goals of CESR was to avoid sharing bits between Base64 characters so that you could easily pipeline the TLV (type length, value) encoded CESR primitives by just sniffing the front of the raw bytes of the primitive which is where the count code is ("E" in the case of the Blake3-256 digest in your code).
In order to avoid sharing bits between Base64URLSafe characters then you must do something that enables you to neatly slice up an 8-bit bounded byte stream into a 6-bit bounded byte stream. Pre-padding with empty zero bytes is the way CESR accomplishes this, and thus the way Saidify accomplishes this in the qb64b function, which is really just a reinterpretation of the Matter._infil function from KERIpy.
For the provided example data:
f = {
"said": "############################################",
"first": "Sue",
"last": "Smith",
"role": "Founder"
}
the unencoded, non-padded Blake3-256 raw digest is nKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk
which is 32 bytes and 43 Base64 characters long (42 + 1 for the remainder).
Here's how the math works: Since the kind of Blake3-256 digest we are using for SAIDs is 32 bytes long then it would take at least 43 Base64 characters to encode all 32 bytes. This is calculated as follows:
32 bytes * 8 bits per byte = 256 bits 256 bits / 6 bits per Base64 character = 42.667 Round 42.667 up to 43 since you can't have a fractional part of a Base64 character, you can only have whole Base64 charcters.
The plain digest is not self-framing so we need to add a self-framing code from the CESR master code table. For the Blake3-256 digest the Total Length in Base64URLSafe characters we see the entry with the "E" derivation code:
Code | Description | Code Length | Count Length | Total Length |
---|---|---|---|---|
E | Blake3-256 Digest | 1 | 44 |
As we see we have 43 Base64 characters, not 44, so where does this 44th character come from?
The 44th character, prepended onto the front of the Base64 encoded digest, comes from the pad size indicated by the self-framing derivation code that is prefixed on the front (left side) of the primitive. For example, in the following digest the 'E' character is a Base64 character that is prefixed on the rest of the primitive: | derivation code (1) | primitive (43) | combined primitive (44) |
---|---|---|---|
E | LLbizIr2FJLHexNkiLZpsTWfhwUmZUicuhmoZ9049Hz | ELLbizIr2FJLHexNkiLZpsTWfhwUmZUicuhmoZ9049Hz |
In order to be composable and self-framing then the raw primitive must align on a 24-bit boundary. 256 does not align on a 24 bit boundary as it is not a multiple of 24. We have to add pad bytes (zero bytes) until we get to a multiple of 24 to reach this alignment. Creating this alignment typically occurs with some sort of pad character which in most cases is a zero byte of some sort.
In a non-CESR compliant, naive Base64 encoding (not URL-safe, not composable) then equals signs "=" representing zero bytes are used to pad the end of a naive Base64 encoded value to ensure alignment on 24-bit boundaries. If you have used regular (naive) Base64 encoding then you have seen the equals signs at the end of naive Base64 encoded values. In CESR, rather than using appended equals signs (zero bytes) then it uses prepended 'A' characters (zero bytes) that are concatenated to the raw bytes of the underlying cryptographic primitive.
Back to our calculation, as the spec says,
in order to cleanly capture integer multiples of twenty-four bits of information, Primitive lengths MUST be integer multiples of either four Base64 text characters or three binary bytes.
Let's start from raw digest bytes.
So to make a Blake3-256 digest that is 32 bytes long a multiple of 24 bits then we need to add one more byte. 264 is the multiple of 24 that is closest to 256.
264 bits / 8 bits per byte = 33 bytes
264 bits - 256 bits = 8 bits or 1 byte
32 bytes for the digest + 1 pad byte = 33 bytes.
So 33 bytes is the byte-representation of the closest multiple of 24 bits.
Take our 32 byte Blake3-256 digest and add one zero byte on the front of it to get 33 bytes. In KERIpy parlance this the start of what is called a fully qualified
cryptographic primitive. Prepending the correct number of zero bytes is the first of two steps to "fully qualify" the raw byte array of the cryptographic primitive. It is understood that when converting the cryptographic primitive from the fully qualified, encoded format back to the raw, in-memory format that the prepended zero bytes will be stripped.
Primitive Raw Bytes (unencoded)
|r32:r31:r30:r29:28:r27:r26:r25:r24:r23:r22:r21:r20:r19:r18:r17:r16:r15:r14:r13:r12:r11:r10:r9:r8:r7:r6:r5:r4:r3:r2:r1:r0|
Pad Byte of zeroes (unencoded)
|p33|
Padded Primitive - Raw Bytes
|p33 + r32:r31:r30:r29:28:r27:r26:r25:r24:r23:r22:r21:r20:r19:r18:r17:r16:r15:r14:r13:r12:r11:r10:r9:r8:r7:r6:r5:r4:r3:r2:r1:r0|
Once you have the padded, raw bytes then you encode them as a Base64URLSafe string.
The encoding of the raw, unpadded bytes of the Blake3-256 digest from above looks like the following. It is 43 Base64URLSafe characters, representing 256 bits (32 bytes), long:
nKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk
Yet this is incorrect because it does not include the pad byte required to align on a 24 bit boundary.
Adding the pad byte, bringing us to a multiple of 24, up to 264, gives us a different encoded result that is 44 Base64 characters long:
AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
You can see this adds the pad character 'A', since they all represent zero, for the first six bits, in the 33rd byte.
The below representation text visualizes (one indexed, not zero indexed) both the raw bytes on the top as well as the encoded Base64 bytes on the bottom where P33 is the left-most padded zero byte and R32 is the leftmost byte of the raw primitive that includes two zero bits from the single pad byte.
P33
= leftmost pad byte of zero bits
R32
= leftmost raw primitive byte
R31
= second to last left raw primitive byte
EP44
= leftmost encoded Base64 pad character ('A')
EP43
= second encoded Base64 character, some pad bytes (zeroes) from P33, some raw primitive bytes from R32
ER42
= first encoded Base64 character that is all raw primitive bytes
ER42
= second encoded Base64 character that is all raw primitive bytes
byte index -> | P33 | R32 | R31 | ...
bit index -> |p7:p6:p5:p4:p3:p2:p1:p0|r7:r6:r5:r4:r3:r2:r1:r0|r7:r6:r5:r4:r3:r2:r1:r0| ...
bit label -> | ------ PAD BITS ----- | ------- RAW PRIMITIVE BITS OF DIGEST -------- | ...
Raw Bits -> | 0: 0: 0: 0: 0: 0: 0: 0| 1: 0: 0: 1: 1: 1: 0: 0| 1: 0: 1: 0: 0: 1: 1: 0| ...
Base64 Index -> | EP44 | EP43 | ER42 | ER41 | ...
Base64 Char -> | A | J | y | m | ...
To complete the full qualification of the cryptographic primitive then you need to replace the "A" Base64URLSafe characters (padded zero bytes) with the correct characters corresponding to the derivation code from the master code table for your cryptographic primitive. The way the derivation codes were designed is so that when they replace the "A" characters that they take up the exact same count of characters and bytes to preserve the 24-bit boundary alignment. For the Blake3-256 digest this is as follows:
Code | Description | Code Length | Count Length | Total Length |
---|---|---|---|---|
E | Blake3-256 Digest | 1 | 44 |
Where does the 44 come from in the total length? The total length here is not bytes, it is Base64URLSafe characters. To calculate the number of bytes needed to hold a 44 Base64 character primitive then you must convert the number of Base64 encoded bits to bytes. Since each Base64 character encodes 6 bits of information then you multiply 44 * 6.
44 Base64 characters (Total Length) * 6 bits per Base64 character = 264 bits
264 is a multiple of 24: 11 * 24 = 264
264 is the closest multiple of 24 to 256 and 256 is the number of bits in our 32 byte digest (32 * 8 = 256).
To get the number of bytes we need then we take 264 bits and divide by 8 bits per byte, giving us 33 bytes. We only need to add one zero byte to the Base64URLSafe encoding of the raw bytes (32) to get to a multiple of 24 bits (33 bytes / 264 bits).
256 bits (32 bytes) + 8 zero pad bits (1 byte) = 264 bits (33 bytes)
Sections 9.3 and 9.3.1 of the spec detail converting between 6 and 8.
Now that you have the padded cryptographic primitive then you convert it to the Base64URLSafe encoding as called out in the spec. This is the first step to creating the fully qualified cryptographic primitive.
We take the Base64 encoding of the raw, padded bytes:
AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
replace the 'A' with the 'E' derivation code, and arrive at the fully qualified cryptographic primitive, completely CESR encoded:
EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
The code you started from
data = {
"said": "############################################",
"first": "Sue",
"last": "Smith",
"role": "Founder"
}
digest = make_blake3_raw_digest(data) # raw bytes of `nKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk`
aligned = align_on_24_bits(digest) # raw bytes of `AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ`
encoded = substitute_derivation_code(aligned) # returns `EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
I added the fictional functions make_blake3_raw_digest
, align_on_24_bits
, and substitute_derivation_code
to illustrate the process. You can find deriveBlake3_256 and qb64b that accomplish these needs.
So, while you did the initial Blake3-256 digest step correctly you missed the two steps of aligning on a 24 bit boundary and substituting the zero-bit 'A' pad characters with the correct derivation code from the CESR master code table.
I know this is a long read yet it is comprehensive. Let me know if you need further clarification. I will close this issue in a few days since I have exhaustively answered it.
Wow! Fantastic. Thanks Kent!
Indeed I did not fully understand or implement the 24 bit alignment part. Was just trying to get there a bit too fast referencing the SAID in the spec, for the part I thought I understood.
I was just using the to hex part as a lazy way to get to the bytes of the digest. In this case, it seems to work to simply add the missing '0' bits (0x00) to the hex digest as a couple of hex zeros to get the missing 8 bits and satisfy the alignment, but that would only work if the deviation was some multiple of 4.
for my own understanding I have re-written the code to hopefully get the correct SAID. I realize below may not be the correct or efficient way to do it, just trying to get to a point of actual understanding. Thanks for your help with this!
import json
import blake3
import base64
CORRECT_SAID = 'EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ' # per Kent
f = {
"said": "############################################",
"first": "Sue",
"last": "Smith",
"role": "Founder"
}
def blake3_to_byte_array(data):
# compute the BLAKE3 hash and get the bytes (which is a byte array)
hasher = blake3.blake3()
hasher.update(data)
byte_array = hasher.digest()
return byte_array
def dict_to_said_str(data_dict):
# Convert the dictionary to a JSON string without extra spaces after commas and colons
json_str = json.dumps(data_dict, separators=(',', ':'))
return json_str
def blake3_256_from_dict(dictionary):
# get blake3_256 byte array from dictionary
## dictionary -> compliant string -> encoded bytes -> blake3 -> byte array
return blake3_to_byte_array(dict_to_said_str(dictionary).encode('utf-8'))
def nearest_higher_multiple(x, n):
# Calculate the nearest higher multiple of n greater than or equal to x
if x % n == 0:
return x
else:
return ((x // n) + 1) * n
def calc_pad_bits(byte_array, alignment = 24):
# get bits in byte_arr
bit_num = len(byte_array) * 8
# get the pad bits needed based on num of bits in byte array
nhm = nearest_higher_multiple(bit_num, alignment)
return nhm - bit_num
def pad_byte_array(byte_array, num_bits, bit_value=0):
# ensure bit_value is either 0 or 1
if bit_value not in (0, 1):
raise ValueError("bit_value must be 0 or 1")
# calculate full bytes and remaining bits to pad
full_bytes, remaining_bits = divmod(num_bits, 8)
# create the full byte padding
pad_byte = 0xFF if bit_value == 1 else 0x00
padding = bytes([pad_byte]) * full_bytes
# if there are remaining bits to pad
if remaining_bits > 0:
# make a partial byte for the remaining bits
partial_byte = (0xFF >> (8 - remaining_bits)) if bit_value == 1 else 0x00
partial_byte <<= (8 - remaining_bits)
padding += bytes([partial_byte])
# concatenate the padding and the original byte array
return padding + byte_array
def byte_array_to_urlsafe_base64(byte_array):
# encode the byte array to a Base64 string
base64_str = base64.urlsafe_b64encode(byte_array).decode('utf-8')
return base64_str
def substitute_derivation_code(aligned, deriv_code, padded):
# each base64 is 6 bits.
# calc prepended base64 chars and remove, then add derivation code.
return deriv_code + aligned[padded//6:]
# get blake3 byte arr from dictionary per keri specs
blake3_byte_arr = blake3_256_from_dict(f)
# get pad bits needed based on byte array length
to_pad = calc_pad_bits(blake3_byte_arr,24)
# get new byte array with correct padding for alignment
aligned_arr = pad_byte_array(blake3_byte_arr, to_pad, 0)
# get url safe b64 digest
b64_digest = byte_array_to_urlsafe_base64(aligned_arr)
# get final said, correcting for shift bits and applying derivation code
said = substitute_derivation_code(b64_digest, 'E', to_pad)
print(said)
print(said == CORRECT_SAID)
Description
According to the spec this dictionary should return this said: "EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk"
When tested it returns: "EdI6U1nPCJ6Rpwl6dczeo81T37TuJz8bR6OfyGsDaUpA"
core.ts >> qb64b I think the function returning the buffer before the slice. return toBytes(code + encodeBase64Url(Buffer.from(combined)).slice(ps)) change to: toBytes(code + encodeBase64Url(Buffer.from(combined).slice(ps)))