kentbull / saidify

A library generating self-addressing identifiers (SAIDs).
https://www.npmjs.com/package/saidify
Apache License 2.0
1 stars 1 forks source link

not getting said expected. #11

Open dane-git opened 3 weeks ago

dane-git commented 3 weeks ago

Description

According to the spec this dictionary should return this said: "EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk"

// said: "EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk",
// said: "############################################",
_dict = {   
      said: "",
      first: "Sue",
      last: "Smith",
      role: "Founder"
    }

When tested it returns: "EdI6U1nPCJ6Rpwl6dczeo81T37TuJz8bR6OfyGsDaUpA"

core.ts >> qb64b I think the function returning the buffer before the slice. return toBytes(code + encodeBase64Url(Buffer.from(combined)).slice(ps)) change to: toBytes(code + encodeBase64Url(Buffer.from(combined).slice(ps)))

kentbull commented 3 weeks ago

I calculate the Blake3-256 SAID to be 'EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ' which matches the output of KERIpy.

Where did you get the EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk value?

And what is the algorithm you are using? Blake3-256, SHA3-256, or something else?

The change you suggested doesn't make sense because you just omitted the return statement.

dane-git commented 3 weeks ago

sorry. yes the change needed may be: return toBytes(code + encodeBase64Url(Buffer.from(combined).slice(ps)))

Expected said from cesr-specification: https://trustoverip.github.io/tswg-cesr-specification/#self-addressing-identifier-said

image

getting that when running this in python. not using any keripy code, just trying to understand so if incorrect... oops.

import json
import blake3
import base64

def dict_to_said_str(data_dict):
    # Convert the dictionary to a JSON string without extra spaces after commas and colons
    json_str = json.dumps(data_dict, separators=(',', ':'))
    return json_str

def get_blake3_256_digest(data_dict):
    # Convert the dictionary to a JSON string and then to bytes
#     json_str = json.dumps(data_dict)#, sort_keys=True)  # Sort keys for consistent hashing. not in keri. is insert order
    json_str = dict_to_said_str(data_dict)
    data_bytes = json_str.encode('utf-8')

    # Compute the BLAKE3-256 digest
    hash_obj = blake3.blake3(data_bytes)
    digest = hash_obj.hexdigest()

    return digest

def hex_to_base64(hex_str, is_keri=False, prefix_code='E'):
    # Decode the hex string to bytes
    byte_data = bytes.fromhex(hex_str)

    # Encode the bytes to base64
    base64_str = base64.b64encode(byte_data).decode('utf-8')
    if is_keri:
      base64_str = base64_str.replace('+','-').replace('/','_').replace('=', '')
      base64_str = prefix_code +base64_str

    return base64_str
# Example usage
data = {
      'd': '############################################',
      'attr1': 'value1',
      'attr2': 'value2',
      'attr3': 'value3',
    }
# data = {
#       'd': '',
#       'attr1': 'value1',
#       'attr2': 'value2',
#       'attr3': 'value3',
#     }
f = {
    "said": "############################################",
    "first": "Sue",
    "last": "Smith",
    "role": "Founder"
}
# digest = get_blake3_256_digest(my_dict)
print(len(f['said']))
digest = get_blake3_256_digest(f)
print(f"BLAKE3-256 Digest: {digest}")
print(hex_to_base64(digest, True))
my_said = hex_to_base64(digest, True)
spec_said = "EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk"
print(my_said)
print(spec_said)
print(my_said == spec_said)
kentbull commented 2 weeks ago

First

First, the SAID EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk in the spec is incorrect as it is missing the step of "fully qualifying" the raw bytes of the digest, meaning pre-pending the correct number of zero bytes prior to performing the Base64URLSafe encoding, and replacing the zero bytes with the correct derivation code from the master code table. This performs what is called aligning the encoded data on 24 bit boundaries. We need to submit a PR to the spec to correct the digest to be 'EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ'.

Second

Second, your code is missing same thing, aligning the bytes of your raw digest bytes on 24 bit boundaries prior to performing a Base64URLSafe encoding. This alignment on 24-bit boundaries with prepended zero bytes is a CESR-specific encoding feature. I will walk you through the math to get to 24-bit boundaries.

Third

Third, there is no need to convert the digest to and from hex. That code is entirely unnecessary and you can remove it.

Explanation

Now for the explanation of the root cause of your incorrect digest. This is a long reply because it's been too long since I read the spec and I needed a refresher, so you get to be the beneficiary of my need to re-read. This comment is the start of a blog post to explain concepts related to CESR-encoded SAIDs.

SAID digest must be CESR encoded

The root of of the difference in your digest from a valid digest comes from something you may have missed while reading the spec, the deeper meaning of the following language in section section 12.6.1:

... Encode the computed digest with CESR [CESR] to create the final derived and encoded SAID of the same total length as the dummy string and the copied embedded SAID. ...

When the spec says to "Encode the computed digest with CESR" then it means that in order to get a CESR-spec compliant digest that you must encode the raw bytes of your digest in the same way that CESR does. CESR uses 24-bit alignment for all of its encodings.

Note: in this answer when I say Base64 I always mean the Base64URLSafe version of that encoding.

Why CESR encoded? What does this mean?

What doest that mean and why is it used? When the spec says "create the final derived and encoded" SAID then what do derived and encoded mean?

Encoded

Encoded means that the correct number of pad bytes have been prepended to the raw bytes of the cryptographic primitive prior to encoding as Base64URLSafe characters. These pad bytes result in one or more "A" pad characters representing zero bits being prepended to the front (left of) the resulting Base64URLSafe encoding of the combination of the pad bytes and the raw cryptographic primitive.

This creates a non-final string of encoded Base64URLSafe characters that looks like this:

AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ

Note the "A" character at the beginning of the string.

Derived

Derived means that the pad "A" characters are replaced with the correct Base64URLSafe derivation code string characters from the appropriate entry in the master code table.

This results in a final string of encoded, derived characters that looks like this:

AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
↓
EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ

Note the "A" was replaced with the "E" entry as shown in the following entry from the master code table.

Code Description Code Length Count Length Total Length
E Blake3-256 Digest 1   44

"ENCODED" - Prepend raw bytes with zero bytes - "A" characters

What it means is that when you get the raw bytes of your digest from your selected hashing library, in this case Blake3 256, then you must prepend blank (zero) bytes on the front of that raw output in order to align the resulting size of the prepended byte array on a 24 bit boundary. The "A" character represents these zero bytes in the Base64 URLSafe encoding as the number zero corresponds to the first character "A" in the zero-indexed Base64URLSafe character array.

24 bit boundary - for composability

Why a 24 bit boundary? This is the "composability" property of CESR, Composable Event Streaming Representation. CESR's support for simple "round-trippable, lossless conversion between Text and Binary domain representations" requires alignment on 24 bit boundaries, which is really means that there is no overlap in Base64URLSafe characters for two primitives that are next to each other in a stream.

En-masse conversions between text (T) and binary (B) -- (T -> B or B -> T)

Aligning on 24 bit boundaries allows incremental and efficient convertibility and pipelining for parsers processing the raw bytes of cryptographic primitives. This alignment allows all at once (en-masse) conversion of a CESR stream back and forth between the Base64 text domain and the binary domain. It also enables pipelining (spread out to multiple CPUs) the load of parsing CESR streams and simpler parser implementations because the parser can easily count the bytes of a padded primitive and strip it out of the stream incrementally. Without clean 24 bit boundaries then the stream parser would have to wait until the entire stream is received in order to start parsing and detect boundaries between primitives which means writing an incremental parser would be more complicated and less performant. Separating primitives with 24 bit boundaries ensures there are no overlapping bytes because there are no primitives with overlapping bytes due to always aligning on 24 bit boundaries.

Using zero-padded 24 bit boundaries provides a clean, complete separation between Base64-encoded cryptographic primitives. By clean I mean that no Base64 character encodes data for more than one underlying cryptographic primitive, like a digest or other attachment (primitive).

If you have primitives that break the 24 bit boundary and overlap Base64 bytes then there's no clean way to convert them en-masse without parsing the entire stream because you don't know where one cryptographic primitive begins and the next one ends without parsing the entire stream as Base64, converting it to raw bytes, and then counting bytes per primitive prior to instantiating classes from raw primitives. Sharing Base64 characters would require a complete parsing of the byte stream in order to process any primitive, which is not efficient and is not easily or simply round-trippable.

24 is least common multiple of 6 (Base64 char bits) and 8 (byte bits)

24 is the number of bits to use as the bit boundary for alignment because 24 is the least common multiple of 6 and 8. Why 6 and 8? 6 is the number of bits that can be encoded in a Base64 character. 8 is the number of bits that can be stored in one byte. One of the design goals of CESR was to avoid sharing bits between Base64 characters so that you could easily pipeline the TLV (type length, value) encoded CESR primitives by just sniffing the front of the raw bytes of the primitive which is where the count code is ("E" in the case of the Blake3-256 digest in your code).

Padding with pre-padded, empty zero bytes

In order to avoid sharing bits between Base64URLSafe characters then you must do something that enables you to neatly slice up an 8-bit bounded byte stream into a 6-bit bounded byte stream. Pre-padding with empty zero bytes is the way CESR accomplishes this, and thus the way Saidify accomplishes this in the qb64b function, which is really just a reinterpretation of the Matter._infil function from KERIpy.

The digest

For the provided example data:

f = {
"said": "############################################",
"first": "Sue",
"last": "Smith",
"role": "Founder"
}

the unencoded, non-padded Blake3-256 raw digest is nKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk which is 32 bytes and 43 Base64 characters long (42 + 1 for the remainder).

The math

Here's how the math works: Since the kind of Blake3-256 digest we are using for SAIDs is 32 bytes long then it would take at least 43 Base64 characters to encode all 32 bytes. This is calculated as follows:

32 bytes * 8 bits per byte = 256 bits 256 bits / 6 bits per Base64 character = 42.667 Round 42.667 up to 43 since you can't have a fractional part of a Base64 character, you can only have whole Base64 charcters.

The plain digest is not self-framing so we need to add a self-framing code from the CESR master code table. For the Blake3-256 digest the Total Length in Base64URLSafe characters we see the entry with the "E" derivation code:

Code Description Code Length Count Length Total Length
E Blake3-256 Digest 1   44

As we see we have 43 Base64 characters, not 44, so where does this 44th character come from?

Pad size and pad bytes
The 44th character, prepended onto the front of the Base64 encoded digest, comes from the pad size indicated by the self-framing derivation code that is prefixed on the front (left side) of the primitive. For example, in the following digest the 'E' character is a Base64 character that is prefixed on the rest of the primitive: derivation code (1) primitive (43) combined primitive (44)
E LLbizIr2FJLHexNkiLZpsTWfhwUmZUicuhmoZ9049Hz ELLbizIr2FJLHexNkiLZpsTWfhwUmZUicuhmoZ9049Hz

In order to be composable and self-framing then the raw primitive must align on a 24-bit boundary. 256 does not align on a 24 bit boundary as it is not a multiple of 24. We have to add pad bytes (zero bytes) until we get to a multiple of 24 to reach this alignment. Creating this alignment typically occurs with some sort of pad character which in most cases is a zero byte of some sort.

Comparison to non-CESR, naive Base64 encodings (non URL-safe variant)

In a non-CESR compliant, naive Base64 encoding (not URL-safe, not composable) then equals signs "=" representing zero bytes are used to pad the end of a naive Base64 encoded value to ensure alignment on 24-bit boundaries. If you have used regular (naive) Base64 encoding then you have seen the equals signs at the end of naive Base64 encoded values. In CESR, rather than using appended equals signs (zero bytes) then it uses prepended 'A' characters (zero bytes) that are concatenated to the raw bytes of the underlying cryptographic primitive.

Primitive lengths multiples of 3 bytes or 4 Base64 chars

Back to our calculation, as the spec says,

in order to cleanly capture integer multiples of twenty-four bits of information, Primitive lengths MUST be integer multiples of either four Base64 text characters or three binary bytes.

Calculation of padded primitive

Let's start from raw digest bytes.

So to make a Blake3-256 digest that is 32 bytes long a multiple of 24 bits then we need to add one more byte. 264 is the multiple of 24 that is closest to 256.

264 bits / 8 bits per byte = 33 bytes

264 bits - 256 bits = 8 bits or 1 byte

32 bytes for the digest + 1 pad byte = 33 bytes.

So 33 bytes is the byte-representation of the closest multiple of 24 bits.

Pad the raw digest bytes

Take our 32 byte Blake3-256 digest and add one zero byte on the front of it to get 33 bytes. In KERIpy parlance this the start of what is called a fully qualified cryptographic primitive. Prepending the correct number of zero bytes is the first of two steps to "fully qualify" the raw byte array of the cryptographic primitive. It is understood that when converting the cryptographic primitive from the fully qualified, encoded format back to the raw, in-memory format that the prepended zero bytes will be stripped.

Primitive Raw Bytes (unencoded)
       |r32:r31:r30:r29:28:r27:r26:r25:r24:r23:r22:r21:r20:r19:r18:r17:r16:r15:r14:r13:r12:r11:r10:r9:r8:r7:r6:r5:r4:r3:r2:r1:r0|

Pad Byte of zeroes (unencoded)
|p33|

Padded Primitive - Raw Bytes
|p33 +  r32:r31:r30:r29:28:r27:r26:r25:r24:r23:r22:r21:r20:r19:r18:r17:r16:r15:r14:r13:r12:r11:r10:r9:r8:r7:r6:r5:r4:r3:r2:r1:r0|

Once you have the padded, raw bytes then you encode them as a Base64URLSafe string.

Encode as Base64URLSafe

The encoding of the raw, unpadded bytes of the Blake3-256 digest from above looks like the following. It is 43 Base64URLSafe characters, representing 256 bits (32 bytes), long:

nKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk

Yet this is incorrect because it does not include the pad byte required to align on a 24 bit boundary.

Adding the pad byte, bringing us to a multiple of 24, up to 264, gives us a different encoded result that is 44 Base64 characters long:

AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ

You can see this adds the pad character 'A', since they all represent zero, for the first six bits, in the 33rd byte.

The below representation text visualizes (one indexed, not zero indexed) both the raw bytes on the top as well as the encoded Base64 bytes on the bottom where P33 is the left-most padded zero byte and R32 is the leftmost byte of the raw primitive that includes two zero bits from the single pad byte.

Illustration of Base64URLSafe, left-padded encoding

P33 = leftmost pad byte of zero bits R32 = leftmost raw primitive byte R31 = second to last left raw primitive byte EP44 = leftmost encoded Base64 pad character ('A') EP43 = second encoded Base64 character, some pad bytes (zeroes) from P33, some raw primitive bytes from R32 ER42 = first encoded Base64 character that is all raw primitive bytes ER42 = second encoded Base64 character that is all raw primitive bytes

byte index   -> |           P33         |           R32         |           R31         | ...
bit index    -> |p7:p6:p5:p4:p3:p2:p1:p0|r7:r6:r5:r4:r3:r2:r1:r0|r7:r6:r5:r4:r3:r2:r1:r0| ...
bit label    -> | ------ PAD BITS ----- | ------- RAW PRIMITIVE BITS OF DIGEST -------- | ...
Raw Bits     -> | 0: 0: 0: 0: 0: 0: 0: 0| 1: 0: 0: 1: 1: 1: 0: 0| 1: 0: 1: 0: 0: 1: 1: 0| ...
Base64 Index -> |       EP44      |       EP43      |       ER42      |       ER41      | ...
Base64 Char  -> |       A         |       J         |       y         |       m         | ...
Replace the pad characters "A" with the derivation code

To complete the full qualification of the cryptographic primitive then you need to replace the "A" Base64URLSafe characters (padded zero bytes) with the correct characters corresponding to the derivation code from the master code table for your cryptographic primitive. The way the derivation codes were designed is so that when they replace the "A" characters that they take up the exact same count of characters and bytes to preserve the 24-bit boundary alignment. For the Blake3-256 digest this is as follows:

Code Description Code Length Count Length Total Length
E Blake3-256 Digest 1   44

Where does the 44 come from in the total length? The total length here is not bytes, it is Base64URLSafe characters. To calculate the number of bytes needed to hold a 44 Base64 character primitive then you must convert the number of Base64 encoded bits to bytes. Since each Base64 character encodes 6 bits of information then you multiply 44 * 6.

44 Base64 characters (Total Length) * 6 bits per Base64 character = 264 bits

264 is a multiple of 24: 11 * 24 = 264

264 is the closest multiple of 24 to 256 and 256 is the number of bits in our 32 byte digest (32 * 8 = 256).

To get the number of bytes we need then we take 264 bits and divide by 8 bits per byte, giving us 33 bytes. We only need to add one zero byte to the Base64URLSafe encoding of the raw bytes (32) to get to a multiple of 24 bits (33 bytes / 264 bits).

256 bits (32 bytes) + 8 zero pad bits (1 byte) = 264 bits (33 bytes)

Sections 9.3 and 9.3.1 of the spec detail converting between 6 and 8.

Now that you have the padded cryptographic primitive then you convert it to the Base64URLSafe encoding as called out in the spec. This is the first step to creating the fully qualified cryptographic primitive.

We take the Base64 encoding of the raw, padded bytes:

AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ

replace the 'A' with the 'E' derivation code, and arrive at the fully qualified cryptographic primitive, completely CESR encoded:

EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ

Example walkthrough

The code you started from

data = {
    "said": "############################################",
    "first": "Sue",
    "last": "Smith",
    "role": "Founder"
}
digest = make_blake3_raw_digest(data)         # raw bytes of  `nKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk`
aligned = align_on_24_bits(digest)            # raw bytes of `AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ`
encoded = substitute_derivation_code(aligned) # returns `EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ

I added the fictional functions make_blake3_raw_digest, align_on_24_bits, and substitute_derivation_code to illustrate the process. You can find deriveBlake3_256 and qb64b that accomplish these needs.

Summary

So, while you did the initial Blake3-256 digest step correctly you missed the two steps of aligning on a 24 bit boundary and substituting the zero-bit 'A' pad characters with the correct derivation code from the CESR master code table.

I know this is a long read yet it is comprehensive. Let me know if you need further clarification. I will close this issue in a few days since I have exhaustively answered it.

dane-git commented 2 weeks ago

Wow! Fantastic. Thanks Kent!

Indeed I did not fully understand or implement the 24 bit alignment part. Was just trying to get there a bit too fast referencing the SAID in the spec, for the part I thought I understood.

I was just using the to hex part as a lazy way to get to the bytes of the digest. In this case, it seems to work to simply add the missing '0' bits (0x00) to the hex digest as a couple of hex zeros to get the missing 8 bits and satisfy the alignment, but that would only work if the deviation was some multiple of 4.

for my own understanding I have re-written the code to hopefully get the correct SAID. I realize below may not be the correct or efficient way to do it, just trying to get to a point of actual understanding. Thanks for your help with this!

import json
import blake3
import base64

CORRECT_SAID = 'EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ' # per Kent

f = {
    "said": "############################################",
    "first": "Sue",
    "last": "Smith",
    "role": "Founder"
}

def blake3_to_byte_array(data):
    # compute the BLAKE3 hash and get the bytes (which is a byte array)
    hasher = blake3.blake3()
    hasher.update(data)
    byte_array = hasher.digest()
    return byte_array

def dict_to_said_str(data_dict):
    # Convert the dictionary to a JSON string without extra spaces after commas and colons
    json_str = json.dumps(data_dict, separators=(',', ':'))
    return json_str

def blake3_256_from_dict(dictionary):
    # get blake3_256 byte array from dictionary
    ## dictionary -> compliant string -> encoded bytes -> blake3  -> byte array
    return blake3_to_byte_array(dict_to_said_str(dictionary).encode('utf-8'))

def nearest_higher_multiple(x, n):
    # Calculate the nearest higher multiple of n greater than or equal to x
    if x % n == 0:
        return x
    else:
        return ((x // n) + 1) * n

def calc_pad_bits(byte_array, alignment = 24):
    # get bits in byte_arr
    bit_num = len(byte_array) * 8
    # get the pad bits needed based on num of bits in byte array
    nhm = nearest_higher_multiple(bit_num, alignment)
    return nhm - bit_num

def pad_byte_array(byte_array, num_bits, bit_value=0):
    # ensure bit_value is either 0 or 1
    if bit_value not in (0, 1):
        raise ValueError("bit_value must be 0 or 1")

    # calculate full bytes and remaining bits to pad
    full_bytes, remaining_bits = divmod(num_bits, 8)

    # create the full byte padding
    pad_byte = 0xFF if bit_value == 1 else 0x00
    padding = bytes([pad_byte]) * full_bytes

    # if there are remaining bits to pad
    if remaining_bits > 0:
        # make a partial byte for the remaining bits
        partial_byte = (0xFF >> (8 - remaining_bits)) if bit_value == 1 else 0x00
        partial_byte <<= (8 - remaining_bits)
        padding += bytes([partial_byte])

    # concatenate the padding and the original byte array
    return padding + byte_array

def byte_array_to_urlsafe_base64(byte_array):
    # encode the byte array to a Base64 string
    base64_str = base64.urlsafe_b64encode(byte_array).decode('utf-8')
    return base64_str

def substitute_derivation_code(aligned, deriv_code, padded):
  # each base64 is 6 bits. 
  # calc prepended base64 chars and remove, then add derivation code.
  return deriv_code + aligned[padded//6:]

# get blake3 byte arr from dictionary per keri specs
blake3_byte_arr = blake3_256_from_dict(f)

# get pad bits needed based on  byte array length
to_pad = calc_pad_bits(blake3_byte_arr,24)

# get new byte array with correct padding for alignment
aligned_arr = pad_byte_array(blake3_byte_arr, to_pad, 0)

# get url safe b64 digest
b64_digest = byte_array_to_urlsafe_base64(aligned_arr)

# get final said, correcting for shift bits and applying derivation code
said = substitute_derivation_code(b64_digest, 'E', to_pad)
print(said)
print(said == CORRECT_SAID)