Which of the multibase formats preserves lexicographic order?

CMCDragonkai commented 3 years ago

I've come across a situation requiring the ability to base-encode but preserve the lexicographic-order of the input bytes.

That is the order of the base encoding alphabet should be in the same order as those bytes that are mapped to the base encoding.

An example of this library is https://github.com/deanlandolt/base64-lex which is a base64 that preserves lexicographic order.

Which base encodings in multibase support preservation of lexicographic order?

rvagg commented 3 years ago

I don't believe we have such a base encoding in multibase, see https://github.com/multiformats/multibase for the current list. I can imagine a use for this (and can imagine how Dean is using it too!) so it might be worth getting an entry in to the multibase table (the linked repo, see multibase.csv). It might take a bit of discussion with others over there as we haven't added a new one in a while and it doesn't get that much attention. It would also be good if you had some working code that worked as a js-multiformats multibase codec (maybe in this repo, although it could also just be an external package that we link in the README here like we do with codecs and hashers). Unfortunately Dean's code is using Buffer which we try and avoid around here, but it may be pretty straightforward to either reuse the base64 encoder here and do the mapping as a second step, or insert an adjusted dictionary to coding step(s). I'm not sure but there's some test fixtures to work with at least.

CMCDragonkai commented 3 years ago

This article ~https://github.com/deanlandolt/base64-lex~ https://www.codeproject.com/Articles/5165340/Sortable-Base64-Encoding explains how to make base64 preserve lexicographic order. From the description it seems base58btc is already in lexicographic order. However I'm not sure what the deal is with padding and endianness.

CMCDragonkai commented 3 years ago

Sorry I meant this article: https://www.codeproject.com/Articles/5165340/Sortable-Base64-Encoding

rvagg commented 3 years ago

Looking at bit deeper, maybe we're already covered? Have a look at the dictionaries used for the bases in this repo and see which are properly ordered. https://github.com/multiformats/js-multiformats/blob/master/src/bases/base58.js suggests that base58btc (but not base58flickr) might preserve lexicographic order. It doesn't look like any of the base64's are but the base32's seem to be. I'm not sure about the cross-byte interactions, however, maybe that rules out 58 and you'd have to go with 32?

rvagg commented 3 years ago

Right, so to expand on the above point, I think with base58btc you're slicing up the bits across byte boundaries. Padding might be a problem, but I think the bigger problem is the loss of whole bytes being properly represented.

rvagg commented 3 years ago

tbh, I don't even know if that same problem persists for base32 variants; you're going to have to do some experimentation I think

CMCDragonkai commented 3 years ago

Can you expand on what you mean by this?

I think with base58btc you're slicing up the bits across byte boundaries. Padding might be a problem, but I think the bigger problem is the loss of whole bytes being properly represented.

I imagine hex encoding should preserve lexicographic order. But I guess this also depends on what is lexicographic order? My understanding is that the order of a string should be based on the comparison on each individual byte in the string. Since bytes are just 8 bit numbers, then if you have [0, 255] that should be less than [1, 255].

In JS, it says string sort is based on utf-16 code units. I'm not sure if that is equivalent to my understanding.

rvagg commented 3 years ago

Yeah, actually I'm not at all sure how to make this work properly, with all of the bases (that are not base-256) you're dividing up bytes into pieces and having to span across bytes as you slice. Even for base64 you're slicing the bytes into 6-bit pieces, so you get the first 6 bits of a byte, then the remaining 2 and 4 of the next, then the remaining 4 of the next and 2 at the start of the next, then the remaining 6, etc. etc. So how do you retain proper sorting when you're splitting up bytes like that? Maybe I'm not fully grokking what base64-lex is doing but I find it difficult to imagine. Base-16/hex is splitting bytes in half and presenting 2 characters for each byte, I guess that might retain sorting if you did it right. Then it all comes down to endianness which is just a mess! This is definitely something worth experimenting with, make random strings, encode them and see if they retain the sorting order you want.

CMCDragonkai commented 3 years ago

I did tests on 100,000 Ids generated by base58btc, they were all correctly sorted.

Whereas less than 100 ids generated with base64 and I got a sort order error. So I'm guessing base58btc is suffiicent.

Did my tests here: https://github.com/MatrixAI/js-id/pull/8

rvagg commented 3 years ago

Well that's fascinating! What can we say about the limitations of these tests? Am I correct that these IDs you're testing are quite short, is this likely to scale to arbitrary lengths? This might be interesting for others who come looking for something similar.

CMCDragonkai commented 3 years ago

Does it work in the same principle that hex encoding works. For hex, every byte can be presented by 2 hex digits. So 0 - F is half of a byte, then 0 - F is the next half.

Hex encoding's bit-order also make sense. So for FF, the left most hex is the most significant 4 bits, and the right hex is the least significant 4 bits. Thus the ordering is correct for 01 all the way to FF.

Hex numbering is also 0-1 then A-F, and that is also correct in terms of lexicographic order.

For a string of multiple bytes, each byte gets encoded into 2 hex digits (if padded), or just 1 or 2 hex digits. These should be equivalent from a sorting perspective. Anyway for [123, 124] that becomes 7b7c and [124, 125] is 7c7d. Imagine we wanted to compare each other for sorting. Assuming lexicographic sort is based on each character (which in this case it is not precisely the case, because JS uses utf-16), then we are comparing each character by character.

7b7c
7c7d

The 7b7c would be considered to come before 7c7d, simply at index 1: b < c.

In base58btc the exact relationship between bytes to digits is not as exact as hex.

On 22/10/2021 1:20 pm, Rod Vagg wrote:

Well that's fascinating! What can we say about the limitations of these tests? Am I correct that these IDs you're testing are quite short, is this likely to scale to arbitrary lengths? This might be interesting for others who come looking for something similar.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/multiformats/js-multiformats/issues/124#issuecomment-949230824, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE4OHM4PJ44Q2KP7OJW4ITUIDC6HANCNFSM5GNCQ6OA.

CMCDragonkai commented 3 years ago

I realise that I don't understand base58btc internals, I can do a test with 100,000 random strings, and sort them, then convert them to base58btc and then sort it again and compare for a sanity check.

CMCDragonkai commented 3 years ago

Ok my tests show that base58btc doesn't preserve lexicographic order. It did in my js-id tests, but that's probably due to the timestamp portion at the beginning of the data.

import type { Codec } from 'multiformats/bases/base';

import crypto from 'crypto';
import { bases } from 'multiformats/basics';

function randomBytes(size: number): Uint8Array {
  return crypto.randomBytes(size);
}

type MultibaseFormats = keyof typeof bases;

const basesByPrefix: Record<string, Codec<string, string>> = {};
for (const k in bases) {
  const codec = bases[k];
  basesByPrefix[codec.prefix] = codec;
}

function toMultibase(id: Uint8Array, format: MultibaseFormats): string {
  const codec = bases[format];
  return codec.encode(id);
}

function fromMultibase(idString: string): Uint8Array | undefined {
  const prefix = idString[0];
  const codec = basesByPrefix[prefix];
  if (codec == null) {
    return;
  }
  const buffer = codec.decode(idString);
  return buffer;
}

const originalList: Array<Uint8Array> = [];

const total = 100000;

let count = total;
while (count) {
  originalList.push(randomBytes(36));
  count--;
}

originalList.sort(Buffer.compare);
const encodedList = originalList.map(
  (bs) => toMultibase(bs, 'base58btc')
);

const encodedList_ = encodedList.slice();
encodedList_.sort();

// encodedList is the same order as originalList
// if base58btc preserves lexicographic-order
// then encodedList_ would be the same order

for (let i = 0; i < total; i++) {
  if (encodedList[i] !== encodedList_[i]) {
    console.log('Does not match on:', i);
    console.log('original order', encodedList[i]);
    console.log('encoded order', encodedList_[i]);
    break;
  }
}

const decodedList = encodedList.map(fromMultibase);
for (let i = 0; i < total; i++) {
  // @ts-ignore
  if (!originalList[i].equals(Buffer.from(decodedList[i]))) {
    console.log('bug in the code');
    break;
  }
}

Changing it to base16 does preserve order.

CMCDragonkai commented 3 years ago

Looking at the errors and trialing them out:

base2 - no error
base8 - no error
base16 - no error
base16upper - no error
base32hex - no error
base32hexupper - no error
base32hexpad - no error
base32hexpadupper  - no error

base10 - error
base32 - error
base32upper - error
base32pad - error
base32padupper - error
base32z - error
base36 - error
base58btc - error
base58flickr - error
base64 - error

According to rfc4648, the base32hex does preserve sort order. https://datatracker.ietf.org/doc/html/rfc4648#section-7

So there you have it, the conclusion. Pretty much

CMCDragonkai commented 3 years ago

JS's binary string form also preserves sort order #122.

CMCDragonkai commented 3 years ago

I suggest that we should add in a base64 variant that preserves order since base32 is quite a low base. And probably change over identity encoding to binary string form mentioned in #122.

multiformats / js-multiformats

Which of the multibase formats preserves lexicographic order? #124