Open CMCDragonkai opened 3 years ago
I don't believe we have such a base encoding in multibase, see https://github.com/multiformats/multibase for the current list. I can imagine a use for this (and can imagine how Dean is using it too!) so it might be worth getting an entry in to the multibase table (the linked repo, see multibase.csv). It might take a bit of discussion with others over there as we haven't added a new one in a while and it doesn't get that much attention. It would also be good if you had some working code that worked as a js-multiformats multibase codec (maybe in this repo, although it could also just be an external package that we link in the README here like we do with codecs and hashers). Unfortunately Dean's code is using Buffer
which we try and avoid around here, but it may be pretty straightforward to either reuse the base64 encoder here and do the mapping as a second step, or insert an adjusted dictionary to coding step(s). I'm not sure but there's some test fixtures to work with at least.
This article ~https://github.com/deanlandolt/base64-lex~ https://www.codeproject.com/Articles/5165340/Sortable-Base64-Encoding explains how to make base64 preserve lexicographic order. From the description it seems base58btc is already in lexicographic order. However I'm not sure what the deal is with padding and endianness.
Sorry I meant this article: https://www.codeproject.com/Articles/5165340/Sortable-Base64-Encoding
Looking at bit deeper, maybe we're already covered? Have a look at the dictionaries used for the bases in this repo and see which are properly ordered. https://github.com/multiformats/js-multiformats/blob/master/src/bases/base58.js suggests that base58btc
(but not base58flickr
) might preserve lexicographic order. It doesn't look like any of the base64's are but the base32's seem to be. I'm not sure about the cross-byte interactions, however, maybe that rules out 58 and you'd have to go with 32?
Right, so to expand on the above point, I think with base58btc you're slicing up the bits across byte boundaries. Padding might be a problem, but I think the bigger problem is the loss of whole bytes being properly represented.
tbh, I don't even know if that same problem persists for base32 variants; you're going to have to do some experimentation I think
Can you expand on what you mean by this?
I think with base58btc you're slicing up the bits across byte boundaries. Padding might be a problem, but I think the bigger problem is the loss of whole bytes being properly represented.
I imagine hex encoding should preserve lexicographic order. But I guess this also depends on what is lexicographic order? My understanding is that the order of a string should be based on the comparison on each individual byte in the string. Since bytes are just 8 bit numbers, then if you have [0, 255]
that should be less than [1, 255]
.
In JS, it says string sort is based on utf-16 code units. I'm not sure if that is equivalent to my understanding.
Yeah, actually I'm not at all sure how to make this work properly, with all of the bases (that are not base-256) you're dividing up bytes into pieces and having to span across bytes as you slice. Even for base64 you're slicing the bytes into 6-bit pieces, so you get the first 6 bits of a byte, then the remaining 2 and 4 of the next, then the remaining 4 of the next and 2 at the start of the next, then the remaining 6, etc. etc. So how do you retain proper sorting when you're splitting up bytes like that? Maybe I'm not fully grokking what base64-lex is doing but I find it difficult to imagine. Base-16/hex is splitting bytes in half and presenting 2 characters for each byte, I guess that might retain sorting if you did it right. Then it all comes down to endianness which is just a mess! This is definitely something worth experimenting with, make random strings, encode them and see if they retain the sorting order you want.
I did tests on 100,000 Ids generated by base58btc, they were all correctly sorted.
Whereas less than 100 ids generated with base64 and I got a sort order error. So I'm guessing base58btc is suffiicent.
Did my tests here: https://github.com/MatrixAI/js-id/pull/8
Well that's fascinating! What can we say about the limitations of these tests? Am I correct that these IDs you're testing are quite short, is this likely to scale to arbitrary lengths? This might be interesting for others who come looking for something similar.
Does it work in the same principle that hex encoding works. For hex,
every byte can be presented by 2 hex digits. So 0 - F
is half of a
byte, then 0 - F
is the next half.
Hex encoding's bit-order also make sense. So for FF
, the left most hex
is the most significant 4 bits, and the right hex is the least
significant 4 bits. Thus the ordering is correct for 01
all the way to
FF
.
Hex numbering is also 0-1 then A-F, and that is also correct in terms of lexicographic order.
For a string of multiple bytes, each byte gets encoded into 2 hex digits
(if padded), or just 1 or 2 hex digits. These should be equivalent from
a sorting perspective. Anyway for [123, 124]
that becomes 7b7c
and
[124, 125]
is 7c7d
. Imagine we wanted to compare each other for
sorting. Assuming lexicographic sort is based on each character (which
in this case it is not precisely the case, because JS uses utf-16), then
we are comparing each character by character.
7b7c
7c7d
The 7b7c
would be considered to come before 7c7d
, simply at index 1:
b < c
.
In base58btc the exact relationship between bytes to digits is not as exact as hex.
On 22/10/2021 1:20 pm, Rod Vagg wrote:
Well that's fascinating! What can we say about the limitations of these tests? Am I correct that these IDs you're testing are quite short, is this likely to scale to arbitrary lengths? This might be interesting for others who come looking for something similar.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/multiformats/js-multiformats/issues/124#issuecomment-949230824, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE4OHM4PJ44Q2KP7OJW4ITUIDC6HANCNFSM5GNCQ6OA.
I realise that I don't understand base58btc internals, I can do a test with 100,000 random strings, and sort them, then convert them to base58btc and then sort it again and compare for a sanity check.
Ok my tests show that base58btc
doesn't preserve lexicographic order. It did in my js-id
tests, but that's probably due to the timestamp portion at the beginning of the data.
import type { Codec } from 'multiformats/bases/base';
import crypto from 'crypto';
import { bases } from 'multiformats/basics';
function randomBytes(size: number): Uint8Array {
return crypto.randomBytes(size);
}
type MultibaseFormats = keyof typeof bases;
const basesByPrefix: Record<string, Codec<string, string>> = {};
for (const k in bases) {
const codec = bases[k];
basesByPrefix[codec.prefix] = codec;
}
function toMultibase(id: Uint8Array, format: MultibaseFormats): string {
const codec = bases[format];
return codec.encode(id);
}
function fromMultibase(idString: string): Uint8Array | undefined {
const prefix = idString[0];
const codec = basesByPrefix[prefix];
if (codec == null) {
return;
}
const buffer = codec.decode(idString);
return buffer;
}
const originalList: Array<Uint8Array> = [];
const total = 100000;
let count = total;
while (count) {
originalList.push(randomBytes(36));
count--;
}
originalList.sort(Buffer.compare);
const encodedList = originalList.map(
(bs) => toMultibase(bs, 'base58btc')
);
const encodedList_ = encodedList.slice();
encodedList_.sort();
// encodedList is the same order as originalList
// if base58btc preserves lexicographic-order
// then encodedList_ would be the same order
for (let i = 0; i < total; i++) {
if (encodedList[i] !== encodedList_[i]) {
console.log('Does not match on:', i);
console.log('original order', encodedList[i]);
console.log('encoded order', encodedList_[i]);
break;
}
}
const decodedList = encodedList.map(fromMultibase);
for (let i = 0; i < total; i++) {
// @ts-ignore
if (!originalList[i].equals(Buffer.from(decodedList[i]))) {
console.log('bug in the code');
break;
}
}
Changing it to base16
does preserve order.
Looking at the errors and trialing them out:
base2 - no error
base8 - no error
base16 - no error
base16upper - no error
base32hex - no error
base32hexupper - no error
base32hexpad - no error
base32hexpadupper - no error
base10 - error
base32 - error
base32upper - error
base32pad - error
base32padupper - error
base32z - error
base36 - error
base58btc - error
base58flickr - error
base64 - error
According to rfc4648, the base32hex
does preserve sort order. https://datatracker.ietf.org/doc/html/rfc4648#section-7
So there you have it, the conclusion. Pretty much
JS's binary string form also preserves sort order #122.
I suggest that we should add in a base64 variant that preserves order since base32 is quite a low base. And probably change over identity encoding to binary string form mentioned in #122.
I've come across a situation requiring the ability to base-encode but preserve the lexicographic-order of the input bytes.
That is the order of the base encoding alphabet should be in the same order as those bytes that are mapped to the base encoding.
An example of this library is https://github.com/deanlandolt/base64-lex which is a base64 that preserves lexicographic order.
Which base encodings in multibase support preservation of lexicographic order?