multiformats / multibase

Self identifying base encodings

Consider encoding: WordBase-2048 #89

Open will-richards-ii opened 2 years ago

will-richards-ii commented 2 years ago

Consider use case and factors:

Legal documents, such as those for a non-profit, corporation, or legally recognized DAO: the author or a script wishes to reference a CID, DID, smart contract, key, or other identifier. I think a QR code would be preferable. However, not all jurisdictions or filing processes support this. Documents might be printed, photocopied on an old copier, and then rescanned. OCR makes this easier, except when the document is hard for a machine to read (smudged, blurred, or faded) or when checking by hand is preferred. Humans can read words more easily, and reading words provides an organic type of error correction. Words can also easily be read aloud and entered into another interface by voice recognition.

Proposal:

I suggest the word lists from BIP-39 be used to create a base-2048 encoding in several languages, starting with English. Perhaps a special indicator word/phrase could be used throughout multiformats to identify the encoding, or the standard could rely entirely on the 2048 words. This would work similarly to a seed phrase. Perhaps in the future seed phrases could even carry a multiformat prefix self-describing their parameters.
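A minimal sketch of what such a base-2048 mapping could look like, assuming each word carries 11 bits (2^11 = 2048) and that the final group is zero-padded on the right. The proposal does not specify either detail. `WORDS` is a hypothetical stand-in list of synthetic tokens, not the BIP-39 list, and the function names are illustrative only.

```python
from typing import List, Sequence

# Hypothetical stand-in for a 2048-entry word list such as the BIP-39
# English list; synthetic tokens keep the sketch self-contained.
WORDS: List[str] = [f"w{i:04d}" for i in range(2048)]

BITS_PER_WORD = 11  # 2**11 == 2048, so one word carries 11 bits


def bytes_to_indices(data: bytes) -> List[int]:
    """Split data into 11-bit groups (assumption: zero-pad the last group)."""
    bits = "".join(f"{b:08b}" for b in data)
    pad = (-len(bits)) % BITS_PER_WORD
    bits += "0" * pad
    return [
        int(bits[i:i + BITS_PER_WORD], 2)
        for i in range(0, len(bits), BITS_PER_WORD)
    ]


def encode(data: bytes, words: Sequence[str] = WORDS) -> str:
    """Map each 11-bit group to a word and join the words with spaces."""
    return " ".join(words[i] for i in bytes_to_indices(data))
```

With this sketch, two bytes (16 bits) need two words, and the second word carries 6 padding bits. How a decoder should recover the original length is left open here.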

References:

https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md

Out-of-scope:

Machine error correction by additional word guessing and correction coding/checking might also be added to this but is outside the scope of this issue/feature. Seed phrase parameters would have to be another standard/proposal if enough people find this useful.

rvagg commented 2 years ago

@will-richards-ii this is the kind of thing you should work on implementing yourself and seeing if you have utility for it before pushing it further. It's quite a departure from the standard multibase format, but not entirely out of bounds. I'd be interested in seeing an implementation, but as something that's just an idea we can't really act on it here. Some multibase implementations are intentionally flexible and open for you to create your own.

The JavaScript implementation at multiformats/js-multiformats, for instance, is intended to let you bring your own multibase. See https://github.com/multiformats/js-multiformats/blob/master/src/bases/identity.js for a simple example of making your own (we don't yet export base.js, but could do, although it's only a utility file which you could just reimplement for something as custom as this) - you just need a from-binary and a to-binary, you could then even provide your implementation to a cid.toString(base) to see a CID in that multibase.

It's really not very appropriate to open so many issues across the ecosystem to just draw attention to this issue alone, it's quite spammy.

will-richards-ii commented 2 years ago

Sorry for reaching out across the ecosystem so much. Didn't notice the single maintainer. Apologies. I'm willing to post implementations in multiple programming languages.

ShadowJonathan commented 1 year ago

I may have found a potential use case for this in IPFS Desktop: letting users autocomplete words, or making it easier to convey them over voice channels and the like.

https://github.com/ipfs/ipfs-desktop/issues/1278


@will-richards-ii I'm only glancing at the wordlist and deriving this assumption from the encoding name, but how does wb2048 deal with a final chunk of fewer than 11 bits (each of the 2048 words carries 11 bits)? Does it use a padding marker, or something else?

If there is no such marker, may I suggest extending the wordlist with words that can encode the remaining bits (or use existing words for that), and then a word that contains the to-be-truncated bits?

(i.e. <data word>-<data word>-<marker to truncate next word by X bits>-<data word (interpreted as number, "truncate by X bits")>-<data word (is truncated)>, with the special truncate word being "scissor", or "cut", or something, that marks to truncate the next one)
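To make the question concrete: if an encoder simply zero-pads the last 11-bit group, three bytes and the same three bytes followed by a zero byte both fill 33 bits and produce the same three words, so the length is ambiguous. The following is a rough sketch of the marker idea described above, not an agreed format. `WORDS` is a hypothetical stand-in list, `MARKER` ("cut") is a hypothetical marker word outside that list, and the word after the marker is read as the number of bits to drop from the final word.

```python
from typing import List

BITS = 11  # bits carried by one of the 2048 words
WORDS = [f"w{i:04d}" for i in range(2048)]  # hypothetical stand-in list
INDEX = {w: i for i, w in enumerate(WORDS)}
MARKER = "cut"  # hypothetical marker word; must not appear in WORDS


def encode(data: bytes) -> List[str]:
    """Encode bytes as words; flag a padded final word with MARKER + count."""
    bits = "".join(f"{b:08b}" for b in data)
    pad = (-len(bits)) % BITS
    bits += "0" * pad
    out = [WORDS[int(bits[i:i + BITS], 2)] for i in range(0, len(bits), BITS)]
    if pad:
        # <marker> <word whose index is the pad count> <padded final word>
        out[-1:] = [MARKER, WORDS[pad], out[-1]]
    return out


def decode(tokens: List[str]) -> bytes:
    """Reverse encode(), dropping the padding bits announced by MARKER."""
    bits = ""
    i = 0
    while i < len(tokens):
        if tokens[i] == MARKER:
            if i + 3 != len(tokens):
                raise ValueError("marker must introduce the final word")
            pad = INDEX[tokens[i + 1]]
            if not 0 < pad < BITS:
                raise ValueError("invalid truncation count")
            last = f"{INDEX[tokens[i + 2]]:011b}"
            bits += last[:BITS - pad]  # keep only the real data bits
            i += 3
        else:
            bits += f"{INDEX[tokens[i]]:011b}"
            i += 1
    if len(bits) % 8:
        raise ValueError("decoded bit length is not a whole number of bytes")
    return bytes(int(bits[j:j + 8], 2) for j in range(0, len(bits), 8))
```

The cost of this design is two extra words whenever the input is not a multiple of 11 bits. Other options, such as a length prefix, would trade that overhead differently.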

Could you please demonstrate what a Qm/ba hash would look like in wb2048?

Also, by your expertise or insight, what would you suggest as a first-letter-marker for this encoding? Keep in mind that this needs to be consistent across languages, and not already present in the current encoding table. Personally I'd suggest a symbol (!, @, +, etc.), as that's easier to port and recognise across languages.

About parsing tolerance; Do you think the following regex would be a good way to delineate between words?

([ \n\-_/\\])+

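Match a run of one or more separator characters: space, `_`, `\`, `/`, `-`, or newline. Note that, as written, the run may mix different separators; requiring the same character to repeat would need a backreference.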
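A small sketch of how that separator pattern might be used for tolerant tokenizing. It uses the same character class but without the capturing group, because Python's `re.split` would otherwise return the captured separators alongside the words. The function name `split_words` is illustrative only.

```python
import re
from typing import List

# Same character class as the regex above, minus the capturing group.
SEPARATORS = re.compile(r"[ \n\-_/\\]+")


def split_words(text: str) -> List[str]:
    """Split on runs of separators and drop empty leading/trailing pieces."""
    return [token for token in SEPARATORS.split(text) if token]
```

For example, `split_words("abandon - ability/able")` treats the mixed run `" - "` as a single delimiter.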

ShadowJonathan commented 1 year ago

Ah, I've found the encoding I was originally looking for when following up on the above issue: proquint, which has already been added to multibase.

Personally, I think proquint is an alternative to the addition proposed above and would serve the same purpose.