QIP - MessageTransaction Encoded Message Standard

scottdonaldau commented 6 years ago

A standard message encoding format to indicate encoded data in MessageTransaction transactions on the QRL Network. Open for discussion.

0xFF0 commented 6 years ago

I do support the idea of a standard message encoding format.

However, is there a reason why the encoded message needs to start by AFAF ?

With the current specification, the number of possible type of messages is 241 (i.e: 256 - 16 (can't use the hex string A) + 1 (document notarisation)).

Maybe, the encoded message could be implemented this way?

AF XX [MSG_MAX_77_BYTES] YY

AF: indicate that it's an encoded message.
XX: Indicate a unique encoded message transaction type (0xAF would be for document notarisation).
YY: 1 byte checksum.

With this format, there could be 256 possible types of messages (instead of 241). Also, the checksum byte could provide a more robust message.

The checksum at the end of the message could be used to:

Indicate that it’s an encoded message (for validation in the code).
Applications build on top of the MessageTransaction (like coinvote) could use this checksum for a more robust analysis.

Also, by adding a checksum, the possibility of having a “bad” encoded message would be lower (it’s more probable to have a message with AFAF compared to AF [...] YY).

Btw, I think line 38 should be "The remaining seventy-seven"

surg0r commented 6 years ago

I think a checksum on the end of the message field is a nice idea, @0xFF0. I would guess @scottdonaldau wants to preserve the existing format for previous notarisation transactions and therefore wishes to keep AFAF as the message encoder.

It might be useful to scan the blockchain to look for messages which adhere to the notary format and if just a handful support both cases in the explorer. This would allow us to have a more 'perfect' message encoding standard without worrying about deprecating previous iterations.

surg0r commented 6 years ago

I have adapted my coinvote scripts to search for every message_tx using AFAF as the initial encoding bytes such as in the case of notary transactions.

Scanning the entire chain reveals just 25 transactions in total to current blockheight 196107.

So perhaps @scottdonaldau we could alter the two initial encoding bytes to something else and incorporate a checksum, but also support the older format for display purposes (though move to new agreed format in the webwallet/explorer)?

As this is all supranode functionality we can have an explorer page serving up details of the message format with a list of approved message types approved via a github repo?

cyyber commented 6 years ago

I like the idea by @0xFF0 . Although I believe that we can ignore AF, saving 1 byte, and may have the format as specified by @0xFF0 .

XX: Indicate a unique encoded message transaction type (0xAF would be for document notarisation).
YY: 1 byte checksum.

Thus before processing any txn, if XX matches with any encoded message type and YY checksum matches, only then process the encoded message.

Although for document notarization we may have an exceptional case. But for the upcoming encoded messages, we can simply use above mentioned format.

surg0r commented 6 years ago

As discussed @cyyber I would think a single encoding byte would be wrong since there is only a 2^8 chance of random message data being incorrectly encoded.

0xFF0 commented 6 years ago

Thanks @surg0r and @cyyber .

I think the format with a checksum is more than 2^8 chance of being incorrectly encoded.

Here's an example of what I was thinking of:

import hashlib

def isEncodedMessage(p_msg, p_blockHeight):
    if 'AF' in p_msg[0:2]:
        checksum = hashlib.sha256(p_msg[:-2]).hexdigest()[0:2]
        if checksum in p_msg[-2:]:
            return True
        elif p_blockHeight < 196107: #Assuming QIP done at block 196107
            if "AFAFA" in p_msg[0:5]:
                return True

    return False

#Old format
oldDocNotarisation = "AFAFA3" + "D41D8CD98F00B204E9800998ECF8427E" #MD5 empty doc

#New format
msg = "AF00" + "QRL".encode("hex")
valid_msg = msg + hashlib.sha256(msg).hexdigest()[0:2]

print isEncodedMessage(valid_msg, 200000) #True
print isEncodedMessage(msg, 200000) #False
print isEncodedMessage(oldDocNotarisation, 196000) #True
print isEncodedMessage(oldDocNotarisation, 200000) #False

The chance of having a random message with AFAF is currently 2^16

By adding a checksum and saving 1 byte (i.e: AF XX [MSG_MAX_77_BYTES] YY), the chance is far more > than 2^16 because of the hash function involved in the checksum format.

scottdonaldau commented 6 years ago

@0xFF0 Just to be clear here, this isn't a hard fork - this is purely a layer 2 standard format that will live inside a MessageTransaction. There is no block number considerations with respect to this QIP - simply defining the encoding structure so that apps like a block explorer or wallet can integrate seamlessly.

cyyber commented 6 years ago

Yes 2^8 will result into a lot of collision, but its important to decide if 2^16 is enough or do we need more bytes to avoid collision.

Although checksum solves this problem, but once atleast first 2 bytes matches with any kind of message encoding format, checksum needs to be calculated which is only possible processing the full message. Since its a public blockchain, probably 2^16 might not be enough. May be 4 bytes to denote message encoding + 1 byte checksum, may fill the shortage for future purposes.

0xFF0 commented 6 years ago

Thanks @scottdonaldau

I was not thinking about a hard fork. I thought that wallet/explorer did have a way to get the block height associated with the message (I have not looked at the code so sorry if I'm wrong about this). The idea of looking at block height was to keep a backward compatibility with the 25 transactions in the blockchain.

@cyyber If a streaming solution is needed (i.e: determine if it's an encoded message without processing the full message), I agree 2^8 is not enough. But don't you think that a checksum based on sha256 (or something similar) would not be enough?

cyyber commented 6 years ago

@0xFF0 Yes I agree with you considering checksum, it could be enough. But calculating checksum has O(n) time complexity, I was just thinking as in future the number of blocks will increase and once the message encoding bytes matches which has O(1) time complexity, then we may need to calculate checksum which is O(n) time complexity. Now if with 2 bytes message encoding, if chances of collision are pretty low, then its fine checksum will work. But if the chances of collision with 2 bytes message encoding, is not very much low, but if it becomes somewhere around moderate, then probably calculating checksum and then confirming its an invalid message, may consume a CPU processing time, this time will have more effect once the number of blocks will increase. Thus if we have 4 bytes of message encoding, then chances of any collision will drop further, and thus the chances will be very low the message is invalid, even after calculating checksum. This may give a computation advantage in future.

surg0r commented 6 years ago

@0xFF0 @cyyber @scottdonaldau: using a 1 byte encoding byte and a 1 byte checksum (no matter how it is calculated - the hash function used doesn't alter the odds!) actually results in exactly 1 in 2^16 chance of a collision, with random data being incorrectly encoded as a valid message and passing checksum in the new format.

With 2 prepended bytes encoding and a single checksum byte the risk of a collision and false positive encoding falls to 1 in 2^24 (16.77M), but the risk of the checksum passing with random data in the message body (again, regardless of function used) is 1 in 2^8 (256).

Therefore, it makes a bit more sense to me to use a single prepended byte to denote our message format with a two byte checksum. Then, the risk of a random collision being encoded as valid in our message format is still tiny at 1 in 2^24 (16.77M). But the accuracy of the checksum for the payload 77 bytes rises 256 fold with the chance of a false positive checksum falling to 1 in 2^16 (65536) which seems acceptable. If we want even more security then a three byte checksum would reduce false positive checksum accuracy to 1 in 2^24 (16.77M) and overall collision risk down to 1 in 2^32 (4,294M).

TLDR: single prepended encoding byte and two or three byte checksum is better imho - 77 or 76 bytes is still plenty of room for data payload

(As a final afterthought, theoretically we actually don't even need an encoding byte at all, a message can be assumed to be following our format simply with a valid 3 byte checksum, with perhaps the first byte simply pointing to message subtype (i.e. byte 0 = 0 ..notary, 1 = coinvote, 2 = something_else) with the above maximum safety and still leaving 76 bytes for use.

edit: @cyyber pointed out the wording of this comment was very odd and i have fixed :-)

cyyber commented 6 years ago

Plan A (3 or 4 prepended bytes encoding and No hard-coded checksum)

Instead of hard coding checksum, we may leave that on user. The real advantage of not having hard coded checksum, is in case user doesn't care much about the correctness of his data, it can simply take leverage of all the remaining bytes to store the data without checksum. Moreover, if a user needs a checksum of 2 bytes or 4 bytes or any variable length, depending upon his own needs, he can simply add checksum on message field of message transaction. Thus having no hard coded byte for checksum, gives user flexibility to have a variable length checksum.

Plan B (3 or 4 prepended bytes encoding and 1 byte hard-coded checksum)

In case we want to hard code checksum, we can hard code 1 byte checksum. This will force user to have checksum of atleast 1 byte for all the messages, whether he needs it or not. Moreover, if user needs checksum of more than 1 byte, lets say 3 bytes checksum, then user may add 1 byte value of checksum in hard-coded checksum field, and rest 2 bytes in its own message. In such a way, user may achieve a variable length checksum where minimum length of checksum is 1 byte.

I prefer Plan A, as it is more cleaner and makes sense for all kinds of use cases.

surg0r commented 6 years ago

Plan C : 2 prepended encoding bytes (1 encoding byte, 1 message type byte) and 2 byte checksum
Plan D: 1 prepended encoding byte (message type), 3 byte checksum

More encoding bytes offer a greater chance to parse the message and potentially avoid the work of checksum entirely, whereas lengthening the checksum and shortening the prepended encoding offers the same chance of a collision risk but with the benefit of powerful error detection - which comes at a cost of work and complexity (which i don't think is actually an issue for this particular application).

That said I have no strong feelings about which way we go as long as well specified and has open access for community to add message types via github in a dedicated repo :-)

theQRL / qips

QIP - MessageTransaction Encoded Message Standard #4