supranational / blst

Multilingual BLS12-381 signature library
Apache License 2.0
467 stars 177 forks source link

Expose uncompress and hash_to functions for JVM bindings #232

Open nau opened 3 weeks ago

nau commented 3 weeks ago

uncompress and hash_to are not exposed in JVM bindings. This PR exposes those to JVM.

These functions are useful and are used in Cardano and specificly in JVM implementation of Cardano smart contracts VM.

dot-asm commented 3 weeks ago

uncompress and hash_to are not exposed in JVM bindings.

As for hash_to. What is it that makes the current method unusable? As in https://github.com/supranational/blst/blob/master/bindings/java/runnable.java#L17. As for uncompress. The constructors actually can handle either format. You can confirm it by replacing serialize() with compress() in the upper half of the runnable.java and note that the lower half doesn't need any modifications even though the "wire" format is now compressed. Trouble with the suggestion to add specifically [static] uncompress method is that it asks for the question why not the other one, in which case it wouldn't be inappropriate to have a single method that can handle either format for which, say, deserialize would be a more appropriate name...

dot-asm commented 3 weeks ago

What is it that makes the current [hash_to] method unusable?

One of the commit messages says "DST as byte array." Is this it? How come? String is the most natural type for DST...

nau commented 3 weeks ago

One of the commit messages says "DST as byte array." Is this it? How come? String is the most natural type for DST...

Yes, I agree. The problem arises when you try to use non-ascii chars in a string, though. I have a couple of failing tests in Cardano conformance test cases with this input:

bls12_381_G1_hashToGroup(#3f, DST=#12345678)

What seems to happen is that I encode a DST String as Latin, because not every byte array is a valid UTF-16 or UTF-8 encoded string.

SWIG generates code that reads those as UTF-8 encoded, which results in different byte array and thus, different hash/point.

There are 3 solutions:

  1. Overload hash_to as I did
  2. Change SWIG typemap
    %typemap(in) (const byte *STRING, size_t LENGTH) %{
    if ($input->IsArrayBufferView()) {
        auto av = v8::Local<v8::ArrayBufferView>::Cast($input);
        auto buf = av->Buffer();
        $1 = ($1_ltype)buf->GetData() + av->ByteOffset();
        $2 = av->ByteLength();
    } else if ($input->IsString()) {
        auto str = v8::Local<v8::String>::Cast($input);
        $2 = SWIGV8_UTF8_LENGTH(str);
        $1 = ($1_ltype)alloca($2);
        SWIGV8_WRITE_UTF8(str, (char *)$1, $2);
    } else if ($input->IsNull()) {
        $1 = nullptr;
        $2 = 0;
    } else {
        SWIG_exception_fail(SWIG_TypeError, "in method '$symname', "
                                            "expecting <Buffer> or <String>");
    }
    %}

    but I don't know how.

  3. Ignore the tests.

What do you think?

dot-asm commented 3 weeks ago

What does DST=#12345678 mean?

But in either case, I'd say that there is a 4th option, to perform sensible tests that reflect real life. DST-s are determined at the protocol design stage and it's perfectly reasonable to expect that protocol designers, being humans, would choose printable strings, at least in order to communicate it to coworkers ;-)

As for 1. The problem is that it would require corresponding adjustments to all cases where DST is used. Adding just the method in question just in order to pass the test in question begs for the 4th option.

As for 2. The referred snippet is for Javascript, not Java. I don't think one can pull off something similar in JNI. At least not with SWIG... Just in case for reference, the conversion to UTF8 stems from SWIG. It should be noted that it's reasonable to expect the conversion to be successful, because as far as I understand Java strings are maintained "printable." In the sense that not-printable byte sequences are replaced with something "printable" like �, i.e. you can't instantiate a String that won't be convertible to UTF8.

As for 3. Well, I'm not in position to give you the permission :-)

dot-asm commented 3 weeks ago

As for 2. The referred snippet is for Javascript, not Java. I don't think one can pull off something similar in JNI. At least not with SWIG...

And even if it is possible, why would we want to jump through the hoops just to make an arguably non-representative test pass? For reference, as for Javascript, the referred conversion is not DST-specific, it serves multiple purposes.

dot-asm commented 3 weeks ago

you can't instantiate a String that won't be convertible to UTF8.

Hmmm, I might be wrong... At least things vary with LANG environment variable in jshell...

nau commented 3 weeks ago

My main concert is that DST is a byte array, and nowhere in any standard it's specified to be ASCII encoded string. It's assumed and recommended to be so, but not required.

In Cardano, hashToGroup function receives a ByteString arguments, which is any bytes array.

If we treat it as UTF-8 encoded we might get surpising results in some cases.

I don't like this change either, but what do we do? Ignore it?

If we could make sure that Java bindings use ISO-8859-1/Latin1 encoding while encoding/decoding the DST String that would solve the issue all together, I think.

nau commented 3 weeks ago

As for uncompress. Cardano has builtins compress/uncompress.

def bls12_381_G1_compress(p: BLS12_381_G1_Element): ByteString
def bls12_381_G1_uncompress(bs: ByteString): BLS12_381_G1_Element

I didn't find an obvious way to get those in Java bindings. If you could tell me how to do that I'd much appreciate it. And I'll remove this change from the PR.

nau commented 2 weeks ago

Oh, one more thing. For len(DST) > 255 DST = H("H2C-OVERSIZE-DST-" || a_very_long_DST) which results in a byte array that is not a valid UTF-8 string.

So essentially, current implementation has this bug for Java (and probably JavaScript as well)

dot-asm commented 2 weeks ago

I don't like this change either, but what do we do? Ignore it?

Ignore what? Suggested modification? I'm not going to merge it, because it doesn't actually make it possible to use the remaining parts of the Java bindings with non-printable DST, but merely makes an arguably non-representative test pass. Or are we talking about ignoring the remote possibility of using non-printable string for DST? I admit I'm resistant to considering it, because I see no indication that it won't result in an unnecessary moving part/dead weight. In other words I'm disputing it, but I don't write it off totally.

If we could make sure that Java bindings use ISO-8859-1/Latin1 encoding while encoding/decoding the DST String that would solve the issue all together, I think.

On the native side of JNI UTF-8 is all you get, so that a Java String instantiated with ISO-8859-1/Latin1 encoding will be viewed through the UTF-8 lens no matter what. At the same time even .getBytes() from such string varies with LC_CTYPE, i.e. current locale. Indeed, I get byte[1] { -128 }, byte[2] { -62, -128 }, byte[1] { 63 } from the following jshell snippet

var b = new byte[1];
b[0] = -128;
new String(b, "Latin1").getBytes();

It's not unlike that it would be something else on Windows. In other words forget about ISO-8859-1/Latin1.

dot-asm commented 2 weeks ago

I didn't find an obvious way to get those in Java bindings. If you could tell me how to do that I'd much appreciate it.

Again, a_point.compress() returns serialized ~data~ byte[] in compressed format, and new P1_Affine(serialized_bytes_in_compressed_format) returns a point. ("Again" is a reference to the prior suggestion to experiment with runnable.java snippet.) BTW, I have hard time relating to your code references, so could you translate it to more digestible ~language~terms?

dot-asm commented 2 weeks ago

Oh, one more thing. For len(DST) > 255

You just pass the long DST and blst will take care of it. In other words application is not expected to do something about oversized DSTs.