P2AffinesAdd crashes when executing in alphine docker image

maxkaneco commented 1 year ago

I am getting crash on Cgo execution of P2AffinesAdd. Wonder if you guys have any advice? Signature and verification are fine so I guess it only affects certain APIs.

Docker base: golang:1.18-alpine

Crash stacktrace:

fatal: morestack on g0
SIGTRAP: trace trap
PC=0x48c2e2 m=8 sigcode=128
signal arrived during cgo execution

goroutine 49 [syscall]:
runtime.cgocall(0xf6cda0, 0xc000de3480)
        /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc000de3458 sp=0xc000de3420 pc=0x42481c
github.com/photon-storage/blst/bindings/go._Cfunc_blst_p2s_add(0xc000196a20, 0xc003690000, 0x400)
        _cgo_gotypes.go:1115 +0x45 fp=0xc000de3480 sp=0xc000de3458 pc=0x9bafa5
github.com/photon-storage/blst/bindings/go.P2AffinesAdd.func2(0xc000300400?, 0xc000de3508, 0x9baeec?, 0x1543d90)
        /go/pkg/mod/github.com/photon-storage/blst@v1.0.1/bindings/go/blst.go:2599 +0xf3 fp=0xc000de34c8 sp=0xc000de3480 pc=0x9c6833
github.com/photon-storage/blst/bindings/go.P2AffinesAdd({0xc003690000, 0x400, 0x400}, {0x0?, 0x4384d3?, 0x4000?})
        /go/pkg/mod/github.com/photon-storage/blst@v1.0.1/bindings/go/blst.go:2599 +0x65 fp=0xc000de3508 sp=0xc000de34c8 pc=0x9c66e5
github.com/photon-storage/go-photon/crypto/por.(*Verifier).VerifyBlocks(0xc0012e2770, {0xc0034c0000, 0x400, 0xf5394a60a035baa9?},

dot-asm commented 1 year ago

Is it 100% reproducible or does it fail occasionally? Since you mention it explicitly, is it actually alpine-specific, or is it so you simply happen to use it? Since you mention it explicitly, were previous go versions working, or is 1.18 simply the first go version you compile blst with?

It's not impossible to imagine that the problem is due to an uncertainty introduced by _cgoCheckPointer := func(...interface{}) {} in the function in question. Rationale behind it was that since a) we know that the pointers point to pure data and b) we know that the C side doesn't modify pointers, the pointer check is safe to suppress. The check was causing problems but they were deemed to be false positives. But this assessment was done at specific point in time, and might be not entirely correct by now. Hence the questions.

Is caller's code publicly available? I've found go-photon on github, but there is no VerifyBlocks there...

maxkaneco commented 1 year ago

It is 100% reproducible and always fails. I have tried the same docker image in both EC2 host and my macbook, they both fail. I have also tried docker image based on "golang:1.18.8-bullseye", it is fine.

The crashing docker image is based on "golang:1.18-alpine" with "apk --no-cache --update add gcc musl-dev build-base bash git"

maxkaneco commented 1 year ago

Our repo is not ready to open to public yet as it is still very rough and WIP. I paste the function below. Let me know if you need any further information.

// VerifyBlocks used to batch verify POR proofs of multiple blocks
func (v *Verifier) VerifyBlocks(sigs, blks [][]byte, indexes []uint32) error {
    blkSize := len(blks)
    if len(sigs) != blkSize || len(indexes) != blkSize {
        return ErrBlockSizeMismatch
    }

    mus := make([]*big.Int, v.spb)
    for j := range mus {
        mus[j] = big.NewInt(0)
    }
    for i := range blks {
        for j := uint32(0); j < v.spb; j++ {
            s := BlockSector(blks[i], j)
            mus[j] = new(big.Int).Add(mus[j], s)
        }
    }

    sigmas := make([]*blst.P2Affine, blkSize)
    for i, sig := range sigs {
        sigmas[i] = new(blst.P2Affine).Uncompress(sig)
    }
    sgm := blst.P2AffinesAdd(sigmas)

    var aggs []*blst.P2Affine
    for _, index := range indexes {
        aggs = append(aggs, BlockTag(v.oid, index).ToAffine())
    }
    for i, u := range v.us {
        aggs = append(aggs, blst.P2Mult(u, ReduceToScalar(mus[i])))
    }

    p1, err := v.pk.GetRaw()
    if err != nil {
        return err
    }

    a := blst.Fp12MillerLoop(sgm.ToAffine(), blst.P1Generator().ToAffine())
    b := blst.Fp12MillerLoop(blst.P2AffinesAdd(aggs).ToAffine(), p1)
    if !blst.Fp12FinalVerify(a, b) {
        return ErrVerificationFailure
    }
    return nil
}

dot-asm commented 1 year ago

I have also tried docker image based on "golang:1.18.8-bullseye", it is fine.

So it sounds like a musl vs glibc thing... Not that it totally explains the failure on alpine, but at least there is some hint... However, I see no reason for why you wouldn't be able to restructure your code to not rely on an array of pointers. And it would be more efficient too. So instead of sigmas := make([]*blst.P2Affine, blkSize) do sigmas := make([]blst.P1Affine, blkSize), in which case you would call sgm := blst.P2Affines(sigmas).Add(). As for the second Miller loop call. Its input appears to be a sum of points and result of a multi-scalar multiplication. You would be better off using the corresponding method for the latter. Note that there are Add and Mult for both P2Affines and P2s. Latter perform efficient conversions to affine so you don't have to do it one by one... As for scalars. How wide are they? Either way, the best option is to lay them down as flat array of bytes. And there also will be a more efficient multi-Miller-loop interface...

mratsim commented 1 year ago

How many points are you adding?

When going over 1024 G2 points or 2048 G1 points I have a crash as well (https://github.com/mratsim/nim-blscurve/pull/1#issue-1419851542)

It appears here: https://github.com/supranational/blst/blob/6382d67/src/bulk_addition.c#L148-L149

The C-API seems to have a double pointer/array indirection, could it be that the code assumes the points are in batches of 1024/2048?

dot-asm commented 1 year ago

I'm not convinced that this is the same problem. The [original] problem is believed to have everything to do with how Go pointers are handled on the Cgo interface. Or more specifically Go array of Go pointers. And suggestion was to restructure the application code to avoid it.

As for the 2nd question. No, C-API doesn't expect specific batch sizes. But it does claim temporary storage on the stack to accommodate the internal batching, [up to] 288KB. In Go it's not a problem, because C subroutine is executed on own stack, one with "normal" limit, where 288KB are readily available. But can you be hitting your stack ceiling, @mratsim?

dot-asm commented 1 year ago

Just in case for reference, what's up with internal batching. The algorithm in question requires temporary storage, by where to get is from? blst doesn't do heap allocations, so we take it from stack. But it's not safe to assume that arbitrary amount of stack is available, so we iterate in smaller steps. On the other hand we want to amortize inversion costs across larger amount of points. So where to draw the line? Somewhere within 1-2-3% performance penalty...

dot-asm commented 1 year ago

Somewhere within 1-2-3% performance penalty...

Though I think that the evaluation was done with slower inversion subroutine... In other words it might turn out that reducing the internal batch size could be acceptable...

mratsim commented 1 year ago

Long reply, a bit off-topic regarding OP problem then.

As for the 2nd question. No, C-API doesn't expect specific batch sizes. But it does claim temporary storage on the stack to accommodate the internal batching, 288KB. In Go it's not a problem, because C subroutine is executed on own stack, one with "normal" limit, where 288KB are readily available. But can you be hitting your stack ceiling, @mratsim?

It's unlikely, I'm not using MSVC which limits stack to 1MB, my ulimit -a shows 8MB

Just in case for reference, what's up with internal batching. The algorithm in question requires temporary storage, by where to get is from? blst doesn't do heap allocations, so we take it from stack. But it's not safe to assume that arbitrary amount of stack is available, so we iterate in smaller steps. On the other hand we want to amortize inversion costs across larger amount of points. So where to draw the line? Somewhere within 1-2-3% performance penalty...

I don't think anyone has issue with internal batching, in particular quoting myself in my own implementation: https://github.com/mratsim/constantine/blob/53a5729/constantine/math/elliptic/ec_shortweierstrass_batch_ops.nim#L269-L286

We chunk the addition to limit memory usage especially as we allocate on the stack.

From experience in high-performance computing, here are the constraints we want to optimize for

MSVC limits stack to 1MB by default, we want to use a fraction of that.

We want to use a large fraction of L2 cache, but not more.

We want to use a large fraction of the memory addressable by the TLB.

We optimize for hyperthreading with 2 sibling threads (Xeon Phi hyperthreads have 4 siblings). Meaning we want to use less than half the L2 cache so that if run on siblings threads (same physical core), the chunks don't evict each other.

Hardware:

a Raspberry Pi 4 (2019, Cortex A72) has 1MB L2 cache size

Intel Ice Lake (2019, Core 11XXX) and AMD Zen 2 (2019, Ryzen 3XXX) have 512kB L2 cache size

After one chunk is processed we are well within all 64-bit CPU L2 cache bounds as we halve after each chunk.

The part that is confusing to me is the function signature / double pointer indirection, why

void blst_p2s_add(blst_p2 *ret, const blst_p2_affine *const points[], size_t npoints);

over

void blst_p2s_add(blst_p2 *ret, const blst_p2_affine const points[], size_t npoints);

dot-asm commented 1 year ago

The part that is confusing to me is the function signature / double pointer indirection, why

For flexibility. Note that it traverses the array of pointers till NULL, after which it treats the last pointer as an array of points. So that if you have an array of points you can simply blst_p2_affine *temp[2] = { array, NULL} and pass temp. But if the subroutine accepted array of points and you had an array of pointers, then you'd have to allocate an array and copy points...

dot-asm commented 1 year ago

I don't think anyone has issue with internal batching

The remark was more for the OP :-) As already said "just in case" :-)

dot-asm commented 1 year ago

my ulimit -a shows 8MB

Yes, but the main thread can start threads with smaller stacks... I'm not saying that it does, but it's the only thing I can think of give the circumstances. Can you collect the failing address and compare it to the current stack pointer? Just in case, I can pass more points to the subroutines in question...

supranational / blst

P2AffinesAdd crashes when executing in alphine docker image #142