supranational / blst

Multilingual BLS12-381 signature library
Apache License 2.0

Add blst_core_verify_pk_in_g1 binding #137

Closed lukanus closed 1 year ago

lukanus commented 1 year ago

Hey, blst is missing some Go bindings, but this one is quite significant for golang-managed threads. Is there a chance to get it merged? :)

dot-asm commented 1 year ago

If there is a need for this tailored/simplified interface, we'd be better off interfacing to the private coreAggregateVerifyPkInG{1,2}. Because it will perform the operation in two threads and deliver better performance ;-) In other words "no" for this implementation. If you don't want to [or can't] spend time on figuring out how to interface to the coreAggregateVerify, create an issue instead.
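To sketch what that looks like from the caller's side: the public Verify wrapper is already backed by the core aggregate path, even for a single message. (Method names below follow the public Go bindings; treat exact signatures, and the eth2 DST used as an example, as approximations, not a spec.)

```go
package main

import (
	"crypto/rand"
	"fmt"

	blst "github.com/supranational/blst/bindings/go"
)

// Ciphersuite DST; the eth2 string is used here only as a familiar example.
var dst = []byte("BLS_SIG_BLS12381G2_XMD:SHA-256_SSWU_RO_POP_")

func main() {
	var ikm [32]byte
	rand.Read(ikm[:])
	sk := blst.KeyGen(ikm[:])
	pk := new(blst.P1Affine).From(sk)

	msg := []byte("single input")
	sig := new(blst.P2Affine).Sign(sk, msg, dst)

	// Verify routes a one-message verification through the core
	// aggregate machinery, which is where the second goroutine
	// mentioned above comes from.
	fmt.Println(sig.Verify(true, pk, true, msg, dst))
}
```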

lukanus commented 1 year ago

@dot-asm If I'm correct, you're talking about the use case where you calculate a bigger payload and would like to distribute it over more threads. What this particular PR and binding are for is processing a massive number of smaller, unrelated payloads. In this case we very much like to have this process run on one and only one thread, because hundreds of other payloads are being processed on different system threads at the same time.

dot-asm commented 1 year ago

you're talking about the use case where you calculate a bigger payload

No, even single-input verification benefits from multi-threading. Significantly.

we very much like

Well, it's not really about what any particular party likes, but about what's best overall ;-) In this context one can reformulate it as follows: would using a pair of ~~threads~~ goroutines for single-input verification be detrimental in the case you describe? The amount of computational work is the same no matter how you slice it. Yes, using a pair of threads for single-input verifications initiated from multiple threads would result in more context switches and a slightly larger memory footprint. But the computational part totally dominates the execution time, and the additional overhead is negligible in comparison. On the other hand, the benefit for the other use case would be significant, ~40%.

lukanus commented 1 year ago

OK, so let me try to understand the changes you were talking about. That's the function, right? https://github.com/supranational/blst/blob/master/bindings/go/blst.go#L555-L562

Imagine I need to verify 10k simultaneous payloads, at "the same time". I can run 100 goroutines that simultaneously process 100 payloads. Each payload is exactly one msg (a slice); how does a loop of separate goroutines help here? The benefit in that case would need to come from asynchronous calculations of blst_aggregated_in_g2 before the main gets returned?
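Roughly this dispatch pattern, as a sketch (all names are placeholders, not blst API):

```go
package sketch

import "sync"

// payload stands in for one independent (sig, pk, msg) triple; verifyOne
// stands in for a strictly single-threaded verification.
type payload struct{ sig, pk, msg []byte }

func verifyOne(p payload) bool { return true } // placeholder

// verifyAll: a fixed pool of workers, each handling one payload at a
// time, with no further fan-out inside a single verification.
func verifyAll(payloads []payload, workers int) []bool {
	results := make([]bool, len(payloads))
	jobs := make(chan int)
	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = verifyOne(payloads[i])
			}
		}()
	}
	for i := range payloads {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}
```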

dot-asm commented 1 year ago

Each payload is exactly one msg (a slice); how does a loop of separate goroutines help here?

For a single input there won't be a loop of goroutines; there will be one additional goroutine spawned. If there is a CPU available, the subroutine as a whole will finish significantly sooner. And if there are no free CPUs, it will be a little slower because of the synchronization overhead.

The benefit in that case would need to come from asynchronous calculations of blst_aggregated_in_g2 before the main gets returned?

Well, what's "main" here? But either way, yes, the gain is that blst_aggregated_in_g2 gets a chance to execute in parallel with blst_pairing_commit. The two are the second most expensive operations. (Just in case for reference, the most expensive operation is the final exponentiation.)

dot-asm commented 1 year ago

the benefit for another use case would be significant, ~40%.

Oops, this is an overstatement. I simply calculated the ratio between (2*miller_loop + final_exp) and (miller_loop + final_exp), but one thread has more to do, namely hashing the message. So the gain would be less than 40%, but still significant.
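For concreteness, taking the Miller loop as roughly twice the cost of the final exponentiation (an illustrative ratio, not a measurement):

$$\frac{2\cdot\mathrm{ML} + \mathrm{FE}}{\mathrm{ML} + \mathrm{FE}} = \frac{2\cdot 2 + 1}{2 + 1} \approx 1.67,$$

i.e. about 40% less wall time when the two Miller loops overlap; once the message hashing that stays on the serial path is counted, the gain drops below that.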

lukanus commented 1 year ago

So, you still assume that you have a CPU available :) If I have to process more messages than my CPU can handle in bursts, so that I need to queue them up, there is no gain for me in adding scheduler overhead, introducing additional context switching, doing additional allocations, and spending time transferring data through CGO. Nor do we need to invoke the Go scheduler, which does not really help in my scenario. Since we're talking about bindings for another language, it would be great to leave the flow mechanics to the user of this lib.

To be honest, I really didn't think it would be a problem to just enable this particular one, as there are no changes to your code.

If you really dislike enabling this whole thing, maybe we can do it slightly differently:

- export blst_aggregated_in_g{1,2} so people can decide to run it on a different thread;
- create and export a C function like PAIRING_Aggregate_PK_in_G1, just without https://github.com/supranational/blst/blob/master/src/aggregate.c/#L325-L332 (which, as I understand it, is the blst_aggregated_in_g2 part, or is that a commit?);
- and a third one that does both PAIRING_Commit and PAIRING_FinalVerify.

Preferably not sharing the underlying C context, but running as separate beings.

This way people can decide truly independently how to run things, whichever way is better in their case.
As you can see in my PR, the entire operation performs better in my particular case, with several times fewer allocations.
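Roughly, from the caller's side (all names below are hypothetical, sketching the proposed exports):

```go
package sketch

type gt struct{}
type pairing struct{}

// Hypothetical exports proposed above; none of these names exist in the
// blst bindings today.
func AggregatedInG2(sig []byte) gt                      { return gt{} }
func PairingAggregatePkInG1(p *pairing, pk, msg []byte) {}
func CommitAndFinalVerify(p *pairing, gtSig gt) bool    { return true }

// With the pieces split, the caller decides the threading: here
// everything runs on one goroutine; a caller who wants overlap can move
// the AggregatedInG2 call into its own goroutine instead.
func verifyCallerScheduled(sig, pk, msg []byte) bool {
	p := new(pairing)
	gtSig := AggregatedInG2(sig)
	PairingAggregatePkInG1(p, pk, msg)
	return CommitAndFinalVerify(p, gtSig)
}
```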

dot-asm commented 1 year ago

So, you still assume that you have CPU available :)

Again, the assertion is that if there is a CPU available, it's a significant win, and if there is none, it's virtually no loss, even in the scenario you describe. If you can refute [the second part of] the assertion, do it with data. [And preferably with a better rationale for why you would start 100 goroutines. If the rationale is "because it's cheap," then why not 200?]
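For example, a benchmark along these lines (in a _test.go file; the two verify functions are placeholders for the implementations under discussion):

```go
package sketch

import "testing"

// Placeholders for the single-goroutine and two-goroutine variants.
func verifyOneGoroutine() bool  { return true }
func verifyTwoGoroutines() bool { return true }

// benchSaturated keeps all CPUs busy with concurrent verifications,
// which is the queue-saturation scenario in question.
func benchSaturated(b *testing.B, verify func() bool) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			if !verify() {
				b.Fail()
			}
		}
	})
}

func BenchmarkOneGoroutine(b *testing.B)  { benchSaturated(b, verifyOneGoroutine) }
func BenchmarkTwoGoroutines(b *testing.B) { benchSaturated(b, verifyTwoGoroutines) }
```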

As for the why-not-this-and-that: it's not that we're not open to suggestions, but the expectation is that they're more concrete, as in "this is a use case, and this is the bottleneck." The objective is the overall experience, not answering speculative questions.