noir-lang / noir

Noir is a domain specific language for zero knowledge proofs
https://noir-lang.org
Apache License 2.0

Memory overflow while compiling BLS Signature Verification #3380

Open onurinanc opened 8 months ago

onurinanc commented 8 months ago

Aim

Here is my BLS signature verification implementation in Noir: https://github.com/onurinanc/noir-bls-signature. When I tried to run it, the CLI crashed. Two more people have tried to run it and hit the same issue.

Here is the corresponding issue filed for the BLS signature verification compilation: https://github.com/onurinanc/noir-bls-signature/issues/2

In summary,

@stefan-nikolov96 says that

"I am unable to run test_verify_bls_signature() with over 256GiB in RAM and just as much in swap memory. I get a memory overflow and then the program recieves a SIGKILL signal from the OS.

I tried running nargo compile in debug and release mode, with compiler versions 0.9.0 and 0.15.0.

GDB shows that nargo overflows in the compiler frontend. On a machine with 64GiB RAM, it uses about 50GiB during inlining and the program fails during mem2reg https://github.com/noir-lang/noir/blob/master/compiler/noirc_evaluator/src/ssa/opt/mem2reg.rs"

Expected Behavior

The compiler shouldn't crash.

Bug

Memory overflow, after which the program receives a SIGKILL signal from the OS.

To Reproduce

  1. Run the verify_bls_signature function

To investigate what's happening under the hood, you can check out the test_proving_time branch in the repository.

When you run nargo prove, if you comment out line 326 inside pairing.nr, it works correctly.

If you do not comment out this line, it is not going to work (due to memory issues).

You can reproduce the issue quickly using this repo: https://github.com/onurinanc/test-bls-proving-time

Installation Method

Binary

Nargo Version

nargo 0.17.0 (git version hash: 86704bad3af19dd03634cbec0d697ff8159ed683, is dirty: false)

Additional Context

No response

Would you like to submit a PR for this Issue?

No

Support Needs

No response

TomAFrench commented 8 months ago

If we're running into an issue in the mem2reg pass, is it possible this is due to the same issue with loop unrolling of brillig that affected the RSA library?

@onurinanc if you compile this on the nightly version of nargo do you still get the same issue?

jfecher commented 8 months ago

@TomAFrench If it is crashing during mem2reg I doubt they are the same. The brillig issue affecting the RSA library is each ArraySet operation in brillig copying the array. Brillig isn't evaluated until later, and this isn't an issue in SSA, since SSA doesn't copy the whole array for each ArraySet.

From looking at the description quickly, if inlining is taking 50GiB then I'd assume the generated SSA is already quite large. The reality may be that, depending on the source program (which I haven't looked at), this may be necessary. This can happen if the source program uses large or nested loops, recursion, etc., which necessitate many SSA instructions once every function is inlined, and later again when each loop is unrolled. We can try to reduce memory usage by decreasing the size of each instruction (and freeing no-longer-used ones more often), but this still may not fix it if the program is too large.
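For a rough sense of the scale involved, here is a tiny back-of-the-envelope sketch; the loop bounds and per-body instruction counts are invented placeholders, not measurements of the BLS library:

```rust
// Full inlining plus loop unrolling duplicates a loop body once per iteration
// of every enclosing loop, so SSA instruction counts multiply.
fn unrolled_instruction_count(loop_bounds: &[usize], body_instructions: usize) -> usize {
    loop_bounds.iter().product::<usize>() * body_instructions
}

fn main() {
    // Hypothetical numbers: a 64-iteration outer loop around a 6x6 limb loop,
    // with ~50 SSA instructions in the innermost body, at a single call site.
    println!("{}", unrolled_instruction_count(&[64, 6, 6], 50)); // 115200
}
```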

onurinanc commented 8 months ago

@TomAFrench I couldn't compile with either 0.9.0 or 0.17.0

stefan-nikolov96 commented 8 months ago

@TomAFrench nargo failed during mem2reg while I was running it in GDB in debug mode; my PC ran out of memory at that pass, though if I had more memory it would probably overflow further down the line. The memory used may be exaggerated due to nargo being run in debug mode. However, @onurinanc states in his BLS documentation that he has previously run bls_verify on a 16GiB machine, so I'm assuming it's due to some breaking change in the noirc frontend.

onurinanc commented 8 months ago

@stefan-nikolov96 No, I was not able to run it with 16 GiB

wwared commented 8 months ago

I've run into this same issue trying to run the noir-bn254 repo (https://github.com/onurinanc/noir-bn254/issues/3) with any Noir version past 0.9.0. The issue is the memory usage of the mem2reg pass as well. I'll document my investigation and findings below to try and help shed some light on the issue:

The main issue seems to be that the first mem2reg optimization pass runs out of memory. Adding debug prints around the code, it seems that in the case of the noir-bn254 repo the problem is that there are too many basic blocks (around 450k) during that first optimization pass, and self.blocks gets larger and larger with every iteration, with more and more time spent cloning some very large BTreeMaps.

Specifically, every block iterated always gets stored in self.blocks and is never removed until the end of the optimization pass. The problem here is that the Blocks themselves hold BTreeMaps which grow as the blocks get processed. I don't have the specific numbers on me, but IIRC the number of elements in the aliases inside each stored block grows as more and more basic blocks get processed by mem2reg, which happens as the basic blocks get unified and the alias sets of the blocks get combined (since the non-empty sets just get union'd together). The problem is that every time a block is unified, the alias set unification creates a full clone of the entire (growing) alias set for every expression, and each block in self.blocks contains a copy of this ever-growing map. Since there is no memory sharing or pruning of any of these BTreeMaps, the memory usage grows until OOM.
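A minimal sketch of the growth pattern described above, with invented stand-in types rather than the actual noirc_evaluator code:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Illustrative stand-ins for the real compiler types; names are made up.
type Expression = u32;
type AliasSet = BTreeSet<u32>;

#[derive(Default, Clone)]
struct Block {
    // One alias set per expression; these sets keep growing as blocks are unified.
    aliases: BTreeMap<Expression, AliasSet>,
}

impl Block {
    // Unifying two blocks deep-copies and unions every alias set. With ~450k
    // blocks all kept alive until the end of the pass, each holding its own
    // copy of an ever-growing map, memory usage climbs until OOM.
    fn unify(&self, other: &Block) -> Block {
        let mut result = self.clone(); // full deep copy of every map and set
        for (expr, aliases) in &other.aliases {
            result
                .aliases
                .entry(*expr)
                .or_default()
                .extend(aliases.iter().copied());
        }
        result
    }
}

fn main() {
    let mut a = Block::default();
    a.aliases.entry(0).or_default().insert(1);
    let b = a.clone();
    let unified = a.unify(&b);
    assert_eq!(unified.aliases[&0].len(), 1);
}
```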

I've tried understanding a bit more about mem2reg by taking a look at https://hackmd.io/@JakeF/S1FA9rTnh. I believe this issue could be partly mitigated by pruning blocks from the self.blocks list once we're sure they're no longer necessary (all of their successors have already been processed, so find_starting_references will never return the block again and it can safely be removed). Things could be improved even further by pruning the alias sets or sharing memory between copies of Blocks.
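A rough sketch of that pruning idea, using hypothetical names and a simplified view of the pass state:

```rust
use std::collections::{BTreeMap, BTreeSet};

type BasicBlockId = u32;
struct Block; // placeholder for the per-block state kept by the pass

struct Blocks {
    blocks: BTreeMap<BasicBlockId, Block>,
    processed: BTreeSet<BasicBlockId>,
}

impl Blocks {
    // Hypothetical pruning step: once every successor of a block has been
    // processed, find_starting_references can never ask for that block again,
    // so its stored state can be dropped instead of kept until the end.
    fn prune_finished(&mut self, successors: &BTreeMap<BasicBlockId, Vec<BasicBlockId>>) {
        let processed = &self.processed;
        self.blocks.retain(|id, _| {
            successors
                .get(id)
                .map(|succs| !succs.iter().all(|s| processed.contains(s)))
                .unwrap_or(false)
        });
    }
}

fn main() {
    let mut state = Blocks { blocks: BTreeMap::new(), processed: BTreeSet::new() };
    state.blocks.insert(0, Block);
    state.blocks.insert(1, Block);
    state.processed.insert(1);

    // Block 0's only successor (1) has been processed, so it gets pruned;
    // block 1's successor (2) has not, so it is kept.
    let successors = BTreeMap::from([(0, vec![1]), (1, vec![2])]);
    state.prune_finished(&successors);
    assert_eq!(state.blocks.len(), 1);
}
```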

I've done some rudimentary experiments. First, I essentially disabled the implementation by making find_starting_references always return Block::default() and never inserting into self.blocks. This keeps memory usage very low, but as expected it turns off a very important optimization, and the resulting ACIR might have issues and is far too unoptimized (though it confirms the other SSA passes are not the cause of the OOM). Another thing I attempted was adding processed blocks to a BTreeSet and then regularly pruning self.blocks by removing all blocks whose successors have already been processed. Here is a patch showing my attempt (with some extra debug prints). Doing this allowed the memory usage to stay low enough for compilation to finish after giving it about 300GB of memory (RAM + swap), but the resulting ACIR was still very slow to execute. To be precise, running noir in release mode and increasing the stack limit with ulimit allowed the compilation to finish in a few minutes after using about 250-300GB of memory, but evaluating the ACIR to actually run the test then took about 35 hours to complete (though it did successfully execute without erroring). Also, weirdly, the number of elements in self.blocks never went as low as I expected; only roughly 1/4th to 1/5th of the blocks added were being removed by my retain call, so I might have misunderstood how the code works.

Another (mostly unrelated) issue I've run into, because the compiled code has so many basic blocks, is that after getting past mem2reg, the flatten_cfg pass will stack overflow due to the recursive nature of the code implementing the pass (there is another stack overflow when trying to print the post-inlining SSA too). Increasing the stack limit with ulimit was enough to work around this.

I've (unsuccessfully) attempted to make a minimal repro test case that triggers the same mem2reg behavior, but I believe the root cause of this problem might be some combination of the following:

The big unknown here is that it's unclear to me whether reducing the memory usage would be enough by itself for the compiled program to be actually efficient when compared to the 0.9.0 codebase. I'm not particularly familiar with how compilation worked in version 0.9.0 with the pre-SSA backend (please enlighten me if you can), but the fact that working around the mem2reg memory issue seemed to still take a very long time to finish evaluation/proving even after the ACIR compilation finished makes me wonder if the resulting program really is just too big for SSA compilation because of too many variables/loops/functions like @jfecher said.

It's also unclear to me which specific source code patterns lead to inefficient SSA being generated. If we could isolate a specific problematic code pattern that leads to these issues in the SSA compilation, we might be able to rewrite the bigint library to remove these patterns, if that is even possible. (For example, is it worth trying to write a bigint library that does not use arrays and removes all the ifs we can from basic operations?)

TomAFrench commented 8 months ago

I've taken a look at your library and noticed a couple of places where we can reduce the size of the circuit as well. I don't think this would fix this issue by itself but it will help:

https://github.com/onurinanc/noir-bls-signature/blob/510a24f6caebd033e5eea293b92981fe0a08dc6c/src/bls12_381/g2.nr#L100-L141

In this function you've added special behaviour for when one of the inputs is equal to zero or the inputs are equal to each other. When generating a circuit this should actually increase the cost of the function, as we will need to generate constraints for all potential codepaths.

TomAFrench commented 8 months ago

The bigint library also has a large number of functions that contain if blocks (in particular, multiplication has an if block inside a doubly-nested for loop), which compile to a very large number of basic blocks when many multiplications are performed. All of those basic blocks end up with many references to individual array elements too.

~Seems like this could be simplified to~

let k_mod = k as u56 % NUM_LIMBS as u56;
let (n, c) = utils::mac(hi.limbs[k_mod], self.limbs[i], other.limbs[j], carry);
hi.limbs[k_mod] = n;
carry = c;

~This shouldn't affect the final circuit size as both the if-statement and modulo are optimized away during compilation but it could ease the memory consumption.~

I just noticed that the different branches act on hi and lo as well so the above snippet is incorrect.

onurinanc commented 8 months ago

@shuklaayush implemented the above snippet in the noir-bigint library, so we should add him to this conversation as well. I am not sure whether it is correct or not, since all the prime field tests pass.

TomAFrench commented 8 months ago

I think there may be a misunderstanding. The mistake was made by me, thinking that we could remove the if statement.

stefan-nikolov96 commented 8 months ago

@TomAFrench any update on this?

jfecher commented 8 months ago

@stefan-nikolov96 Hi, apologies for the lack of updates. I've started working on this issue today (thanks @wwared for the analysis). I don't have a fix yet, but I have some initial observations from running the mem2reg pass on the test-bls-proving-time repository:

So switching to shared map implementations improves things but does not completely fix the problem. I'm looking into pruning old entries in the blocks list now.

Edit: For reference the target program has 865k reachable blocks total.
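For context, "shared map implementations" here means maps with structural sharing, so clones are cheap copy-on-write snapshots rather than deep copies. A minimal illustration of the technique using the third-party im crate, purely as an assumed example rather than the change actually made in the compiler:

```rust
// Persistent maps (e.g. from the `im` crate) share structure between clones,
// so cloning a block's alias map is cheap instead of a full deep copy.
use im::OrdMap;

fn main() {
    let mut aliases: OrdMap<u32, u32> = OrdMap::new();
    for i in 0..100_000u32 {
        aliases.insert(i, i);
    }

    // With std::collections::BTreeMap this clone would copy all 100k entries;
    // with im::OrdMap the two maps share their internals until one is mutated.
    let snapshot = aliases.clone();
    aliases.insert(100_000, 0); // copy-on-write: only the touched path is duplicated

    assert_eq!(snapshot.len(), 100_000);
    assert_eq!(aliases.len(), 100_001);
}
```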

vezenovm commented 8 months ago

@onurinanc

Some extra ideas while we work on improving the amount of memory consumed:

  1. Take a look at using a recursive strategy to reduce the size of the program. Something like having each iteration of the miller_loop be proven and recursively verified.
  2. Look at where unconstrained functions could help optimize out complex logic.
  3. Reduce the loop count wherever possible, or replace loops outright. I know for testing we reduced the number of miller_loop iterations, but this reduces the security of the program. Perhaps we can look into replacing loops in general to help reduce the number of blocks that the SSA passes have to process.

TomAFrench commented 8 months ago

I've also noticed there are a few early returns on some methods (e.g. when). I haven't checked whether they're just shortcuts or a proper special case which can't be handled by the general-case logic, but if the former then we should remove them.

jfecher commented 8 months ago

In both the mem2reg pass and the unrolling pass, roughly 30% of the program's memory is spent here: https://github.com/noir-lang/noir/blob/master/compiler/noirc_evaluator/src/ssa/ir/dfg.rs#L329, in the values map, specifically for values that are the result of another instruction. We may be able to save quite a bit of space if we switch to a different representation for ValueId such that we can avoid saving result values in this Vec. E.g. enum ValueID { Normal(OldValueId), Result(InstructionId, /*index:*/ u32) }
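A small sketch of what that alternative representation could look like; the id types are invented placeholders, not the compiler's actual definitions:

```rust
// Hypothetical stand-ins for the SSA IR's existing id types.
#[derive(Copy, Clone, Debug)]
struct OldValueId(u32);
#[derive(Copy, Clone, Debug)]
struct InstructionId(u32);

// A result value is identified by its instruction plus result index, so the
// dfg would no longer need to store an entry in its values Vec for every result.
#[derive(Copy, Clone, Debug)]
enum ValueId {
    Normal(OldValueId),
    Result(InstructionId, /* result index */ u32),
}

fn nth_result(instruction: InstructionId, index: u32) -> ValueId {
    // No allocation or Vec push; the id is self-describing.
    ValueId::Result(instruction, index)
}

fn main() {
    let id = nth_result(InstructionId(7), 0);
    assert!(matches!(id, ValueId::Result(_, 0)));
    let _other = ValueId::Normal(OldValueId(3));
}
```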

dndll commented 7 months ago

Same issue here, with https://github.com/dndll/GhostFi/blob/don/prover/circuits/ed25519/src/lib.nr. I wasn't quite able to test this fully due to `Message: index out of bounds: the len is 4290 but the index is 4290`.

Last night this was taking over 555s on my machine; @jfecher's work in https://github.com/noir-lang/noir/pull/3572 on v0.19.4 has improved the time it takes to OOM from ~555s to ~6.5s.