o1-labs / o1js

TypeScript framework for zk-SNARKs and zkApps
https://docs.minaprotocol.com/en/zkapps/how-to-write-a-zkapp
Apache License 2.0
514 stars 116 forks source link

process hangs when verifying proof read&deserialized from file #1411

Closed Yoga2015 closed 2 weeks ago

Yoga2015 commented 9 months ago

Original issue at Discord: https://discord.com/channels/484437221055922177/1204071024191082496

All the code at https://github.com/TokeniZK/zkapp-test/tree/main/src tried with o1js version: 0.15.2, 0.15.4, & 0.16.0

Reproduce steps:

I have a demo zkProgram(https://github.com/TokeniZK/zkapp-test/blob/main/src/zkprogram.ts),

step1: npm run step1 (seen at /src/step1.ts) I exec one method & gen proof, then serialize the proof into file: proof.json. and then read&deserialize it from file and verify it and smoothly get the result: True.

However, step2: (seen at /src/step2.ts) npm run step2 I start another process to read the proof from the file on step1(proof.json), deserialize it and try to verify the proof, But the process keep await at verify(*) line !!

Yoga2015 commented 9 months ago

hi, @mitschabaude, could u please suggest me? thx

dfstio commented 9 months ago

Using compile({ forceRecompile: true }) fixes it

Yoga2015 commented 9 months ago

thx!

mitschabaude commented 9 months ago

Do I understand this correctly - you have to do compile({ forceRecompile: true }) every time you're loading the proof from a file?

Then it's still a bug, we do want this to work from cached verification keys. I'll reopen this to track it if you don't mind

dfstio commented 9 months ago

Yes, it is correct. I've created a small repo with my usual settings that work for me in MinaNFT: https://github.com/dfstio/zkapp-test/tree/dfst (branch dfst) based on @Yoga2015 code

@mitschabaude, you can use it for debugging

See also https://discord.com/channels/484437221055922177/1203303039633465344 https://discord.com/channels/484437221055922177/1204071024191082496

SpaceManiac commented 3 months ago

Running in to the same problem with the dynamic verify method. It seems the workaround is to either forceRecompile: true or cache: Cache.None any program at all before attempting to verify proofs. It doesn't have to be the same program you're verifying proofs for.

Single-file repro. Will succeed on first run, hang on second run, and can be fixed by changing false to true on line 4.

import { Bool, Cache, verify, ZkProgram } from 'o1js';

//  v-- Change to `true` to fix the hang, even though NoopProgram != VerifyMeProgram.
if (false) {
  const NoopProgram = ZkProgram({
    name: 'NoopProgram',
    methods: {
      chaff: {
        privateInputs: [],
        async method(b) {}
      }
    }
  });
  await NoopProgram.compile({ cache: Cache.None });
}

const inputs = await (typeof getPreviousProof !== 'undefined' ? getPreviousProof : createProof)();
console.log('verifying...');
console.log('ok?', await verify(inputs.proof, inputs.verificationKey));

async function createProof() {
  const VerifyMeProgram = ZkProgram({
    name: 'NoopProgram',

    methods: {
      proof: {
        privateInputs: [Bool],
        async method(b) {
          b.assertTrue();
        },
      },
    },
  });

  console.log('compiling...');
  const { verificationKey } = await VerifyMeProgram.compile();
  console.log('proving...');
  const proof = (await VerifyMeProgram.proof(Bool(true))).toJSON();
  const data = { verificationKey, proof };
  const fs = await import('fs/promises');
  const url = await import('url');
  await fs.appendFile(url.fileURLToPath(import.meta.url), `async function getPreviousProof() { return ${JSON.stringify(data)}; }`);
  return data;
}
mitschabaude commented 3 months ago

Thanks for the great analysis @SpaceManiac

Seems that 'verify()' relies on some values computed during compilation, and isn't able to get them from a cache. We need to fix this!

45930 commented 3 months ago

@Yoga2015 I cloned your repo zkapp-test and upgraded to o1js 1.6.0 and found that both scripts work

➜  zkapp-test git:(main) ✗ npm run step1

> zkapp-test@1.0.0 step1
> tsc -p tsconfig.json --module esnext && cross-env TEST_ON_BERKELEY=true node --experimental-specifier-resolution=node ./dist/src/step1.js

ProverX_VK: 4245952421456887355009860533252876175165009247779780553775848711375535231266
ok: true
proofStr: {"publicInput":[],"publicOutput":["0"],"maxProofsVerified":0,"proof":"KChzdGF0ZW1lbn.....yMUFGMTEzNzUpKSkpKSkp"}
ok2: true
➜  zkapp-test git:(main) ✗ npm run step2

> zkapp-test@1.0.0 step2
> tsc -p tsconfig.json --module esnext && cross-env TEST_ON_BERKELEY=true node --experimental-specifier-resolution=node ./dist/src/step2.js

ProverX_VK: 4245952421456887355009860533252876175165009247779780553775848711375535231266
str2: {"publicInput":[],"publicOutput":["0"],"maxProofsVerified":0,"proof":"KChzdGF0Z....EY4RjUpKSkpKSkp"}
ok2: true
SpaceManiac commented 3 months ago

I still see the hang on o1js 1.6.0. Stack trace for the record:

global.$std::sys::wasm::locks::futex_condvar::Condvar::wait::h28b31f59c39a18e1 (011a8aae:1926773)
global.$rayon_core::latch::LockLatch::wait_and_reset::h078cf456c369aa7d (011a8aae:1849956)
global.$std::thread::local::LocalKey<T>::with::h2ce6fa65e023e5b9 (011a8aae:1789370)
global.$rayon_core::registry::Registry::in_worker::h1abee179cbaab566 (011a8aae:1397531)
global.$caml_pasta_fq_plonk_proof_batch_verify (011a8aae:1843699)
module.exports.caml_pasta_fq_plonk_proof_batch_verify (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/plonk_wasm.cjs:340)
caml_pasta_fq_plonk_proof_batch_verify (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:4577)
<anonymous> (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:169582)
Promise.then (Unknown Source:0)
deferred_run (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:4882)
batch_verify$0 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:169580)
caml_call2 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:6803)
batch_verify (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:159208)
batch_verify$0 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:159212)
caml_call1 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:6801)
_iIy_ (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:349923)
<anonymous> (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:4855)
Promise.then (Unknown Source:0)
deferred_bind (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:4853)
caml_call2 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:6803)
verify_heterogenous (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:349950)
<anonymous> (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:355112)
caml_call1 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:6801)
with_return (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:78372)
verify_promise$0 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:355110)
caml_call2 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:6803)
verify$2 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:431486)
<anonymous> (/PROJECT/node_modules/o1js/dist/node/lib/proof-system/zkprogram.js:318)
withThreadPool (/PROJECT/node_modules/o1js/dist/node/bindings/js/node/node-backend.js:48)
45930 commented 3 months ago

Running Yoga's original example with forceRecompile: false still hangs as well. They must have updated to using forceRecompile: true after that workaround was discovered.

@SpaceManiac in your snippet getPreviousProof is referenced but not defined. Is there a declaration missing?

SpaceManiac commented 3 months ago

createProof() appends its definition to the file after the first run (perhaps too cute...)

45930 commented 2 months ago

Whatever this bug is, it seems deeply lodged in the Pickles code. Here's what I found...

There are 2 cases I looked at (mostly all reproducing what @SpaceManiac has already reported).

  1. Verifying a serialized JSON proof with a serialized verification key without compiling anything first
  2. Verifying a serialized JSON proof with a serialized verification key after compiling a random zk program circuit (not the same circuit that we are verifying).

The expected behavior is that these cases both work, but in the first case, the verification call hangs forever and never returns a boolean. The second case works as expected.

o1js

In the example, we call the verify function exported in https://github.com/o1-labs/o1js/blob/0f57e36ce753c594c1c018ae2d4c677868a11864/src/lib/proof-system/zkprogram.ts#L475

In both cases, the proof we pass is in a string, so we hit the same code path. I've logged the proof and verification key in o1js from each case, and they are the exact same.

ocaml

Entering the Pickles layer, the call to verify happens in https://github.com/o1-labs/o1js-bindings/blob/8e99acc94da892eeb090367844f85d22ecf4f93b/ocaml/lib/pickles_bindings.ml#L713

In each case, we get passed the deserialization step and enter the async code:

  Pickles.Side_loaded.verify_promise ~typ [ (vk, statement, proof) ]
  |> Promise.map ~f:(fun x -> Js.bool (Or_error.is_ok x))
  |> Promise_js_helpers.to_js

That is as far as I can trace it for now.

dfstio commented 2 months ago

The expected behavior is that these cases both work, but in the first case, the verification call hangs forever and never returns a boolean. The second case works as expected.

Would this behavior change if you would call initializeBindings() first?

mitschabaude commented 2 months ago

The expected behavior is that these cases both work, but in the first case, the verification call hangs forever and never returns a boolean. The second case works as expected.

Would this behavior change if you would call initializeBindings() first?

can't be, verify() calls that as well right at the beginning

mitschabaude commented 2 months ago
  • Verifying a serialized JSON proof with a serialized verification key without compiling anything first
  • Verifying a serialized JSON proof with a serialized verification key after compiling a random zk program circuit (not the same circuit that we are verifying).

The expected behavior is that these cases both work, but in the first case, the verification call hangs forever and never returns a boolean. The second case works as expected.

this is great progress! so it doesn't depend on any circuit-specific artifact

That is as far as I can trace it for now.

we will probably only resolve this issue by digging into the Ocaml/Rust verification code and finding out what implicit assumptions it makes on objects that exist

And we have a list of objects that are likely contenders: every large object (i.e. subject to caching instead of always computing on the fly) that is shared among different circuits/programs. Could be the SRS?

dfstio commented 2 months ago

Could be the SRS?

In my example, if I remove the following files from the cache:

wrap-pk-proverx
wrap-pk-proverx.header
wrap-vk-proverx
wrap-vk-proverx.header

verify() runs successfully. If I keep those files in the cache folder, verify() hangs

mitschabaude commented 2 months ago

If I keep those files in the cache folder, verify() hangs

that is when you compile before verifying right? my explanation for that would be that if those are not present, compile() "fully" recompiles, i.e. same behaviour as forceRecompile: true, and in the process also generates those other objects implicitly needed by verify()

dfstio commented 2 months ago

that is when you compile before verifying right?

Yes, that's right. Removing any other files does not affect the behavior; the process still hangs.

45930 commented 2 months ago

we will probably only resolve this issue by digging into the Ocaml/Rust verification code and finding out what implicit assumptions it makes on objects that exist

@mitschabaude it looks like the verification code actually lives in the mina repo. Is that right? For instance, o1js-bindings calls Pickles.Side_loaded.verify_promise, but the implementation of that function is in mina in src/lib/pickles/pickles.ml. So would it make sense to open a new issue in that repo and link back to this one?

mitschabaude commented 2 months ago

@mitschabaude it looks like the verification code actually lives in the mina repo. Is that right? For instance, o1js-bindings calls Pickles.Side_loaded.verify_promise, but the implementation of that function is in mina in src/lib/pickles/pickles.ml. So would it make sense to open a new issue in that repo and link back to this one?

yes it lives in the mina repo! no opinion on where to host the issue (it could also be that it's somehow caused by o1js behaviour of caching the SRS, which lives in the bindings repo

andrewferrone commented 1 month ago

Hey here, is there a workaround for this issue? Hard to tell from the comment thread. Many thanks.

45930 commented 1 month ago

@andrewferrone, the best workaround at the moment is to compile an empty zk program, then use a cached proof from disk.

For instance

const NoopProgram = ZkProgram({
  name: 'NoopProgram',
  methods: {
    chaff: {
      privateInputs: [],
      async method(b) {}
    }
  }
});
await NoopProgram.compile({ cache: Cache.None });
SebastienGllmt commented 1 month ago

This is not a realistic workaround if you're trying to ship an NPM package that others use. I don't want to include a disclaimer in my NPM package that if downloading the package breaks your build, you have to compile an empty zk program to fix it.

45930 commented 1 month ago

Fair enough. We are working on this and prioritizing the fix. The work around should unblock many users, but I agree it’s not a suitable solution for your use case.

mrmr1993 commented 3 weeks ago

We're not pre-generating the langrange basis that we need before running the verifier. When we find it missing, the rust code panics, and WASM treats the panic as a halt. The main thread hangs waiting for the dead thread.

Will post a fix soon.

mrmr1993 commented 3 weeks ago

@45930 Fix here: https://github.com/o1-labs/o1js/pull/1857. Can you chase the right reviewers etc. to get it (and the corresponding PRs for the Mina, bindings, proof-systems submodules) merged?

45930 commented 3 weeks ago

@mrmr1993 I'll do my best :) thanks for pitching in!

45930 commented 2 weeks ago

Hey @SebastienGllmt, @andrewferrone, @kadirchan and others. We just released o1js version 1.9.0 which should fix this issue! Please re-open this issue if it is still not working for you :)