privacy-scaling-explorations / acceleration-program

Accelerate Early Stage Programmable Cryptography Talents

Proposal: Porting real-world ML model(s) into ZK #39

Open only4sim opened 7 months ago

only4sim commented 7 months ago

Open Task RFP for Porting real-world ML model(s) into ZK

Project: Porting real-world ML model(s) into ZK

Executive Summary

Project Details

Preliminary results

Continuous Ranked Probability Score of pruned models.

I used my vacation time while preparing this proposal to run preliminary experiments and analysis, pruning Devin Anzelmo's solution, a multi-class XGBoost model with soft labels that estimates rainfall amounts accurately. The original model, however, aggregates five XGBoost models of 10,000 decision trees each, for a total of 50,000 decision trees. I tried Bonsai's hardware-accelerated zero-knowledge proving, where the proof limit is about 6 million and the proof time ranges from 7 to 10 minutes. A model this complex is extremely difficult to prove on a personal laptop. To address this, I pruned the model, reducing each of the five models to 300 decision trees, for a total of 1,500 decision trees. As shown in the figure, the loss in Continuous Ranked Probability Score was only 0.00002, less than 1/300.
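For readers unfamiliar with the metric, here is a minimal sketch of a discrete Continuous Ranked Probability Score, assuming the score is the mean squared difference between each predicted cumulative distribution and the step function of the observed rainfall, averaged over a fixed grid of thresholds. The function names and the small threshold grid are illustrative assumptions, not taken from the proposal or from the competition code.

```python
from typing import List, Sequence


def step_cdf(observed_mm: float, thresholds: Sequence[float]) -> List[float]:
    """Heaviside step: 1.0 at every threshold at or above the observed value."""
    return [1.0 if t >= observed_mm else 0.0 for t in thresholds]


def crps(
    predicted_cdfs: Sequence[Sequence[float]],
    observations: Sequence[float],
    thresholds: Sequence[float],
) -> float:
    """Mean squared gap between predicted CDFs and observed step functions."""
    total = 0.0
    for cdf, obs in zip(predicted_cdfs, observations):
        target = step_cdf(obs, thresholds)
        total += sum((p - h) ** 2 for p, h in zip(cdf, target))
    return total / (len(observations) * len(thresholds))


# Illustrative grid of rainfall thresholds in mm (an assumption for this sketch).
thresholds = [0, 1, 2, 3]
perfect = crps([[0.0, 1.0, 1.0, 1.0]], [1.0], thresholds)
hedged = crps([[0.5, 0.5, 1.0, 1.0]], [1.0], thresholds)
print(perfect, hedged)
```

Lower is better, and a prediction that puts all probability mass on the observed value scores zero. Comparing this score for the full and the pruned ensembles on the same held-out data is one way to quantify the accuracy cost of pruning.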

Qualifications

Administrative Details

Development Roadmap

Overview

Milestone 1: Data Analysis and Model Selection

Milestone 2: ZK Model Implementation

Milestone 3: Integration with Blockchain Oracles

Milestone 4: Performance Evaluation and Final Reporting

Reference

[1] Wendy Kan, Alex Kleeman, and Lakshmanan V. 2015. How Much Did It Rain? https://kaggle.com/competitions/how-much-did-it-rain

[2] Devin Anzelmo. 2015. First place code. https://www.kaggle.com/c/how-much-did-it-rain/discussion/16260. Accessed: 2023-11-20.

[3] Eli Ben-Sasson, Alessandro Chiesa, Daniel Genkin, Eran Tromer, and Madars Virza. 2013. SNARKs for C: Verifying Program Executions Succinctly and in Zero Knowledge. IACR Cryptol. ePrint Arch. 2013 (2013), 507.

[4] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016).

only4sim commented 7 months ago

Hey @socathie and @NOOMA-42 ! I've submitted my proposal. Are you available to review it? Looking forward to your feedback and suggestions :)

NOOMA-42 commented 6 months ago

@only4sim Update: Cathie will be able to review around mid-April; she's been quite busy these past two months.

only4sim commented 6 months ago

Hey @NOOMA-42,

Thanks for the update! I know Dr. Cathie has been very busy recently, and I'm also looking forward to seeing their new work. Mid-April is fine for me. Please take your time.

Best wishes, Li

socathie commented 5 months ago

Hi @only4sim, sorry for the delayed response. This paper came out shortly after your proposal: https://github.com/Modulus-Labs/Papers/blob/master/remainder-paper.pdf

I think there is no doubt that decision tree/forest is an important type of model to bring onchain. However, Remainder seems to have cracked Decision Forest with GKR, so I'm not too sure if it is worth the effort to rewrite it in other DSLs.

That being said, Remainder is not currently open-source, so there is still room to port an actual real-world model, along with oracle data, into zkML. Just wondering if this is something you want to take into account, and maybe modify your proposal in light of recent developments.

NOOMA-42 commented 4 months ago

@only4sim Do you have any updates?

only4sim commented 4 months ago

Hey @socathie and @NOOMA-42,

Thank you very much for sharing! The paper looks very interesting, and it also shows that this direction is getting more attention. Sorry it took me some time to understand what they do, though of course my understanding is still incomplete. I find the solutions they propose interesting, especially in terms of complexity optimization and parallelism, but the lack of open-source code makes it quite difficult for us to use them.

Here's what I'm thinking so far:

  1. One of our main tasks is to find scenarios, and corresponding solutions, where ZKML is suitable for on-chain use. Extending and enhancing rainfall data through ZKML is a representative example.
  2. In addition, we want to provide the community with new tools, rather than just solving one specific problem. We can approach this in two steps. First, a generalized converter can convert decision forests into different DSLs. Since decision forests do not rely on a large number of nonlinear operations as NNs do, the converter can be adapted relatively well to different DSLs, even if some of those DSLs offer only simple functionality. Second, once Modulus-Labs' work is open-sourced, we can try using their tools to prove our rainfall-evaluation decision forests, so that we not only further validate Modulus-Labs' work but also get a baseline.
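To make the converter idea in point 2 concrete, here is a minimal sketch of a backend-neutral decision forest representation, assuming fixed-point integers and a branch-free evaluation built from one comparison and one multiplexer per split. All names here (`SCALE`, `Leaf`, `Split`, `emit_expr`, and the `mux`/`lt` gadget names in the emitted string) are hypothetical and do not correspond to the syntax of any real DSL.

```python
from dataclasses import dataclass
from typing import List, Union

SCALE = 1000  # hypothetical fixed-point scale (3 decimal digits)


@dataclass
class Leaf:
    value: int  # leaf score, already scaled to an integer


@dataclass
class Split:
    feature: int    # index into the feature vector
    threshold: int  # scaled integer threshold
    left: "Node"    # selected when x[feature] < threshold
    right: "Node"   # selected otherwise


Node = Union[Leaf, Split]


def quantize(x: float) -> int:
    """Map a float to the fixed-point integer domain used by the circuit."""
    return round(x * SCALE)


def eval_tree(node: Node, x: List[int]) -> int:
    """Branch-free evaluation: both subtrees are computed, a bit selects one."""
    if isinstance(node, Leaf):
        return node.value
    bit = 1 if x[node.feature] < node.threshold else 0  # one comparison gadget
    return bit * eval_tree(node.left, x) + (1 - bit) * eval_tree(node.right, x)


def eval_forest(trees: List[Node], x: List[int]) -> int:
    """A boosted forest's raw score is the sum of its trees' outputs."""
    return sum(eval_tree(t, x) for t in trees)


def emit_expr(node: Node) -> str:
    """Render a tree as a generic expression a DSL backend could translate."""
    if isinstance(node, Leaf):
        return str(node.value)
    cond = f"lt(x[{node.feature}], {node.threshold})"
    return f"mux({cond}, {emit_expr(node.left)}, {emit_expr(node.right)})"


tree = Split(feature=0, threshold=2500, left=Leaf(100), right=Leaf(700))
print(eval_forest([tree, tree], [quantize(1.0)]))
print(emit_expr(tree))
```

The point of the sketch is that a target DSL only needs integer addition, multiplication, and a less-than comparison to host the forest, which is why a per-DSL backend could stay small.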

Of course, these are just my initial thoughts at the moment. I'm looking forward to your input as an expert in the field.

Best wishes :)

socathie commented 4 months ago

Love the second point. Perhaps an adapter across multiple DSLs would be a good OS contribution

only4sim commented 4 months ago

Hey @socathie! I agree. Thanks for your suggestion. I think it would be great to put more effort into adapters for forest-like models across different DSLs, which looks exciting.

Looking forward to your guidance and suggestions in the future.