tronprotocol / java-tron

Java implementation of the Tron whitepaper
GNU Lesser General Public License v3.0
3.75k stars 1.41k forks source link

Slow precompiles for alt_bn128 curve #5492

Open r0wdy1 opened 1 year ago

r0wdy1 commented 1 year ago

Intro

Timeout restrictions on Tron have previously prevented the ability to deploy certain zkSNARK-based applications. This issue has been identified in the past and the root problem has not yet been addressed ( https://github.com/tronprotocol/java-tron/issues/4311).

In this prs, we directly address this issue and propose an implementation that solves it ( https://github.com/zkBob/java-tron/pull/1,https://github.com/zkBob/zksnark-java-sdk/pull/1 ). By optimizing the bn128 precompiles, zk-based apps and zkSNARKs verification processes become usable on Tron.

It’s important to note that these changes will not only enable the zkBob application on Tron, but also a whole class of applications and protocols based on elliptic curve cryptography and pairings in particular. The possibilities brought by ZKP are not limited to privacy focused protocols, some other examples include zk based auth for AA wallets, light clients, zk based bridges, computational integrity enforcing contracts, and many others.

System information

java-tron version: java -jar FullNode.jar -v OS & Version: Linux Branch: master

Expected behaviour

The verifyProof method of a contract for checking pairing over alt_bn128 curve as a part of Groth16 proof system which includes 6 point addition, 5 point multiplication and pairing check itself gives an appropriate result given a valid input.

Actual behaviour

A contract fails to return the result due to timeout error because of an unoptimized implementation of the bn128 precompiles (Multiplication , Addition , Pairing)

Steps to reproduce the behaviour

****Reproducing on private network****

  1. Create a private network using official docs
  2. Deploy a contract to the network using tronbox
  3. Call the contract using a tron box with any parameters, here’s an example from one of successful transaction on Polygon
let result = await contract.verifyProof(
        [
            "410187940904267791665857625594021464304867517571189176088693585418403627175", 
            "3582932279793408241095145848517164196468279227198151416866003392249976328215", 
            "21702782062704102408149255137659156834040101148294719406156063531394617768752",
            "458964024542843322412024634467191796783627388968596258037504", 
            "1105154793589499054101509063059520053678357711567632030398761789118900390304"
        ], [
            "17319632901362089667297833381115050472198619632678725143485052131880736564716", 
            "20217952180782984327874966839451172866880046964759060634104328470666346490787", 
            "14675072293281706791212610449476963932780136199834552157253791069835791205313", 
            "11523869594945451880407802780789187832894590986555492595351670286111658956891", 
            "3618836117018041122057960110621015024190242066995639123523823804179020820127", 
            "17452347979391447598200144953804022165728450595367847105208004839301667143347", 
            "10758418988168490454233654079316113438289985487422580576512164625214747833303", 
            "16113585132297177719187882899369575704186199591529360947329945124938113973184"
        ]
    ).call();

It can be checked on Polygonscan (https://polygonscan.com/address/0xa86c511832ead78d30ad49711874a9f3a1dfb840#readContract) , with the same arguments: [410187940904267791665857625594021464304867517571189176088693585418403627175,3582932279793408241095145848517164196468279227198151416866003392249976328215,21702782062704102408149255137659156834040101148294719406156063531394617768752,458964024542843322412024634467191796783627388968596258037504,1105154793589499054101509063059520053678357711567632030398761789118900390304] [17319632901362089667297833381115050472198619632678725143485052131880736564716,20217952180782984327874966839451172866880046964759060634104328470666346490787,14675072293281706791212610449476963932780136199834552157253791069835791205313,11523869594945451880407802780789187832894590986555492595351670286111658956891,3618836117018041122057960110621015024190242066995639123523823804179020820127,17452347979391447598200144953804022165728450595367847105208004839301667143347,10758418988168490454233654079316113438289985487422580576512164625214747833303,16113585132297177719187882899369575704186199591529360947329945124938113973184]

We have created a dedicated repo for this particular purpose: https://github.com/zkBob/tron-local . It contains both a script to deploy contract and to call the verifyProof method.

  1. Place FullNode.jar into bin folder
  2. ./scripts/run.sh
  3. ./scripts/deploy.sh
  4. set contract address from deploy output to TRANSFER_VERIFIER_ADDRESS in .env
  5. ./scripts/verify.sh returns class org.tron.core.vm.program.Program$OutOfTimeException : CPU timeout for 'SWAP1' operation executing

Reproducing on testnet The same contract is also deployed on Shasta and returns the same error : https://shasta.tronscan.org/#/contract/TPf7skGEB8h7VPSbd2S1CfxjWSJfPKHpQG/code

Backtrace

org.tron.core.vm.program.Program$OutOfTimeException : CPU timeout for 'SWAP1' operation executing
tomatoishealthy commented 1 year ago

That's a great job! Looking forward to further progress.

@yanghang8612 @Federico2014

yanghang8612 commented 1 year ago

Thanks for your work, we are reviewing the code. Any progress will be synced here.

yuekun0707 commented 1 year ago

Very glad to see this work! I have two questions:

  1. After optimizing the bn128 precompiles, how long will one this transaction cost normaolly?
  2. According to https://github.com/tronprotocol/java-tron/issues/4311, maybe one transaction costs 120ms before optimizing, thus if we increase the maximum time limitation of TVM to longer than 120ms, will this solve the problem here?
CarlChaoCarl commented 1 year ago

Based on some research, we can discuss whether increasing the duration of individual transactions to avoid operation execution timeouts can solve this issue.

r0wdy1 commented 1 year ago

@yuekun0707 @CarlChaoCarl You can see benchmarks here: https://github.com/tronprotocol/java-tron/pull/5507 Copying it here The average time of operations before (Intel(R) Core(TM) i7-10750H CPU, 32 GB RAM):

BN128Addition: 66387 ns
BN128Multiplication: 3553350 ns
BN128Pairing (10 pairs): 101565419 ns

The average time of operations after:

BN128Addition: 11576 ns
BN128Multiplication: 181301 ns
BN128Pairing (10 pairs): 3285601 ns

So the gain is 6x - 30x for different ops, whereas BN128Pairing gains even more when more pairs are checked

It's really hard for us to give any useful and reasonable opinion on possible timeout values, because

  1. it is strongly dependent on the node technical specs
  2. curve operations are used as building blocks for vastly different applications. It seems to me that any proposed value could still become restricting for some applications out there. This PR was not meant to enable just ZkBob protocol but a more broad class of applications based on ZkSnarks

So despite the fact that timeout increase seems to be an easier/faster solution for ZkBob I would say that it should have a lesser priority than optimizations and should be considered as last resort But that of course is left for Tron team to decide.

yuekun0707 commented 1 year ago

@yuekun0707 @CarlChaoCarl You can see benchmarks here: #5507 Copying it here The average time of operations before (Intel(R) Core(TM) i7-10750H CPU, 32 GB RAM):

BN128Addition: 66387 ns
BN128Multiplication: 3553350 ns
BN128Pairing (10 pairs): 101565419 ns

The average time of operations after:

BN128Addition: 11576 ns
BN128Multiplication: 181301 ns
BN128Pairing (10 pairs): 3285601 ns

So the gain is 6x - 30x for different ops, whereas BN128Pairing gains even more when more pairs are checked

It's really hard for us to give any useful and reasonable opinion on possible timeout values, because

  1. it is strongly dependent on the node technical specs
  2. curve operations are used as building blocks for vastly different applications. It seems to me that any proposed value could still become restricting for some applications out there. This PR was not meant to enable just ZkBob protocol but a more broad class of applications based on ZkSnarks

So despite the fact that timeout increase seems to be an easier/faster solution for ZkBob I would say that it should have a lesser priority than optimizations and should be considered as last resort But that of course is left for Tron team to decide.

@r0wdy1 thanks very much for your information, the statistical data shows great improvement! Thus, after the optimization and the test, the limit of the maximum time limitation of TVM which is 80ms now is enough for these Zk transactions mostly? If yes, we realy don't need to increase it recently.

r0wdy1 commented 1 year ago

Thus, after the optimization and the test, the limit of the maximum time limitation of TVM which is 80ms now is enough for these Zk transactions mostly? If yes, we realy don't need to increase it recently.

Yes current timeout would be sufficient to execute multiple (15-20) pairings with a reasonable amount of pairs in a single transaction, whereas even a single one used to cause timeout but that off course depends on particular node specs.

yuekun0707 commented 1 year ago

Thus, after the optimization and the test, the limit of the maximum time limitation of TVM which is 80ms now is enough for these Zk transactions mostly? If yes, we realy don't need to increase it recently.

Yes current timeout would be sufficient to execute multiple (15-20) pairings with a reasonable amount of pairs in a single transaction, whereas even a single one used to cause timeout but that off course depends on particular node specs.

@r0wdy1 Great! Thanks for your replay, I have no more questions~

yuekun0707 commented 1 year ago

@r0wdy1 Hi, the Tron team have modified the maximum execution time of one transaction to 350ms for the NILE Testnet. https://nile.tronscan.org/#/ Cloud you please try the ZK transactions test there?

r0wdy1 commented 1 year ago

@yuekun0707 we're currently having minor issues with deploying our contracts on Nile for some reason, doing our best to fix it asap. we will probably test it later today or tomorrow

r0wdy1 commented 1 year ago

@yuekun0707 we have tested several dozens of transactions , the problem hasn't occurred even once. is it possible to get logs for these hashes

21d484711fb51dcab80f440ebaae53791e630d1a77497527a02b53740d969136
7baced07e2b5d4e09d2f1c9a03a67ac74b48ee93a42d67877834cc5156b4e6b0
6f5f32f55c5e29669be38fb81dc9b04e3e81b4b4324181be6967f75fb417a989
2dca90ef88b5146abf744d701b41bebcfd2d43d8122a30430dafa41d2f432f07
54a14564c3e5f28ef7d374affd40b0c90dfc2ec0da01423bdb1b478d2814e661
8eab410c29d53eb0819bb982fb7abf768e14f487419782b2e399d8a9cf58c521
934dc6be768907f828f1b11e148292550b36ba66a2a941b2d175b94a9b4045e2
732ca40e7c5a083c4401867160e7f71675417101cc07050be490e6a12938ad6d
3ad28a1a9d8f544cec8c46eba9c0384bf49e94efc0502d8d42890cb68223bd36
44595113ed222ee1f66c81bf0c30e929fefb748f1b84da37fc6d8a6a158ccbe6
7b6ad2f8e6032ebc32beb3c191f6b697623a23520db73baf02608f0b16a8163a
9b749e86d68e973b60cf58bfa71c78b86b1fa7a79b4e6208886ec7d6837e1477
91ba4cb7def75a79ee597117ee5aea5d333294cfc8eea25b49412cf5a19f2833
e1af7800e7ed0a3c1bd443725cb72245bc987d544fb7f1824ec9b75f68b2f5f5
b4eb64a8cc753ab295468b33fe9c94459c3658eb541caab23fd0b99bede70417
19dbf9aae8b8addd05f787e9c975c95b69c693566e9fe38508d640423c15dbd5
2761f8990827a2e6e505996ae4af47f5118525ddcfdce23aa887dd9e989ccb2f
59ef07dec00700ea270f47fc10f2d2735198c7a65c62d2b1cc636240bbfe1f40
46ba0ebb86c2dd86f29567a2b9825b51eab8aabc93fd4c0a21e3cc12f89c06a2
4bfb4ecb5ac6fb6f11968791f8605519c37272610b57d9c299d1b28074f4ef4a
a74caedfafe1ad3ac73070659cbb69bfcbe0ad72a6309f34ae8b53be94afb678
b3310d782baf31f0e2499408cc2310ec999c8a58c63fdc4bfb6987042eb534c1
374e3f4a6a1bee62cf94a7139af966a18f4a626c3f8f54d0d36f52fe3b588340

AFAIK there's supposed to be some log message for operations that take more than 300 ms. That could be quite useful to estimate the distribution parameters The risk that we see and that we've mentioned in previous discussions is that operations like that require significant computational resources and it is unclear how it guaranteed that all of SR have at minimal or recommended configuration. We have intentionally tested precompiles using a machine that is roughly 2 times slower than the minimum config and timeouts occurred in almost half of the cases with Nile the situation is obviously better but it's still interesting to see the actual stats

yuekun0707 commented 1 year ago

@r0wdy1 thanks for your tests, we will check the log then~ Cloud you please provide the way to do these tests? We want to do more later to check the performance.

The risk that we see and that we've mentioned in previous discussions is that operations like that require significant computational resources and it is unclear how it guaranteed that all of SR have at minimal or recommended configuration. I will talk with the team about this, confirm the SR's situation~

We have intentionally tested precompiles using a machine that is roughly 2 times slower than the minimum config and timeouts occurred in almost half of the cases This is great! The team is evaluating the risk of adding JNI, maybe after they have a conclusion, we can move forward~

r0wdy1 commented 1 year ago

Cloud you please provide the way to do these tests?

we will probably have an extended guide for public tests but in the meantime here's our staging environment which is a copy of production app. There still could be some glitches but overall it works , you would need tron link some USDT and TRX in order to make a deposit. Every deposit , transfer of withdrawal uses precompiles

image

The history tab shows tx hashes.

We also use a simplified CLI-like version of UI which allows to quickly do the same transaction again or make multiple transactions with a single command

image

It doesn't support tronlink , just injected tronweb provider, so generate a new account with a fresh seed phrase or use some known seed phrase to access an existing account , just be sure to have some USDT and TRX. The easiest way is to use get-address and just send funds directly from your wallet. The TRX balance can be checked using get-balance , USDT balance can be checked using get-token-balance. The unit of account (Wei) is somewhat confusing but it's just result of copy-paste, basically it means the smallest available unit for chosen token/network deposit-shielded ^1 makes a deposit from the wallet , "^" means just "treat every number as a whole number of tokens instead of the minimal unit" , so it just makes a deposit of 1 USDC. get-shielded-balance shows balance in the Pool history shows list of all transactions with links to tronscan

image

Using console it's very easy to make multiple transaction, for example here's how to send 10 transaction to yourself :

  1. 'gen-shielded-address` gives you a zk address inside the pool
  2. transfer-shielded [address] ^0.1 10 makes 10 transfers to the specified address each of 0.1 USDT image
lxcmyf commented 1 year ago

@yuekun0707 we have tested several dozens of transactions , the problem hasn't occurred even once. is it possible to get logs for these hashes

21d484711fb51dcab80f440ebaae53791e630d1a77497527a02b53740d969136
7baced07e2b5d4e09d2f1c9a03a67ac74b48ee93a42d67877834cc5156b4e6b0
6f5f32f55c5e29669be38fb81dc9b04e3e81b4b4324181be6967f75fb417a989
2dca90ef88b5146abf744d701b41bebcfd2d43d8122a30430dafa41d2f432f07
54a14564c3e5f28ef7d374affd40b0c90dfc2ec0da01423bdb1b478d2814e661
8eab410c29d53eb0819bb982fb7abf768e14f487419782b2e399d8a9cf58c521
934dc6be768907f828f1b11e148292550b36ba66a2a941b2d175b94a9b4045e2
732ca40e7c5a083c4401867160e7f71675417101cc07050be490e6a12938ad6d
3ad28a1a9d8f544cec8c46eba9c0384bf49e94efc0502d8d42890cb68223bd36
44595113ed222ee1f66c81bf0c30e929fefb748f1b84da37fc6d8a6a158ccbe6
7b6ad2f8e6032ebc32beb3c191f6b697623a23520db73baf02608f0b16a8163a
9b749e86d68e973b60cf58bfa71c78b86b1fa7a79b4e6208886ec7d6837e1477
91ba4cb7def75a79ee597117ee5aea5d333294cfc8eea25b49412cf5a19f2833
e1af7800e7ed0a3c1bd443725cb72245bc987d544fb7f1824ec9b75f68b2f5f5
b4eb64a8cc753ab295468b33fe9c94459c3658eb541caab23fd0b99bede70417
19dbf9aae8b8addd05f787e9c975c95b69c693566e9fe38508d640423c15dbd5
2761f8990827a2e6e505996ae4af47f5118525ddcfdce23aa887dd9e989ccb2f
59ef07dec00700ea270f47fc10f2d2735198c7a65c62d2b1cc636240bbfe1f40
46ba0ebb86c2dd86f29567a2b9825b51eab8aabc93fd4c0a21e3cc12f89c06a2
4bfb4ecb5ac6fb6f11968791f8605519c37272610b57d9c299d1b28074f4ef4a
a74caedfafe1ad3ac73070659cbb69bfcbe0ad72a6309f34ae8b53be94afb678
b3310d782baf31f0e2499408cc2310ec999c8a58c63fdc4bfb6987042eb534c1
374e3f4a6a1bee62cf94a7139af966a18f4a626c3f8f54d0d36f52fe3b588340

AFAIK there's supposed to be some log message for operations that take more than 300 ms. That could be quite useful to estimate the distribution parameters The risk that we see and that we've mentioned in previous discussions is that operations like that require significant computational resources and it is unclear how it guaranteed that all of SR have at minimal or recommended configuration. We have intentionally tested precompiles using a machine that is roughly 2 times slower than the minimum config and timeouts occurred in almost half of the cases with Nile the situation is obviously better but it's still interesting to see the actual stats

@r0wdy1 How did these timeouts occur? You pulled the Nile branch and deployed a private chain on a low configuration machine, right? Then deploy the contract on this private chain for testing and discover a timeout situation? How long have you set the timeout period for transactions on this private chain?

AllFi commented 1 year ago

Hello, @lxcmyf!

Yes, it is basically correct. The only differences are that: 1) We pulled the develop branch (https://github.com/tronprotocol/java-tron/commit/8de354f8e389dfc989d247818d11d884121d625e) since we started local testing before this comment (https://github.com/tronprotocol/java-tron/issues/5492#issuecomment-1784473064) 2) We modified the code a bit to skip timeout checks

We executed approximately 100 transactions and received the following stats:

If we assume that the timeout is set to 350 ms we can estimate that roughly the half of these transacations will be reverted due to the timeout on this (relatively weak) configuration.

Testing on the Nile looks promising since we didn't catch any timeout yet. But it will be great to understand how close are we to the timeout to grasp the situation completely.

lxcmyf commented 1 year ago

@AllFi Can you send a large number of transactions through your deployed contracts on the current Nile testing network, which is equivalent to pressure testing? Let's observe and see if there will be a timeout situation.

AllFi commented 1 year ago

I sent 60 transactions to our contract. It looks like there weren't any timeout situations. Here are the transactions' hashes:

966af36c4a2a76f594c025da8ed6e923189f51b088dc757f3ac87a884ecf3f7f
793285ae48b0c93345d8fca739df34bd23d50793a9239ebc8c8f42bf71a9517d
5f95772a558020f59287b39f695f29f63fb0e139a89213e745dd15f03ba5ce52
96b79c497049fd0a882e3ea8223f03909e380048be7110932e66290aced40fb3
d8a2135985f00d4ba922c2dea47f43bb8a279af69985c205e8a3aa0f1617ce3e
60cbf9a72cebf391e19893137a9dea0a530231610161c6b6e86c8f905b881649
663b566bd80117c7d8406be05c10ec9c7f009154e27b7a9e1efab41682b05b6e
597d5b355c79f7d15eab14238639c24fdb8365600a101be62a4445049fa1a093
daaa67d9933241dcf20c7e2e972f1cdc2de917b4eac207379e68c4fb3aa0cf5e
e35b36243e39b96912929f67dbf8f370f300a588bb4d8d7e804de8b1742ecaa4
8b7b7f0fda09c4498b126c47dfc21f30466e196c65b021974c8aa6876c735a32
cf5f5751ead35488c00b45172bf3196438f041674d6f6397410442f8523cc19c
8077e97af726259115e965954355a4ede0d8599a1566eed22ffda3714d9129c7
34473ebd65d9da22d41b0a2e502130cd01c6e44cffaefe8c4921e0d617fbd44d
4a52026d534689035dd5eedb3bdaa859f8b52cc2a99b2436772b8a385fe4051d
3cbba7aeb457382b236aaae4474c5b30a24841ad93765bf3b1434dbb582f30ad
a21f693e968a2ad6d81e648ef0faea304008ae9d789f4bb2bb94d707dc9e4038
d98447c30ba9b3a93539fa7d84559e42dce13d287df5accd9f3bfad07a4ad43c
3868b056cd71d6348fd06157e47dda42e766cadf7f63303e916a24ea112b0165
c62bb5af4207b1b6f1babcb65917a75926f9aee8a696d77e5daf2420f59679b1
022d8f9a4df2718ef9a3fc7d40dde4ab6c6b1740ad9534889329915fc09f71c1
4a79f0fc277a488332b3c3f0d6a3b9485eea11789ef328c80fce2733fdb87d79
bd3a9b3a159c2336222f40d7be9a106c7d84028463830071059eb1e2c90d0199
6df4a465a72fcb95153cfcee866f8613beff28509b21b69feff4cae8e6e520f9
9b7d80f6f6e01ef7eabef3e56c89c04b95a6111259479b21e6d131efd20de456
74f3342e7e32e1d24146f54a5d2d9000681a62550487de716292a149ca0e6880
746373cdce37013369de7a6b587949ad41faf6b72af4c82903b660d0c7e4edd2
f55cd80960e9abcaa989296b40829e90046eba8190b396697de8dc9035eef65a
4c541edaf422cbe4a92ad38d9b9e4e5879594e671167b6938d8a76322d9b8f23
a3fd952269d865a1d2bc973f11a8c290ee511e17c97fb6b8a102e696399b3ee9
eca76be7d2bb604cdc85e542a73f18628d81c96d2172beb3ed3d70d0e6d300a8
a5ebf3d1957c677de5ebe22133beaf10ee24966435f80a2c68d368ea4b756339
f269c2d82bab4c8ea7a74731e5fbd7c3f93ba84b5702953ec5bc07c9a84a4c48
762026234b25de544adb36c3d1c7fcaf6f0b67c5c11208345b189c8ec09c8219
4e0adb8e39fe7de2ff6f533688e71dbbd7c69bba13b6affb07a9bea0c69b6e20
dff357334fcb3d6a363099f5e240d0d8b4600798e0180355b2bed3f854ef920a
03c38ac212491efb6bc8d2a9585753c5e5c842b93666da94a9aab0177991b83a
a9431913da4bc9c35a3e8eb69725b4645daead950fb2f0ec71b8ffe80f085180
627439e69a5e901f41d3d6d40c7d5d0d3cd1c4756afdc19decfac09ed23716e7
2d1a578e604adc617b4429ee5300f627de15e1e6a1485955dc12b4bba854848f
ce72a1c07c9adf3f499a10cd2ae528e80bb5e481e0b46d66f64003494826166f
96d884cd0681e897389a784c6fa62a52cec20c2e9051afe330ce36b7c9b72e00
15898c64e2e893495130407ada0be688a2e49410dd56838fd13bd7dc6f1c7b10
4d991113e7c6ed310715ec39183aa24ab79686d83c4a3eb6bc9c10b20cd590b8
47bfa884c457f31ea86e5567e9983932ea76d04a9c2589c63991573e7e44504b
ada1bf11287257c10ed513f167b09f47a616f394fd3471053f41826003ff8d50
d35313f97d2758f7453e540c90c5b22073ea0b3965e433a48e8ea01f03e2045e
ec41c43fff000f1e62d8cf5dab4cda25aab44fed81feef68c5e56d092c61616a
ba94dc92043a40d4907fd4bf10f8b22df9f0641620cbcba5c0d6fca28b601c62
c69dcee48b1f65c1dacf87487e00275930242931c708da17e729b2e143df2432
5c7de412befe49c5fa3b8a366759378a0a6b5be07e2bdce0799cd8a64c954a18
bc2d9af9f13798035bf36a766ee32018c0b9ebd6e2f037d0d2002c628fa87b57
d1d8b6acc388ee712ac673bb9bf40255d5a62562be345f0f61aee2b96baebbc3
111fc2ba352f16a3f2334d88f0be2ccf778f2b02e2056f45728c805c139b40f7
ed8d203c6add940163b556a522c7e758adbf1d83dc5585b3b355eba54b47d63d
2c98a183d6c453663bb569e41774d68e6ac06579e3cc7e27a938b4b4de54e87e
c7ddc0ce00b69bc0cb5f56eaae05c916bb3aa7e392e81365432cdc8d11dc3fc3
fe1a689de1094a721830d8956f4c13ed587116d0c41b4a76c3110e0222b07acc
29414dab24c5687783f7b10858de91661d1b624a9d927c68db89dfa2d49bafc0
190c6376de67db9758d83f77307dcd5bc341c5373c401debcec195c2674c5d2c

Our transactions are quite expensive in terms of energy. 5000 TRX from the faucet is enough for about 20 txs so it is not very convenient. Is it possible to acquire more TRX for testing on this address: TCHtKJnxVULFsQ8RFQHev74a7gsWfgcq4L?

lxcmyf commented 1 year ago

@AllFi Okay, we will recharge this address with 3000000trx, which can send approximately 12000 transactions.

r0wdy1 commented 1 year ago

We've sent 300 more tx to Nile , no timeouts so far

r0wdy1 commented 1 year ago

@yuekun0707 @lxcmyf is there any updated plan or estimates for timeout increase ?

lxcmyf commented 1 year ago

@yuekun0707 @lxcmyf is there any updated plan or estimates for timeout increase ?

@r0wdy1 300 transactions may not be enough. We would like to see a large number of transactions sent through your deployed contracts in a short period of time, such as 10000 transactions, to see the timeout situation under pressure testing, or are you already in testing?

r0wdy1 commented 1 year ago

a large number of transactions sent through your deployed contracts

this is quite hard to do because of all the checks that contracts have and because every transaction needs a proof, but we will try to come up with a solution shortly. At the very least what we can do is to deploy a dummy contract that calls the pairing precompiled contract and some more EC operations

are you already in testing?

what do you mean?

r0wdy1 commented 1 year ago

@lxcmyf

We would like to see a large number of transactions sent through your deployed contracts in a short period of time, such as 10000 transactions

I'm sending more than 1000 transactions in batches of 10 tx every second. I can't send requests more often because of rate limitations on https://nile.trongrid.io endpoint If there are other endpoints that we c

ould use, then we could further increase the load In regards of timeout so far so good. Here's a exported responses from rpc for all of txs results.zip The list of transactions can also be seen on tronscan

If it's possible could you please send some more TRX, since we're currently both doing open beta testing with users and load testing

lxcmyf commented 1 year ago

@lxcmyf

We would like to see a large number of transactions sent through your deployed contracts in a short period of time, such as 10000 transactions

I'm sending more than 1000 transactions in batches of 10 tx every second. I can't send requests more often because of rate limitations on https://nile.trongrid.io endpoint If there are other endpoints that we c

ould use, then we could further increase the load In regards of timeout so far so good. Here's a exported responses from rpc for all of txs results.zip The list of transactions can also be seen on tronscan

If it's possible could you please send some more TRX, since we're currently both doing open beta testing with users and load testing

@r0wdy1 As you mentioned, the address has been recharged with 3000000trx again.

r0wdy1 commented 1 year ago

@lxcmyf thank you. We've launched a load testing script that sends transaction simultaneously from 5 machines You can see here more than 7000 txs: https://nile.tronscan.org/#/contract/TAnCQk9KNvuvuQgEDXxXjyaU8kRiBJNGxw/transactions We had quite a lot of 502 responses from RPC endpoint https://nile.trongrid.io/wallet/broadcasttransaction but nevertheless all of the transactions that were accepted by RPC later were successfully processed

lxcmyf commented 1 year ago

@r0wdy1 It seems that the results are good. We will analyze the packaging of these transactions to see if there are any issues, or do you need further assistance from us?

r0wdy1 commented 1 year ago

The most important questions left for us is Energy consumption for EC operation. Right now it's somewhat expensive: processing just a 100 transaction per day would require more than 3 000 000 TRX staked, and allocating these cost to users would make the service less attractive ( around 40-50 TRX per tx ) . Burning or renting would be even more costly. The adoption of the original optimization PR would solve that obviously, because the actual cost of computation would decrease at least by a factor of 10 and that would justify Energy cost reduction. So our question is : is it somehow possible to reduce the Energy cost for EC operations in advance since we have agreed that the PR is going to be merged somewhere down the road?

r0wdy1 commented 1 year ago

@lxcmyf have you changed the EC precompiles somehow? Now they fail because of timeout https://nile.tronscan.org/#/transaction/bf2eb4e7827fbd8840a84725bbc1186c8eae4640c1c26d908b09785e505813f9

lxcmyf commented 1 year ago

@lxcmyf have you changed the EC precompiles somehow? Now they fail because of timeout https://nile.tronscan.org/#/transaction/bf2eb4e7827fbd8840a84725bbc1186c8eae4640c1c26d908b09785e505813f9

@r0wdy1 This is due to the parameter changes in Proposal #13, Due to temporary testing, it was previously 350ms, but now it has been adjusted back to 160ms.

r0wdy1 commented 1 year ago

When will it be increased on Nile again? We have public betta testing going on, and unfortunately currently nothing works.

And what are the estimates for this proposal to be executed on main net? Can we help it somehow from our side?

lxcmyf commented 1 year ago

When will it be increased on Nile again? We have public betta testing going on, and unfortunately currently nothing works.

And what are the estimates for this proposal to be executed on main net? Can we help it somehow from our side?

@r0wdy1 The proposal value for the main website is currently 80ms. If you want to conduct more tests on the Nile testing net regarding this contract transaction, we will apply to adjust it to a larger value again.

r0wdy1 commented 1 year ago

@lxcmyf , yes we need it increased on Nile.

But what is more important is that with 80 ms timeout and without this PR the EC operations don't work on main net either. The idea behind this issue and the corresponding PR was to make elliptic curve pairings and multiplication work on the main net. After build problems and test coverage were fixed we were told that the PR as it was presented is too risky and that as temporary solution you could increase the timeout ( there was even a similar issue that is almost two years old ). We have also performed some stress testing for the same exact reason - to find out how it would work on the main net. So right now we're waiting for this problem to be fixed on the main net one way or another. Since the timeout is faster and less risky - fine. Could you please tell what is needed to be done in order to pass a similar change proposal on the main net?

lxcmyf commented 1 year ago

@lxcmyf , yes we need it increased on Nile.

But what is more important is that with 80 ms timeout and without this PR the EC operations don't work on main net either. The idea behind this issue and the corresponding PR was to make elliptic curve pairings and multiplication work on the main net. After build problems and test coverage were fixed we were told that the PR as it was presented is too risky and that as temporary solution you could increase the timeout ( there was even a similar issue that is almost two years old ). We have also performed some stress testing for the same exact reason - to find out how it would work on the main net. So right now we're waiting for this problem to be fixed on the main net one way or another. Since the timeout is faster and less risky - fine. Could you please tell what is needed to be done in order to pass a similar change proposal on the main net?

@r0wdy1 At present, the Nie test net can be adjusted to 350ms for your testing, but the main net adjustment timeout is still being evaluated, and it is difficult to adjust it to 300ms or even longer. At this stage, it is unlikely to be affected. In addition, we have a dedicated team to track and evaluate the optimization PR you mentioned, and we will synchronize it in a timely manner if there is progress.

r0wdy1 commented 1 year ago

@lxcmyf Sorry, but lot's of questions below since after almost 3 months of productive cooperation this comment feels like a roadblock, so we need to decide what to do next

it is difficult to adjust it to 300ms or even longer. At this stage, it is unlikely to be affected.

First of all, it's not clear what was the purpose of increasing it on the testnet and why we were supposed to test it following the comment from @yuekun0707 , and did we have to make the stress test

Secondly, is there a reason why it shouldn't be increased on the mainnet? After the stress test there was a comment about literally everything looking good, did anything happened afterwards ?

Can we please get some plan or estimates for decision process on you side?

lxcmyf commented 1 year ago

@lxcmyf Sorry, but lot's of questions below since after almost 3 months of productive cooperation this comment feels like a roadblock, so we need to decide what to do next

it is difficult to adjust it to 300ms or even longer. At this stage, it is unlikely to be affected.

First of all, it's not clear what was the purpose of increasing it on the testnet and why we were supposed to test it following the comment from @yuekun0707 , and did we have to make the stress test

Secondly, is there a reason why it shouldn't be increased on the mainnet? After the stress test there was a comment about literally everything looking good, did anything happened afterwards ?

Can we please get some plan or estimates for decision process on you side?

@r0wdy1 Increasing the timeout time is to successfully execute the contract transactions deployed by you with the minimum impact range. When the transaction timeout time on the Nile testing network was set to 350ms, we analyzed your latest stress test on the Nile testing network and found that each block can package up to 4 contract transactions deployed by you, and according to our observation, this period resulted in a large number of other ordinary transactions being unable to be linked, This may lead to cyber attacks, although your contract transaction fees are already very expensive, compared to the potential attack risks, the attack costs are still too low, and the costs may be further increased to avoid risks. This is also why the main network is unlikely to increase the timeout time.

yuekun0707 commented 1 year ago

@lxcmyf Sorry, but lot's of questions below since after almost 3 months of productive cooperation this comment feels like a roadblock, so we need to decide what to do next

it is difficult to adjust it to 300ms or even longer. At this stage, it is unlikely to be affected.

First of all, it's not clear what was the purpose of increasing it on the testnet and why we were supposed to test it following the comment from @yuekun0707 , and did we have to make the stress test

Secondly, is there a reason why it shouldn't be increased on the mainnet? After the stress test there was a comment about literally everything looking good, did anything happened afterwards ?

Can we please get some plan or estimates for decision process on you side?

hi, @r0wdy1, we have got the data of the test on the nile environment, and will analyze the influence on the chain when large number of these transactions happen. Just like @lxcmyf said, before these analysis work are done, we can't make a decision to modify the timeout time of the main network. About the JNI, because we can't get the progress of the corresponding members now, I want to know whether there are any other possible ways to optimize this issue without JNI?

r0wdy1 commented 1 year ago

About the JNI, because we can't get the progress of the corresponding members now, I want to know whether there are any other possible ways to optimize this issue without JNI?

It's very doubtful that there is an efficient implementation of elliptic curve pairings in pure Java. Wrapping c++ or rust code using JNI is the only approach that I've seen in projects where performance is required

The only optimization with current implementation that we can do is to break our current contract call (which executes two pairing checks) into two transactions , so that each of them would have a single proof verification. That would require a roughly twice less timeout value

There also could be some space for optimization with compilation flags, but I'm not really sure

r0wdy1 commented 1 year ago

@yuekun0707 @lxcmyf We can try to make a low level proof of concept for EC multiplication that doesn't use BigInt. Is there a chance of that code to be merged ? However that is basically rewriting arkworks in Java, which is worse that using arkworks since it's indeed battle tested and audited and comes from a well known team

lxcmyf commented 1 year ago

@r0wdy1 We are currently unable to grasp the progress of the evaluation work on the JNI implementation of this PR. If you can achieve similar results without introducing JNI, that would be the best, and it should be possible to merge the code more smoothly.

r0wdy1 commented 1 year ago

If you can achieve similar results without introducing JNI, that would be the best, and it should be possible to merge the code more smoothly.

I'm afraid JVM features and specifically garbage collection make it impossible to reach the same performance levels like native applications. As I said we can try to implement at least mathematical optimizations, but this is not a small effort to put it mildly, and besides from developing there would be required a very thorough review and audit. The JNI approach is hands down best solution performance wise even with all the overhead that bindings add

lxcmyf commented 1 year ago

Yes, this will not be a simple task.

xiangjie256329 commented 12 months ago

为什么eth不会有这个问题,是不是跑不了zk proof...可以在doc里面给出建议,跑不了zk的智能合约,普通的操作都会超时.

r0wdy1 commented 11 months ago

@xiangjie256329

"Why doesn't Ethereum have this issue? Is it because it cannot run zk proofs? Suggestions can be provided in the documentation for smart contracts that cannot run zk proofs, as normal operations will time out."

This issue is not network specific but rather related to implementation. For example in geth the assembly implementation is preferable ( eg amd64 ) Generally speaking their approach is the same thing as previously suggested JNI PR - offloading math-heavy operations to optimized native code pieces included as block boxes. Of course using assembly code makes code less readable and the places where integration with low level code happens are tricky and need a thorough audit, but this is a reasonable trade-off and the only way to go if you're using high level language for the platform and require a good performance at the same time. High level languages with virtual machine and GC give a lot of benefits for developer experience, robustness, but it comes with a certain cost. The alternative way ditch JVM and GC and use C++ or similar languages for everything would give the required performance but at even higher cost and it's obviously not a viable option

We have modified the original field operation implementation to support Montgomery form in pure Java with approximate gains for pairings around 33% (https://github.com/tronprotocol/java-tron/pull/5611) We also plan to move two pairing checks to different transactions, so we're aiming to reduce required execution time about 3 times (whereas JNI would 30x). This seems to be the only low-hanging fruit , and even with this optimization we would still need the timeout around 120 ms for transactions to pass @lxcmyf , 120ms seems to be a reasonable timeout comparing to 350 ms , don't you think?

lxcmyf commented 11 months ago

@r0wdy1 Thank you and your team for your efforts. It would be great if we could further improve and optimize this timeout. Have you conducted rigorous testing to verify the feasibility of this approach? Is 120ms sufficient to meet your minimum requirement for deploying this contract transaction? We will simultaneously evaluate this improvement.

xiangjie256329 commented 11 months ago

@xiangjie256329

"Why doesn't Ethereum have this issue? Is it because it cannot run zk proofs? Suggestions can be provided in the documentation for smart contracts that cannot run zk proofs, as normal operations will time out."

This issue is not network specific but rather related to implementation. For example in geth the assembly implementation is preferable ( eg amd64 ) Generally speaking their approach is the same thing as previously suggested JNI PR - offloading math-heavy operations to optimized native code pieces included as block boxes. Of course using assembly code makes code less readable and the places where integration with low level code happens are tricky and need a thorough audit, but this is a reasonable trade-off and the only way to go if you're using high level language for the platform and require a good performance at the same time. High level languages with virtual machine and GC give a lot of benefits for developer experience, robustness, but it comes with a certain cost. The alternative way ditch JVM and GC and use C++ or similar languages for everything would give the required performance but at even higher cost and it's obviously not a viable option

We have modified the original field operation implementation to support Montgomery form in pure Java with approximate gains for pairings around 33% (#5611) We also plan to move two pairing checks to different transactions, so we're aiming to reduce required execution time about 3 times (whereas JNI would 30x). This seems to be the only low-hanging fruit , and even with this optimization we would still need the timeout around 120 ms for transactions to pass @lxcmyf , 120ms seems to be a reasonable timeout comparing to 350 ms , don't you think?

超时是不是总是出现在固定的几台机器,是否可以考虑增加节点机器的性能

r0wdy1 commented 11 months ago

Thank you and your team for your efforts. It would be great if we could further improve and optimize this timeout. Have you conducted rigorous testing to verify the feasibility of this approach? Is 120ms sufficient to meet your minimum requirement for deploying this contract transaction? We will simultaneously evaluate this improvement.

This change was made intentionally small, which is why the gains are somewhat limited. Rewriting BigInt could give some more benefits but it's much more challenging and error prone. The only thing that was changed for now is that instead of doing modulo division between squaring/multiplication we do a much cheaper binary right shift. We have used the same test suite from geth as with JNI PR for checking correctness of all of the edge cases and whatnot Yes, we expect 120ms to be enough, but of course testing on Nile will show whether these estimates are correct

r0wdy1 commented 11 months ago

@xiangjie256329

Does timeout always occur on a few individual machines? Is it possible to consider increasing the performance of node machines?

No, it's related to performance of unoptimized code on recommended node configuration. One could spin up some supercomputer that would chew through even that, but this is of course an unviable option. The only option is to optimize the execution and raise timeout appropriately if required so that transactions are processed but spamming doesn't become to easy at the same time

lxcmyf commented 11 months ago

@r0wdy1 OK, we will prepare to adjust it to 120ms on Nile as soon as possible, and you can further test it on Nile at that time. Have PR been formed now? We can take a preliminary look.

r0wdy1 commented 11 months ago

@lxcmyf yes , it's ready now

xiangjie256329 commented 11 months ago

主网有计划增加到120ms吗?不然在nile上测试通过了,主网仍然会有问题

@lxcmyf yes , it's ready now