Gauthamastro closed this issue 3 years ago
Can you describe your error better? I don't understand what the exact error is. Thank you.
I am trying to simulate trades from a centralized exchange in our Polkadex node. I am using the following code snippet to submit transactions to the node.
api.tx.polkadex.submitOrder("BidLimit", tradingPairID, price_converted, quantity_converted)
  .signAndSend(alice, { nonce: alice_nonce }, (status) => {
    console.log(status.status.toHuman());
  });
Please check this for the complete script that simulates trades from a centralized exchange.
I am submitting about 150-200 transactions per second to a node. Everything works and all transactions are executed when the node is started with the --dev flag, but when I create a local testnet and simulate the trades using the same Node.js code, it shows Future: null after roughly 3 minutes.
I think there is some safety check enforced in local testnet mode but not in --dev mode, which is preventing the submission of a large number of valid transactions from a single account.
I was able to reproduce this behavior locally, but I am still unsure as to whether or not it constitutes unexpected behavior or a bug. First observation: consider upgrading to v2.0.0 as soon as possible.
It appears that when the node is run in development mode, it can handle a large number of repeated ongoing transactions without any problem. However, if two nodes are spun up locally to emulate a private network per this tutorial https://substrate.dev/docs/en/tutorials/start-a-private-network/alicebob, the node does not seem to be able to handle the same transaction-sending script. After a few minutes, the Polkadot-JS API reports the transaction status as
{ Dropped: null }
{ Future: null }
@Gauthamastro - can you please confirm that this :point_down: is the question?
Why does the node behave one way when run in single-node development mode and another way when run as a local private network?
My initial thinking is that testing the behavior of a Substrate-based blockchain by observing a locally emulated private network is not really realistic and the outcome will depend on the hardware you're running on and what other processes are running on the machine.
Yes, that is my question. I have tried this with a local private network on a single machine and also on two different machines in the cloud. Both gave the same result.
@danforbes, I am sharing my research about this issue.
Increasing or decreasing the block time has no effect.
Checking the number of pending extrinsics in the pool reveals only about 45-90, and mostly fewer than 40, even though we are sending more than 100-150 transactions per second. My guess is that some check is dropping transactions before they reach the pool.
CPU and memory usage during trials hover around 6-10% for CPU and 4-6% for memory on AWS c4.xlarge instances, so the machine's capacity most probably has nothing to do with this error.
The runtime has no problem executing the load on a single-node development node, only on the private network. Not a single transaction is dropped in --dev mode.
In this issue https://github.com/paritytech/substrate/issues/2308 it seems there was a limit on the number of transactions from a single account in a single block, but it has since been removed. I checked the latest CheckNonce code.
In my script, I am manually incrementing the nonce value, so this problem can occur if the Polkadot-JS API drops one of the transactions: every transaction after the dropped one is then tagged Future by the node. This is one plausible origin of the issue, and I am probing into it.
I hope this can give @danforbes some insights into the problem.
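One way to make that hypothesis less fragile is to re-sync the nonce from the chain instead of blindly incrementing a local counter. This is my own sketch, not part of the original script: `api.rpc.system.accountNextIndex` is the standard Polkadot-JS RPC that reports the next usable nonce (including txs still in the pool), and the pure helper below just guards against the chain lagging behind a local counter.

```javascript
// Pure helper: choose the nonce to sign with, given the chain-reported
// next index and a locally tracked counter. If a tx was dropped, the
// chain value "catches" the local counter back to reality.
function resolveNonce(onChainNext, localNext) {
  return Math.max(onChainNext, localNext);
}

// Hypothetical usage with Polkadot-JS (assumes `api`, `alice`, `bob` as in
// the scripts in this thread; not verified against a live node):
// const onChain = await api.rpc.system.accountNextIndex(alice.address);
// const nonce = resolveNonce(onChain.toNumber(), localNext);
// await api.tx.balances.transfer(bob.address, 123).signAndSend(alice, { nonce });
// localNext = nonce + 1;

console.log(resolveNonce(5, 3)); // 5: a dropped tx left the local counter behind
console.log(resolveNonce(5, 8)); // 8: pool txs not yet reflected on-chain
```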
My initial thinking is that testing the behavior of a Substrate-based blockchain by observing a locally emulated private network is not really realistic and the outcome will depend on the hardware you're running on and what other processes are running on the machine.
This is an interesting and potentially important issue. Let me see if I can use docker to get 3 nodes up running and sending 100 - 200 txs there. Will report back.
This also means that having a 3-node local testnet setup is a key testing safeguard.
@Gauthamastro thanks for the report, and @danforbes thanks for the investigation.
Thank you, @jimmychu0807. Per @danforbes's recommendation, I am upgrading the Polkadex node to Substrate v2.0.0 and using substrate-subxt to submit the transactions. If it is a problem with the Polkadot-JS API, the issue will not appear with subxt. I will update here.
The validator receiving the load from substrate-subxt:
2020-11-27 15:16:18.009 tokio-runtime-worker INFO sc_consensus_slots 🔖 Pre-sealed block for proposal at 49. Hash now 0xe110f5795c3bbda7a79dd8a3926cd11994c69c667147d924f4e364905533a2a0, previously 0x8b4feb79d03e523451d163687699e873db8da3dcb42df2a173e4e4a3fd7fbdb0.
2020-11-27 15:16:18.010 tokio-runtime-worker INFO substrate ✨ Imported #49 (0xe110…a2a0)
2020-11-27 15:16:21.642 tokio-runtime-worker INFO substrate 💤 Idle (1 peers), best: #49 (0xe110…a2a0), finalized #47 (0xbc8d…3f8b), ⬇ 0.8kiB/s ⬆ 3.7kiB/s
2020-11-27 15:16:26.642 tokio-runtime-worker INFO substrate 💤 Idle (1 peers), best: #49 (0xe110…a2a0), finalized #47 (0xbc8d…3f8b), ⬇ 1.1kiB/s ⬆ 27.1kiB/s
2020-11-27 15:16:30.325 tokio-runtime-worker INFO substrate ✨ Imported #50 (0x38c7…c9e4)
2020-11-27 15:16:30.330 tokio-runtime-worker INFO sc_basic_authorship::basic_authorship 🙌 Starting consensus session on top of parent 0xe110f5795c3bbda7a79dd8a3926cd11994c69c667147d924f4e364905533a2a0
2020-11-27 15:16:30.585 tokio-blocking-driver WARN txpool [0xb4ebe80aa75f76a459bb6fe49e681bca6294b28000146b094a821043d6f408d8] Extrinsic invalid
2020-11-27 15:16:30.585 tokio-blocking-driver WARN txpool [0x397977a06e28040fba41406b220987ea81c921106b8dba20b8c48b2e2b9beebe] Extrinsic invalid
2020-11-27 15:16:30.585 tokio-blocking-driver WARN txpool [0xa534e6e46d365d58e97e4e302d1f445320a4eaf581b3686de9be5f6c73a8152b] Extrinsic invalid
2020-11-27 15:16:30.585 tokio-blocking-driver WARN txpool [0x2404fe918e406dafa0b1b12ac7e76026253d3a25fc4891bdad680ab01843e75f] Extrinsic invalid
From the 2nd validator in the network:
INFO sc_basic_authorship::basic_authorship 🙌 Starting consensus session on top of parent 0xe110f5795c3bbda7a79dd8a3926cd11994c69c667147d924f4e364905533a2a0
INFO sc_basic_authorship::basic_authorship 🎁 Prepared block for proposing at 50 [hash: 0xe84eae3cabe44faf3063eb8b38af0b4d9e753c6905f91324811aab769df428e5;
parent_hash: 0xe110…a2a0; extrinsics (681): [0x658f…4426, 0xbcdf…9e75, 0xe08a…40e9, 0xdff6…9e8b, 0xda8d…0e39, 0xcdb7…b823, 0xc78b…de8e, 0x27fa…a7cb, 0x315f…fa22, 0x7e1c…34c7, 0x9d8b…7c30, 0xa11a…9880, 0x5951…60f0, 0xd6d4…e67a,
Block time is 6 seconds. I also noted that when I tested with a plain "cargo build" (debug mode), "block builder failed to create the block in time" errors appeared as well.
Does anyone know why and when an Extrinsic is considered invalid?
@Gauthamastro I had a look at your runtime.
Block time: 2 seconds
MaximumBlockWeight: 2 / 3 seconds (~ 0.667 s)
In principle, this follows the rule of thumb of block weight < (block time / 2).
But still, this is a rather short block time and max block weight.
Remember, a whole lot has to happen during the time expressed by the max block weight (block authoring, validating, network propagation / syncing).
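As a quick sanity check on those numbers (my own arithmetic, not from the thread): Polkadot and Kusama set the max block weight to one third of the target block time, while the rule of thumb above allows up to one half.

```javascript
// Illustrative helper (my naming): max block weight as a fraction of the
// target block time, passed as numerator/denominator to keep the math exact.
function maxBlockWeightMs(blockTimeMs, num, den) {
  return (blockTimeMs * num) / den;
}

console.log(maxBlockWeightMs(6000, 1, 3)); // 2000 -> Polkadot/Kusama: 6 s blocks, 2 s max weight
console.log(maxBlockWeightMs(2000, 1, 3)); // ~667 -> this runtime: 2 s blocks, ~0.667 s max weight
console.log(maxBlockWeightMs(2000, 1, 2)); // 1000 -> rule-of-thumb upper bound for 2 s blocks
```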
Below is a weight breakdown for a 6-second target block time and a 2-second max block weight:
Your runtime's parameters
As a comparison, see Polkadot & Kusama's runtime parameters. https://guide.kusama.network/docs/en/kusama-parameters https://wiki.polkadot.network/docs/en/maintain-polkadot-parameters
Could you retry your test case with a 6-second target block time and a 2-second maximum block weight (the same as Polkadot & Kusama)? Then, assuming txs are processed properly, and only then, tune those params down with careful testing at every step.
Hi @stiiifff, thanks for your review. I have tested it with 2, 3, 6, and even 12-second block times following the same rule, but the error is still there. I also cloned the latest substrate-node-template (Aura) with a 2-second block time, and it works! At least it hasn't failed in the last 10 hrs (or 700K transactions).
I am not exactly sure whether it's a problem with BABE or something else is wrong.
@jimmychu0807 have you been able to investigate this issue further? Do you have any suggestions for other people that we could involve if this is, in fact, an important issue?
Did I read correctly that the node is still using pre-2.0? How old?
We had some fixes for the transaction pool which may fix the problems described here.
@bkchr Yes, I have updated the code to Substrate 2.0. Is it possible to get a substrate node template with BABE installed to check whether I did something wrong while converting AURA to BABE while developing Polkadex?
The default Substrate node is using BABE.
Are you referring to this one?
Yes
Sure, I will try that.
Hi, @Gauthamastro - do you have an update on this Issue?
@danforbes Unfortunately, I couldn't find time to work on this issue. I will update here by the weekend. I am closing this issue for now.
May I keep this issue open for another two weeks? This issue is actually queued in my todo list to investigate...
Thank you, @jimmychu0807 🙏🏻
@jimmychu0807 @danforbes @bkchr I have tried this. The issue is still there. I tried 3 s and 6 s block times; neither worked.
The code I used to test the Polkadex node (Substrate master): load-generator script.
I reduced the number of transactions and it worked, but as soon as the TPS increased, the nodes started crashing.
The original load was around 20-30 TPS, but anything under 10 TPS worked fine. Also note that this issue is not present in AURA.
@jimmychu0807 @danforbes Are there any updates on this?
@Gauthamastro I am going thru this issue today. Sorry for my late reply
Thanks, @jimmychu0807. I have updated Polkadex to track Substrate master. You can access the code we used for testing here, and the script used for load generation here.
There is a chance that the node might panic; that issue is being tracked here: https://github.com/paritytech/substrate/issues/7794.
Please use --no-mdns to avoid panicking.
@gautamdhameja I wrote a small script and tested submitting balances.transfer txs to a Substrate node at various frequencies. I am using the Substrate node (so Babe, not substrate-node-template).
I find that even if I tweak it up to 200 tx/s (practically it is about 30 tx/s unless I add batched-tx logic), Substrate gladly accepts them, puts them in blocks, and finalizes them.
Could you try it first with balances.transfer, and then switch over to your pallet extrinsic and send txs to Substrate? If it fails, then maybe it is something specific to the pallet extrinsic.
Let me know how it goes. Thanks
@Gauthamastro Rereading your msg: when you say you run in local_testnet mode, do you start Substrate as ./target/<debug, release>/substrate --tmp, like this? Or please paste the cmd and the chain_spec.json showing how you start the node. Thanks.
@jimmychu0807 I will try again with the new script.
../target/release/polkadex-mainnet \
--base-path /tmp/alice \
--chain customSpecRaw.json \
--alice \
--port 30333 \
--ws-port 9944 \
--node-key 0000000000000000000000000000000000000000000000000000000000000001 \
--validator
Also, the chain spec is created using the following commands:
../target/release/polkadex-mainnet build-spec --disable-default-bootnode --chain local > customSpec.json
../target/release/polkadex-mainnet build-spec --chain=customSpec.json --raw --disable-default-bootnode > customSpecRaw.json
@Gauthamastro I tried a 3-node setup running with --chain local, using Substrate at this version on the master branch. I was able to use my txsubmitter to send about 20-30 tx/s (~720 balances.transfer txs in 40 s) and it ran fine. Let me know how it goes on your side (first try with balances.transfer and then your own pallet call).
Hey @jimmychu0807, sorry for being late. We tried your script; it works, but only because it stops loading the blockchain with new transactions after the given time period.
We tried replacing the Polkadex pallet with the Balances pallet for load testing, and it failed again like before.
let alice_nonce = 0;
binance.websockets.trades(['BTCUSDT'], (trade) => {
  // We had the Polkadex submitOrder here earlier; replacing it with a Balances transfer still fails
  api.tx.balances.transfer(bob.address, 123).signAndSend(alice, { nonce: alice_nonce });
  alice_nonce = alice_nonce + 1;
});
Here is a short screen capture of it failing.
@jimmychu0807 @bkchr @danforbes Also, please note that this issue is not present in AURA with the same configuration, for both the Polkadex and Balances pallets. It looks super weird to me. Please let me know how we can approach this issue and fix it!
@Gauthamastro can you check the log if it contains "reorg" around the time it starts to reject extrinsics?
No, it didn't do a chain re-org. Here are screenshots of the logs of the two failing validators (out of 3) just before failing.
Validator receiving the load
Another validator that failed at almost the same time as the first one
I modified the script provided by @jimmychu0807. Here it is:
// Initialize Binance
const Binance = require('node-binance-api');
const binance = new Binance().options({
  APIKEY: '<key>',
  APISECRET: '<secret>'
});
const { ApiPromise, WsProvider, Keyring } = require('@polkadot/api');

// Substrate connection config
const WEB_SOCKET = 'ws://localhost:9944';
const TYPES = {};
// This script will wait for n secs before stopping itself
const LASTING_SECS = 30;
const ALICE = '//Alice';
const BOB = '//Bob';
// This is the tx send rate per second; the minimum is 1 tx/s
const TX_FREQUENCY = 20;
// This is 1 Unit
const TX_AMT = 1000000000000000;

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const connectSubstrate = async () => {
  const wsProvider = new WsProvider(WEB_SOCKET);
  const api = await ApiPromise.create({ provider: wsProvider, types: TYPES });
  return api;
};

// This function returns a tx unsubscription handler
const submitTx = async (api, src, dest, amt, txCnt, nonce) =>
  await api.tx.balances.transfer(dest.address, amt)
    .signAndSend(src, { nonce }, res => {
      console.log(`Tx ${txCnt} status: ${res.status}`);
    });

const main = async () => {
  const api = await connectSubstrate();
  const keyring = new Keyring({ type: 'sr25519' });
  console.log('Connected to Substrate');

  let txCnt = 0;
  let nonce = 0;
  const alice = keyring.addFromUri(ALICE);
  const bob = keyring.addFromUri(BOB);

  // setInterval(() => {
  //   txCnt += 1;
  //   submitTx(api, alice, bob, TX_AMT, txCnt, nonce);
  //   nonce += 1;
  // }, 1000 / TX_FREQUENCY);
  // await sleep(LASTING_SECS * 1000);

  let alice_nonce = 0;
  binance.websockets.trades(['BTCUSDT'], (trade) => {
    api.tx.balances.transfer(bob.address, 123).signAndSend(alice, { nonce: alice_nonce });
    alice_nonce = alice_nonce + 1;
  });
  await sleep(LASTING_SECS * 1000000);
};

main()
  .then(() => {
    console.log('successfully exited');
    process.exit(0);
  })
  .catch(err => {
    console.log('error occurred:', err);
    process.exit(1);
  });
Validator Receiving the load
This log stops right at the relevant point. As you can see, it imports blocks 51 and 52 a second time, probably built by someone else. That a reorg happened isn't printed directly. I think we are doing a reorg and hitting this: https://github.com/paritytech/substrate/issues/5139
Ohh yeah, I missed that reorg; I was looking for a reorg log line to be printed. Since https://github.com/paritytech/substrate/issues/5139 is not fixed yet, is there any solution other than changing the way we load the blockchain with transactions?
I think sending the transactions from multiple accounts can solve this issue. Please correct me if I am wrong.
If you send it from multiple accounts, it should work for now.
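A minimal sketch of that multi-account workaround (my own, with illustrative account URIs): rotate sends across several funded accounts, each with its own nonce counter, so a dropped tx in one account's queue doesn't tag the other accounts' txs as Future.

```javascript
// Round-robin sender rotation with a per-sender nonce counter.
// The URIs are illustrative dev accounts; in practice every sender must be funded.
const SENDER_URIS = ['//Alice', '//Bob', '//Charlie'];

// Returns a function that yields the next sender index and the nonce to sign with.
function makeRotation(senderCount) {
  const nonces = new Array(senderCount).fill(0);
  let i = -1;
  return () => {
    i = (i + 1) % senderCount;
    return { sender: i, nonce: nonces[i]++ };
  };
}

// Hypothetical usage (assumes `api`, `keyring`, and `bob` as in the scripts above):
// const pairs = SENDER_URIS.map(uri => keyring.addFromUri(uri));
// const next = makeRotation(pairs.length);
// const { sender, nonce } = next();
// api.tx.balances.transfer(bob.address, 123).signAndSend(pairs[sender], { nonce });

const next = makeRotation(3);
console.log(next()); // { sender: 0, nonce: 0 }
console.log(next()); // { sender: 1, nonce: 0 }
console.log(next()); // { sender: 2, nonce: 0 }
console.log(next()); // { sender: 0, nonce: 1 }
```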
Alright. Thanks!
This is a great resource to reference in the future. But I think it's OK to close? Please reopen if you are still actively working on it 🙏🏼
I am able to simulate trades from a centralized exchange in development mode (--dev), but it results in a Future: null status when tried in local_testnet mode.
Please check this branch for our code: https://github.com/Polkadex-Substrate/Polkadex/tree/feature/testnet-fix-rpc-orderbook-updates
I am using simulate_trades.js in our repo for simulating the trades.
I think it is related to some check that is not enabled in --dev mode but enforced in local_testnet mode.
Found Similar Issue:
Please help.