paritytech / subport

Parity Substrate(-based) chains usage and development support
https://docs.substrate.io
Apache License 2.0

Future: Null error in local_testnet mode but not in single-node --dev mode #52

Closed Gauthamastro closed 3 years ago

Gauthamastro commented 3 years ago

I am able to simulate trades from a centralized exchange in development mode (--dev), but the same test results in a Future: null error in local_testnet mode.

Please check this branch for our code: https://github.com/Polkadex-Substrate/Polkadex/tree/feature/testnet-fix-rpc-orderbook-updates

I am using simulate_trades.js in our repo for simulating the trades.

I think it is related to some check that is not enabled in --dev mode but is enforced in local_testnet mode.

Similar issues found:

  1. https://github.com/paritytech/substrate/issues/2308 but this seems to be solved
  2. https://docs.rs/frame-system/2.0.0/src/frame_system/extensions/check_nonce.rs.html#83 not sure whether this is causing any issue.

Please help.

bkchr commented 3 years ago

Can you describe your error better? I don't understand what the exact error is. Thank you.

Gauthamastro commented 3 years ago

Can you describe your error better? I don't understand what the exact error is. Thank you.

I am trying to simulate trades from a centralized exchange in our Polkadex node. I am using the following code snippet to submit transactions to the node.

    api.tx.polkadex.submitOrder("BidLimit", tradingPairID, price_converted, quantity_converted)
        .signAndSend(alice, { nonce: alice_nonce }, (status) => {
            console.log(status.status.toHuman());
        });

Please check this for the complete script that simulates trades from a centralized exchange.

I am submitting about 150-200 transactions per second to a node. It was executing all transactions sent to it when started with the --dev flag, but when I create a local testnet and simulate the trades using the same Node.js code, it starts reporting Future: null after maybe 3 minutes or so.

I think there is some safety check enforced in local testnet mode but not in --dev mode, which prevents submitting a large number of valid transactions from a single account.
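For reference, a transaction is reported as Future when its nonce is ahead of the account's next expected nonce, so any gap in a locally tracked nonce sequence strands every later transaction. Below is a minimal sketch, assuming a node at ws://localhost:9944 and the dev //Alice key, of seeding the local counter from the chain instead of starting at zero:

    // Seed the local nonce counter from the chain instead of assuming 0;
    // system_accountNextIndex also counts transactions waiting in the pool.
    const { ApiPromise, WsProvider, Keyring } = require('@polkadot/api');

    const main = async () => {
        const api = await ApiPromise.create({ provider: new WsProvider('ws://localhost:9944') });
        const alice = new Keyring({ type: 'sr25519' }).addFromUri('//Alice');

        let nonce = (await api.rpc.system.accountNextIndex(alice.address)).toNumber();

        // Each submission uses and increments the local counter; a skipped
        // value here would leave all later transactions stuck as Future.
        await api.tx.balances.transfer(alice.address, 1)
            .signAndSend(alice, { nonce: nonce++ });
    };

    main().catch(console.error);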

danforbes commented 3 years ago

I was able to reproduce this behavior locally, but I am still unsure as to whether or not it constitutes unexpected behavior or a bug. First observation: consider upgrading to v2.0.0 as soon as possible.

It appears that when the node is run in development mode, it can handle a large number of repeated, ongoing transactions without any problem. However, if two nodes are spun up locally to emulate a private network per this tutorial (https://substrate.dev/docs/en/tutorials/start-a-private-network/alicebob), the node does not seem to be able to handle the same transaction-sending script. After a few minutes, the Polkadot-JS API reports this as the transaction status:

{ Dropped: null }
{ Future: null }

@Gauthamastro - can you please confirm that this 👇 is the question?

Why does the node behave one way when run in single-node development mode and another way when run as a local private network?

My initial thinking is that testing the behavior of a Substrate-based blockchain by observing a locally emulated private network is not really realistic and the outcome will depend on the hardware you're running on and what other processes are running on the machine.

Gauthamastro commented 3 years ago

I was able to reproduce this behavior locally, but I am still unsure as to whether or not it constitutes unexpected behavior or a bug. First observation: consider upgrading to v2.0.0 as soon as possible.

It appears that when the node is run in development mode, it can handle a large number of repeated, ongoing transactions without any problem. However, if two nodes are spun up locally to emulate a private network per this tutorial (https://substrate.dev/docs/en/tutorials/start-a-private-network/alicebob), the node does not seem to be able to handle the same transaction-sending script. After a few minutes, the Polkadot-JS API reports this as the transaction status:

{ Dropped: null }
{ Future: null }

@Gauthamastro - can you please confirm that this 👇 is the question?

Why does the node behave one way when run in single-node development mode and another way when run as a local private network?

My initial thinking is that testing the behavior of a Substrate-based blockchain by observing a locally emulated private network is not really realistic and the outcome will depend on the hardware you're running on and what other processes are running on the machine.

Yes, that is my question. I have tried this with a local private network on a single machine and also on two different machines in the cloud. Both gave me the same result.

Gauthamastro commented 3 years ago

@danforbes, I am sharing my research about this issue.

I hope this can give @danforbes some insights into the problem.

jimmychu0807 commented 3 years ago

My initial thinking is that testing the behavior of a Substrate-based blockchain by observing a locally emulated private network is not really realistic and the outcome will depend on the hardware you're running on and what other processes are running on the machine.

This is an interesting and potentially important issue. Let me see if I can use Docker to get 3 nodes up and running and send 100-200 txs there. Will report back.

This also means that having a 3-node local testnet setup is a key testing safeguard.

@Gauthamastro thanks for the report, and @danforbes thanks for the investigation.

Gauthamastro commented 3 years ago

Thank you, @jimmychu0807. As per @danforbes's recommendation, I am upgrading the Polkadex node to Substrate v2.0.0 and using substrate-subxt to submit the transactions. If it is a problem with the Polkadot-JS API, then the issue will not appear with subxt. I will post an update here.

Gauthamastro commented 3 years ago

The validator receiving the load from substrate-subxt:

2020-11-27 15:16:18.009 tokio-runtime-worker INFO sc_consensus_slots  🔖 Pre-sealed block for proposal at 49. Hash now 0xe110f5795c3bbda7a79dd8a3926cd11994c69c667147d924f4e364905533a2a0, previously 0x8b4feb79d03e523451d163687699e873db8da3dcb42df2a173e4e4a3fd7fbdb0.
2020-11-27 15:16:18.010 tokio-runtime-worker INFO substrate  ✨ Imported #49 (0xe110…a2a0)
2020-11-27 15:16:21.642 tokio-runtime-worker INFO substrate  💤 Idle (1 peers), best: #49 (0xe110…a2a0), finalized #47 (0xbc8d…3f8b), ⬇ 0.8kiB/s ⬆ 3.7kiB/s
2020-11-27 15:16:26.642 tokio-runtime-worker INFO substrate  💤 Idle (1 peers), best: #49 (0xe110…a2a0), finalized #47 (0xbc8d…3f8b), ⬇ 1.1kiB/s ⬆ 27.1kiB/s
2020-11-27 15:16:30.325 tokio-runtime-worker INFO substrate  ✨ Imported #50 (0x38c7…c9e4)
2020-11-27 15:16:30.330 tokio-runtime-worker INFO sc_basic_authorship::basic_authorship  🙌 Starting consensus session on top of parent 0xe110f5795c3bbda7a79dd8a3926cd11994c69c667147d924f4e364905533a2a0
2020-11-27 15:16:30.585 tokio-blocking-driver WARN txpool  [0xb4ebe80aa75f76a459bb6fe49e681bca6294b28000146b094a821043d6f408d8] Extrinsic invalid
2020-11-27 15:16:30.585 tokio-blocking-driver WARN txpool  [0x397977a06e28040fba41406b220987ea81c921106b8dba20b8c48b2e2b9beebe] Extrinsic invalid
2020-11-27 15:16:30.585 tokio-blocking-driver WARN txpool  [0xa534e6e46d365d58e97e4e302d1f445320a4eaf581b3686de9be5f6c73a8152b] Extrinsic invalid
2020-11-27 15:16:30.585 tokio-blocking-driver WARN txpool  [0x2404fe918e406dafa0b1b12ac7e76026253d3a25fc4891bdad680ab01843e75f] Extrinsic invalid

From the 2nd validator in the network:

 INFO sc_basic_authorship::basic_authorship  🙌 Starting consensus session on top of parent 0xe110f5795c3bbda7a79dd8a3926cd11994c69c667147d924f4e364905533a2a0
 INFO sc_basic_authorship::basic_authorship  🎁 Prepared block for proposing at 50 [hash: 0xe84eae3cabe44faf3063eb8b38af0b4d9e753c6905f91324811aab769df428e5; 
parent_hash: 0xe110…a2a0; extrinsics (681): [0x658f…4426, 0xbcdf…9e75, 0xe08a…40e9, 0xdff6…9e8b, 0xda8d…0e39, 0xcdb7…b823, 0xc78b…de8e, 0x27fa…a7cb, 0x315f…fa22, 0x7e1c…34c7, 0x9d8b…7c30, 0xa11a…9880, 0x5951…60f0, 0xd6d4…e67a, 

The block time is 6 seconds. I also noted that when I tested with a plain "cargo build" debug binary, "block builder failed to create the block in time" errors appeared as well.

Gauthamastro commented 3 years ago

Does anyone know why and when an Extrinsic is considered invalid?

stiiifff commented 3 years ago

@Gauthamastro I had a look at your runtime.

Block time: 2 seconds
MaximumBlockWeight: 2/3 second (~0.667 s)

In principle, this follows the rule of thumb of block weight < (block time / 2). Still, these are rather short values for block time and max block weight. Remember, a whole lot has to happen within the time expressed by the max block weight (block authoring, validating, network propagation / syncing).

Below is a weight breakdown for a 6-second target block time and 2-second max block weight:

[image: weight breakdown diagram]

[image: your runtime's parameters]

As a comparison, see Polkadot & Kusama's runtime parameters: https://guide.kusama.network/docs/en/kusama-parameters and https://wiki.polkadot.network/docs/en/maintain-polkadot-parameters

Could you re-try your test case with a 6-second target block time and a 2-second maximum block weight (same as Polkadot & Kusama)? Then, assuming transactions are being processed properly, and only then, tune those parameters down with careful testing at every step.
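As a quick sanity check, here is a minimal Polkadot-JS sketch that reads back the block-time and weight constants the node is actually running with (an assumption-laden sketch: it presumes a Substrate 2.0-era runtime exposing maximumBlockWeight; later runtimes expose api.consts.system.blockWeights instead):

    // Read the runtime's block-time and weight constants from a running node
    // to confirm what the chain is actually configured with.
    const { ApiPromise, WsProvider } = require('@polkadot/api');

    const main = async () => {
        const api = await ApiPromise.create({ provider: new WsProvider('ws://localhost:9944') });

        // BABE's target block time, in milliseconds.
        console.log('expectedBlockTime:', api.consts.babe.expectedBlockTime.toString());

        // Maximum block weight (constant name differs across Substrate versions).
        const maxWeight = api.consts.system.maximumBlockWeight   // Substrate 2.0-era
            || api.consts.system.blockWeights;                   // later runtimes
        console.log('max block weight:', maxWeight.toString());
    };

    main().catch(console.error);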

Gauthamastro commented 3 years ago

Hi @stiiifff, thanks for your review. I have tested it with 2-, 3-, 6-, and even 12-second block times following the same rule, but the error is still there. I also tried cloning the latest substrate node template (Aura) with a 2-second block time; it works! At least it didn't fail for the last 10 hrs (~700K transactions).

I am not exactly sure whether it's a problem with BABE or something else is wrong.

danforbes commented 3 years ago

@jimmychu0807 have you been able to investigate this issue further? Do you have any suggestions for other people that we could involve if this is, in fact, an important issue?

bkchr commented 3 years ago

Did I read correctly that the node is still using pre-2.0 Substrate? How old?

We had some fixes for the transaction pool, which may fix the problems described here.

Gauthamastro commented 3 years ago

@bkchr Yes, I have updated the code to Substrate 2.0. Is it possible to get a Substrate node template with BABE installed, to check whether I did something wrong converting Aura to BABE while developing Polkadex?

bkchr commented 3 years ago

The default Substrate node is using BABE.

Gauthamastro commented 3 years ago

@bkchr I didn't understand. I am not aware of any BABE-based Substrate node template; this one uses Aura.

Are you referring to this one?

bkchr commented 3 years ago

Are you referring to this one?

Yes

Gauthamastro commented 3 years ago

Are you referring to this one?

Yes

Sure, I will try that.

danforbes commented 3 years ago

Hi, @Gauthamastro - do you have an update on this Issue?

Gauthamastro commented 3 years ago

@danforbes Unfortunately, I couldn't find time to work on this issue. I will post an update here by the weekend. I am closing this issue for now.

jimmychu0807 commented 3 years ago

May I keep this issue open for another two weeks? This issue is actually queued in my todo list to investigate...

danforbes commented 3 years ago

Thank you, @jimmychu0807 🙏🏻

Gauthamastro commented 3 years ago

@jimmychu0807 @danforbes @bkchr I have tried with this. The issue is still there. I tried 3 s and 6 s block times; neither worked.

The code I used to test: Polkadex Node (Substrate master), and the load-generator script.

Gauthamastro commented 3 years ago

I have reduced the number of transactions and it was working, but as soon as the TPS was increased, nodes started crashing.

The original load was around 20-30 TPS, but anything under 10 TPS worked fine. Also, note that this issue is not present with Aura.

Gauthamastro commented 3 years ago

@jimmychu0807 @danforbes Are there any updates on this?

jimmychu0807 commented 3 years ago

@Gauthamastro I am going through this issue today. Sorry for my late reply.

Gauthamastro commented 3 years ago

@Gauthamastro I am going through this issue today. Sorry for my late reply.

Thanks, @jimmychu0807. I have updated Polkadex to track Substrate master. You can access the code we used for testing here. Similarly, the script used for load generation is here.

There is a chance that the node might panic; that issue is being tracked here: https://github.com/paritytech/substrate/issues/7794.

Please use --no-mdns to avoid the panic.

jimmychu0807 commented 3 years ago

@gautamdhameja I wrote a small script and tested submitting balances.transfer txs to a Substrate node at various frequencies. I am using the Substrate node, so it uses BABE, not substrate-node-template.

I find that even if I tweak it up to 200 tx/s (practically it is about 30 tx/s unless I add batched-tx logic, as sketched below), Substrate gladly accepts them, puts them in blocks, and finalizes them.
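For illustration, a minimal sketch of that batched-tx logic (assuming the runtime includes the utility pallet, as the default Substrate node does): packing several transfers into one signed extrinsic raises the effective transfer rate without raising the signed-transaction rate.

    // Pack several balances.transfer calls into one signed extrinsic via
    // utility.batch, so a single nonce covers many transfers.
    const batchTransfers = (api, sender, destAddress, amount, count, nonce) => {
        const calls = [];
        for (let i = 0; i < count; i += 1) {
            calls.push(api.tx.balances.transfer(destAddress, amount));
        }
        // The calls are executed in order inside the one extrinsic.
        return api.tx.utility.batch(calls).signAndSend(sender, { nonce });
    };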

Could you try it first with balances.transfer, and then switch over to your pallet extrinsic and send txs to Substrate? If it fails, then maybe it is something specific to the pallet extrinsic.

Let me know how it goes. Thanks

jimmychu0807 commented 3 years ago

@Gauthamastro Rereading your msg: when you say you run in local_testnet mode, do you start Substrate as ./target/<debug, release>/substrate --tmp, like this? Otherwise, please paste the command and the chain_spec.json showing how you start the node. Thanks.

Gauthamastro commented 3 years ago

@jimmychu0807 I will try again with the new script.

@Gauthamastro Rereading your msg: when you say you run in local_testnet mode, do you start Substrate as ./target/<debug, release>/substrate --tmp, like this? Otherwise, please paste the command and the chain_spec.json showing how you start the node. Thanks.

../target/release/polkadex-mainnet \
  --base-path /tmp/alice \
  --chain customSpecRaw.json \
  --alice \
  --port 30333 \
  --ws-port 9944 \
  --node-key 0000000000000000000000000000000000000000000000000000000000000001 \
  --validator

Also, the chain spec is created using the following commands:

../target/release/polkadex-mainnet build-spec --disable-default-bootnode --chain local > customSpec.json
../target/release/polkadex-mainnet build-spec --chain=customSpec.json --raw --disable-default-bootnode > customSpecRaw.json

jimmychu0807 commented 3 years ago

@Gauthamastro I tried a 3-node setup running with --chain local, using the substrate master branch (this version). I was able to use my txsubmitter to send about 20-30 tx/s (~720 balances.transfer txs in 40 s) and it ran fine. Let me know how it goes on your side (first try with balances.transfer and then your own pallet call).

Gauthamastro commented 3 years ago

@Gauthamastro I tried a 3-node setup running with --chain local, using the substrate master branch (this version). I was able to use my txsubmitter to send about 20-30 tx/s (~720 balances.transfer txs in 40 s) and it ran fine. Let me know how it goes on your side (first try with balances.transfer and then your own pallet call).

Hey @jimmychu0807, sorry for being late. We tried your script; it works because it stops loading the blockchain with new transactions after the given time period.

We tried replacing the Polkadex pallet with the Balances pallet for load testing, and it failed again like before.

let alice_nonce = 0;
binance.websockets.trades(['BTCUSDT'], (trade) => {
    // We had the Polkadex submitOrder call here earlier, but replaced it
    // with a Balances transfer and it still fails.
    api.tx.balances.transfer(bob.address, 123).signAndSend(alice, { nonce: alice_nonce });
    alice_nonce = alice_nonce + 1;
});

Here is a short screen capture of it failing.

https://user-images.githubusercontent.com/17110878/107154466-13523700-6999-11eb-8515-6fd5562be6b6.mp4

Gauthamastro commented 3 years ago

@jimmychu0807 @bkchr @danforbes Also, please note that this issue is not present with Aura, using the same configuration, for both the Polkadex and Balances pallets. It looks super weird to me. Please let me know how we can approach and fix this issue!

bkchr commented 3 years ago

@Gauthamastro can you check the log if it contains "reorg" around the time it starts to reject extrinsics?

Gauthamastro commented 3 years ago

@Gauthamastro can you check the log if it contains "reorg" around the time it starts to reject extrinsics?

No, it didn't do a chain re-org. Here are screenshots of the logs of the two failing validators (out of 3) just before the failure.

Validator receiving the load: [Screenshot from 2021-02-09 11-55-30]

Another validator that failed at almost the same time as the first: [Screenshot from 2021-02-09 11-54-15]

I modified the script provided by @jimmychu0807. Here it is:

// Initialize Binance
const Binance = require('node-binance-api');
const binance = new Binance().options({
    APIKEY: '<key>',
    APISECRET: '<secret>'
});
const { ApiPromise, WsProvider, Keyring } = require('@polkadot/api');
// Substrate connection config
const WEB_SOCKET = 'ws://localhost:9944';
const TYPES = {};
// This script will wait for n secs before stopping itself
const LASTING_SECS = 30;
const ALICE = '//Alice';
const BOB = '//Bob';
// Transactions sent per second; the minimum is 1 tx/s
const TX_FREQUENCY = 20;
// This is 1 Unit
const TX_AMT = 1000000000000000;
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
const connectSubstrate = async () => {
  const wsProvider = new WsProvider(WEB_SOCKET);
  const api = await ApiPromise.create({ provider: wsProvider, types: TYPES });
  return api;
};
// This function returns a tx unsubscription handler
const submitTx = async (api, src, dest, amt, txCnt, nonce) =>
    await api.tx.balances.transfer(dest.address, amt)
        .signAndSend(src, { nonce }, res => {
            console.log(`Tx ${txCnt} status: ${res.status}`);
        });
const main = async () => {
    const api = await connectSubstrate();
    const keyring = new Keyring({ type: 'sr25519' });
    console.log('Connected to Substrate');
    let txCnt = 0;
    let nonce = 0;
    const alice = keyring.addFromUri(ALICE);
    const bob = keyring.addFromUri(BOB);
//  setInterval(() => {
//      txCnt += 1;
//      submitTx(api, alice, bob, TX_AMT, txCnt, nonce);
//      nonce += 1;
//  }, 1000/TX_FREQUENCY);
//  await sleep(LASTING_SECS * 1000);
    let alice_nonce = 0;
    binance.websockets.trades(['BTCUSDT'], (trade) => {
        api.tx.balances.transfer(bob.address, 123).signAndSend(alice, {nonce: alice_nonce});
        alice_nonce = alice_nonce + 1;
    });
    await sleep(LASTING_SECS * 1000000);
};
main()
    .then(() => {
        console.log("successfully exited");
        process.exit(0);
    })
    .catch(err => {
        console.log('error occur:', err);
        process.exit(1);
    })

bkchr commented 3 years ago

Validator Receiving the load

This log stops directly at the relevant point. As you can see, it imports blocks 51 and 52 a second time, probably built by someone else. That a reorg happened isn't printed directly. I think we are doing a reorg and hitting this: https://github.com/paritytech/substrate/issues/5139

Gauthamastro commented 3 years ago

Validator Receiving the load

This log stops directly at the relevant point. As you can see, it imports blocks 51 and 52 a second time, probably built by someone else. That a reorg happened isn't printed directly. I think we are doing a reorg and hitting this: paritytech/substrate#5139

Ohh yeah, I missed that reorg. I was looking for a reorg log line to be printed. Since https://github.com/paritytech/substrate/issues/5139 is not fixed yet, is there any solution other than changing the way we load the blockchain with transactions?

I think sending the transactions from multiple accounts can solve this issue, e.g. something like the sketch below. Please correct me if I am wrong.
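A minimal sketch of that workaround, reusing api, keyring, binance, and bob from the script above (the account URIs are the usual dev keys; any set of funded accounts works): round-robin over several senders so a reorg-induced nonce gap on one account does not stall the whole stream.

    // Spread the load across several sender accounts so a dropped
    // transaction (and the resulting nonce gap) on one account does not
    // strand every subsequent transaction as Future.
    const senders = ['//Alice', '//Bob', '//Charlie', '//Dave']
        .map(uri => ({ key: keyring.addFromUri(uri), nonce: 0 }));

    let i = 0;
    binance.websockets.trades(['BTCUSDT'], (trade) => {
        const sender = senders[i % senders.length];
        i += 1;
        api.tx.balances.transfer(bob.address, 123)
            .signAndSend(sender.key, { nonce: sender.nonce });
        sender.nonce += 1;
    });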

bkchr commented 3 years ago

If you send it from multiple accounts, it should work for now.

Gauthamastro commented 3 years ago

If you send it from multiple accounts, it should work for now.

Alright. Thanks!

nuke-web3 commented 3 years ago

This is a great resource to reference in the future. But I think it's OK to close? Please reopen if you are still actively working on it 🙏🏼