mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0

Fluffing a transaction with a large number of outputs #1325

Closed jaspervdm closed 5 years ago

jaspervdm commented 6 years ago

I have been experimenting with large transactions for a bit, and I noticed that a transaction with a large number of outputs doesn't seem to fluff successfully. For reference, I first tried a normal 1->2 transaction, which was mined quickly after submitting. Here is what the logs look like:

Aug 07 21:23:52.737 INFO Pushing transaction with 1 inputs and 2 outputs to pool.
Aug 07 21:23:52.737 DEBG pool: add_to_pool: cdef9cce, kernels - 1, stem? false
Aug 07 21:23:53.025 DEBG pool [txpool]: add_to_pool: cdef9cce, TxSource { debug_name: "push-api", identifier: "?.?.?.?" }
Aug 07 21:23:53.031 DEBG Send tx cdef9cce to 45.61.156.122:13414
Aug 07 21:23:53.031 DEBG Send tx cdef9cce to 185.53.158.12:13414
Aug 07 21:23:53.031 DEBG Send tx cdef9cce to 195.201.38.140:13414
Aug 07 21:23:53.032 DEBG Send tx cdef9cce to 178.62.100.159:13414
Aug 07 21:23:53.032 DEBG Send tx cdef9cce to 87.117.38.161:13414
Aug 07 21:23:53.032 DEBG Send tx cdef9cce to 128.199.165.106:13414
Aug 07 21:23:53.033 DEBG Send tx cdef9cce to 108.196.200.233:13414
Aug 07 21:23:53.033 DEBG Send tx cdef9cce to 94.130.229.193:13414
Aug 07 21:23:53.229 DEBG Received tx cdef9cce from TxSource { debug_name: "p2p", identifier: "?.?.?.?" }, going to process.
Aug 07 21:23:53.230 DEBG pool: add_to_pool: cdef9cce, kernels - 1, stem? false
Aug 07 21:23:53.336 DEBG Received tx cdef9cce from TxSource { debug_name: "p2p", identifier: "?.?.?.?" }, going to process.
Aug 07 21:23:53.349 DEBG Received tx cdef9cce from TxSource { debug_name: "p2p", identifier: "?.?.?.?" }, going to process.
Aug 07 21:23:53.350 DEBG Received tx cdef9cce from TxSource { debug_name: "p2p", identifier: "?.?.?.?" }, going to process.
Aug 07 21:23:53.351 DEBG Received tx cdef9cce from TxSource { debug_name: "p2p", identifier: "?.?.?.?" }, going to process.
Aug 07 21:23:53.515 DEBG pool [txpool]: add_to_pool: cdef9cce, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }
Aug 07 21:23:53.520 DEBG Transaction cdef9cce rejected: DuplicateCommitment
Aug 07 21:23:53.520 DEBG pool: add_to_pool: cdef9cce, kernels - 1, stem? false
Aug 07 21:23:53.752 DEBG pool [txpool]: add_to_pool: cdef9cce, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }
Aug 07 21:23:53.757 DEBG Transaction cdef9cce rejected: DuplicateCommitment
Aug 07 21:23:53.757 DEBG pool: add_to_pool: cdef9cce, kernels - 1, stem? false
Aug 07 21:23:53.989 DEBG pool [txpool]: add_to_pool: cdef9cce, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }
Aug 07 21:23:53.994 DEBG Transaction cdef9cce rejected: DuplicateCommitment
Aug 07 21:23:53.994 DEBG pool: add_to_pool: cdef9cce, kernels - 1, stem? false
Aug 07 21:23:54.227 DEBG pool [txpool]: add_to_pool: cdef9cce, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }
Aug 07 21:23:54.231 DEBG Transaction cdef9cce rejected: DuplicateCommitment
Aug 07 21:23:54.232 DEBG pool: add_to_pool: cdef9cce, kernels - 1, stem? false
Aug 07 21:23:54.462 DEBG pool [txpool]: add_to_pool: cdef9cce, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }
Aug 07 21:23:54.466 DEBG Transaction cdef9cce rejected: DuplicateCommitment

You can see that after sending the tx to the other nodes, we receive the same tx multiple times, as expected. Now compare this to when I send a tx with 332 outputs:

Aug 07 21:36:15.108 INFO Pushing transaction with 1 inputs and 332 outputs to pool.
Aug 07 21:36:15.120 DEBG pool: add_to_pool: b10f5544, kernels - 1, stem? false
Aug 07 21:36:52.605 DEBG pool [txpool]: add_to_pool: b10f5544, TxSource { debug_name: "push-api", identifier: "?.?.?.?" }
Aug 07 21:36:53.189 DEBG Send tx b10f5544 to 104.209.194.130:13414
Aug 07 21:36:53.231 DEBG Send tx b10f5544 to 165.227.109.134:13414
Aug 07 21:36:53.269 DEBG Send tx b10f5544 to 198.245.50.26:13414
Aug 07 21:36:53.303 DEBG Send tx b10f5544 to 178.62.100.159:13414
Aug 07 21:36:53.338 DEBG Send tx b10f5544 to 87.122.44.225:13414
Aug 07 21:36:53.372 DEBG Send tx b10f5544 to 95.90.125.204:13414
Aug 07 21:36:53.407 DEBG Send tx b10f5544 to 89.101.91.246:13414
Aug 07 21:36:53.442 DEBG Send tx b10f5544 to 66.23.200.117:13414

The transaction verifies locally, but after sending it to a bunch of nodes, none of them ever send it back. I think this is why it never gets mined: for some reason the other nodes don't accept the transaction. The only way I have found to get the tx into a block is to mine it myself.

I asked @ignopeverell and he suspected the transaction might hit some size limit, but he said everything below 2MB should be fine. The actual transaction is <500kB in size.

The issue can be reproduced by POSTing the not-yet-mined transaction (which can be found here) to your own node at /v1/pool/push?fluff
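For reference, a minimal sketch of what that POST might look like. This assumes the transaction is available locally as a hex string in a file, that the node's API listens on 127.0.0.1:13413, and that the endpoint accepts a JSON body of the form {"tx_hex": "..."}; the file name, port and body shape are assumptions for illustration, not taken from this issue. It uses the reqwest (with the "blocking" and "json" features) and serde_json crates:

// Push a hex-encoded transaction to a node's pool endpoint (sketch only).
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file containing the not-yet-mined transaction as hex.
    let tx_hex = std::fs::read_to_string("tx.hex")?.trim().to_string();

    let client = reqwest::blocking::Client::new();
    let resp = client
        .post("http://127.0.0.1:13413/v1/pool/push?fluff")
        .json(&serde_json::json!({ "tx_hex": tx_hex }))
        .send()?;

    println!("node responded with {}", resp.status());
    Ok(())
}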

quentinlesceller commented 6 years ago

I think I have experienced this too. Will take a look.

antiochp commented 6 years ago

Just out of interest (unrelated to this issue) how are you constructing a tx with many outputs like this? Modified tx creation code in your local wallet?

Asking because there are many non-standard things it would be nice to be able to script up (large txs being one of them, timelocked txs being another example) that don't really fit cleanly into the "regular" wallet command line tool. And I'm thinking a bit about what the best way of doing this might be.

One (bad) option would be to add a "create n outputs" flag to grin wallet send, but this feels wrong and the flags would quickly become unmanageable.

jaspervdm commented 6 years ago

As an experiment, I recreated the tx building process in Python, using libsecp256k1-zkp.

quentinlesceller commented 6 years ago

@jaspervdm After extensive testing I don't see any problem with the current propagation. Testing locally with 3 nodes, the transaction seems to get relayed properly.

Configuration 1:

Node 1 -> Node 2 -> Node 3

Node 1:

Aug 08 10:30:33.589 DEBG pool: add_to_pool: b10f5544, kernels - 1, stem? false
Aug 08 10:30:44.639 DEBG pool [txpool]: add_to_pool: b10f5544, TxSource { debug_name: "push-api", identifier: "?.?.?.?" }
Aug 08 10:30:44.787 DEBG Send tx b10f5544 to 127.0.0.1:13414

Node 2:

Aug 08 10:30:44.793 DEBG Received tx b10f5544 from TxSource { debug_name: "p2p", identifier: "?.?.?.?" }, going to process.
Aug 08 10:30:44.794 DEBG pool: add_to_pool: b10f5544, kernels - 1, stem? false
Aug 08 10:30:52.996 DEBG handle_payload: CompactBlock: 0416aea1
Aug 08 10:30:52.996 DEBG Received compact_block 0416aea1 at 42773 from 127.0.0.1:23414, going to process.
Aug 08 10:30:52.996 DEBG pipe: process_block 0416aea1 at 42773 with 0 inputs, 1 outputs, 1 kernels
Aug 08 10:30:53.372 DEBG pipe: chain head 0416aea1 @ 42773
Aug 08 10:30:53.373 DEBG adapter: block_accepted: 0416aea1
Aug 08 10:30:55.697 DEBG pool [txpool]: add_to_pool: b10f5544, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }
Aug 08 10:30:55.852 DEBG Send tx b10f5544 to 127.0.0.1:23414
Aug 08 10:30:55.853 DEBG Not sending tx b10f5544 to 127.0.0.1:33414 (already seen)

Node 3:

Aug 08 10:30:56.002 DEBG Received tx b10f5544 from TxSource { debug_name: "p2p", identifier: "?.?.?.?" }, going to process.
Aug 08 10:30:56.008 DEBG pool: add_to_pool: b10f5544, kernels - 1, stem? false
Aug 08 10:30:59.223 DEBG handle_payload: CompactBlock: 05c86ec0
Aug 08 10:30:59.223 DEBG Received compact_block 05c86ec0 at 42776 from 198.245.50.26:13414, going to process.
Aug 08 10:30:59.224 DEBG pipe: process_block 05c86ec0 at 42776 with 0 inputs, 1 outputs, 1 kernels
Aug 08 10:30:59.618 DEBG pipe: chain head 05c86ec0 @ 42776
Aug 08 10:30:59.619 DEBG adapter: block_accepted: 05c86ec0
Aug 08 10:31:06.715 DEBG pool [txpool]: add_to_pool: b10f5544, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }
Aug 08 10:31:06.855 DEBG Send tx b10f5544 to 109.74.202.16:13414
Aug 08 10:31:06.856 DEBG Send tx b10f5544 to 198.245.50.26:13414
Aug 08 10:31:06.857 DEBG Send tx b10f5544 to 167.99.228.15:13414
Aug 08 10:31:06.858 DEBG Not sending tx b10f5544 to 127.0.0.1:13414 (already seen)

Configuration 2:

Node 1 -> Node 2 -> Node 3 and Node 1 -> Node 3. Similar to configuration 1, except that Node 3 has:

Aug 08 10:43:27.189 DEBG pool [txpool]: add_to_pool: b10f5544, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }
Aug 08 10:43:27.194 DEBG Transaction b10f5544 rejected: DuplicateCommitment

Maybe it's something related to the mining process. Continuing the investigation.

quentinlesceller commented 6 years ago

There is indeed a broadcasting error which is not reproducible on a local network but happens with a remote node. When node 3 sends the transaction to remote node 4, this is what happens:

Aug 08 15:10:06.019 INFO Client X.X.X.X:23414 corrupted, ban (CorruptedData).

antiochp commented 6 years ago

Is this related to #1327 by any chance?

https://github.com/mimblewimble/grin/blob/328d832bd6e3ac33cf4c160c4d78085dafb20adf/core/src/core/transaction.rs#L302-L307

We have a serialization/deserialization rule around the max number of inputs/outputs, but there is no corresponding tx validation rule to match.

This would trigger the CorruptedData that we're seeing here if we exceeded 500 outputs.
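For illustration only, a matching check at validation time might look roughly like the sketch below. The constant values, names and error variants are placeholders (the 500-output figure is the one mentioned above), not the actual grin validation code:

// Body-level rule mirroring the serialization cap, so an oversized tx is
// rejected cleanly by validation instead of surfacing as CorruptedData
// during deserialization. All names and limits here are placeholders.
const MAX_TX_INPUTS: usize = 500;  // assumed for illustration
const MAX_TX_OUTPUTS: usize = 500; // the cap referred to above
const MAX_TX_KERNELS: usize = 16;  // assumed for illustration

#[derive(Debug)]
enum TxError {
    TooManyInputs,
    TooManyOutputs,
    TooManyKernels,
}

fn verify_size_limits(inputs: usize, outputs: usize, kernels: usize) -> Result<(), TxError> {
    if inputs > MAX_TX_INPUTS {
        return Err(TxError::TooManyInputs);
    }
    if outputs > MAX_TX_OUTPUTS {
        return Err(TxError::TooManyOutputs);
    }
    if kernels > MAX_TX_KERNELS {
        return Err(TxError::TooManyKernels);
    }
    Ok(())
}

fn main() {
    // The 1-input, 332-output tx from this issue would pass such a check.
    println!("{:?}", verify_size_limits(1, 332, 1));
}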

quentinlesceller commented 6 years ago

@antiochp It's indeed a deserialization error, triggered here: https://github.com/mimblewimble/grin/blob/master/p2p/src/protocol.rs#L86. But in this case we have 1 input and 332 outputs, which makes it extremely weird that it works on a local net and not with a remote node.

quentinlesceller commented 6 years ago

Okay, after going pretty deep, I've found that there is some kind of corruption during transport. For this transaction the error happens exactly here: https://github.com/mimblewimble/grin/blob/328d832bd6e3ac33cf4c160c4d78085dafb20adf/core/src/core/transaction.rs#L777

I managed to get the hex dump of the corrupted transaction.

Original transaction: original.txt

Sent transaction: sent.txt

Received transaction: received.txt

There is clearly a difference there; not sure why it happens though.

quentinlesceller commented 6 years ago

The first difference is from index 264925 to 265024.

Sent:

42a14c572c2bb1a7e3d4f0ba528f1ca80c69f2200ca721caeac52df18ab996a66536b86c1bd8535a3bd89a4fe07fb5b2332

Received:

42a14c572c2bb11ec50f000000000003a2a878c537c95f587dbfea7f0300addc9b8ebddfd997ca5289ba771158f82e71163
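For anyone comparing the dumps themselves, here is a small standalone sketch that reports the first index at which two hex dumps diverge and prints a 100-character window around it (sent.txt and received.txt refer to the attachments above):

// Find the first differing position between two hex dumps.
use std::fs;

fn main() -> std::io::Result<()> {
    let a = fs::read_to_string("sent.txt")?.trim().to_string();
    let b = fs::read_to_string("received.txt")?.trim().to_string();

    match a.bytes().zip(b.bytes()).position(|(x, y)| x != y) {
        Some(i) => {
            println!("first difference at hex-string index {}", i);
            println!("sent:     {}", &a[i..(i + 100).min(a.len())]);
            println!("received: {}", &b[i..(i + 100).min(b.len())]);
        }
        None => println!(
            "no difference in the common prefix (lengths: {} vs {})",
            a.len(),
            b.len()
        ),
    }
    Ok(())
}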

antiochp commented 6 years ago

Done some digging as well. We can now reliably and reproducibly generate txs that cause issues via the "multiple change outputs" arg -o 175 for grin wallet send.

Added some additional logging locally.

This is after creating a tx with 6 inputs and 176 outputs. The logs suggest it started reading the vec of outputs and then failed (seemingly arbitrarily) on the 109th one. It's not an ordering/sorting issue - we are not successfully deserializing all 176 outputs.

Aug 21 10:21:31.049 DEBG handle_payload: received tx: msg_len: 126566
Aug 21 10:21:31.225 ERRO read_and_verify_sorted: failed to read item 109, CorruptedData
Aug 21 10:21:31.225 ERRO tx body: read: outputs: CorruptedData

Edit: it is interesting because this tx was actually successfully processed by a mining node and was included in the next block -

https://grinscan.net/block/61047

But for whatever reason my node failed to deserialize it. So some nodes are handling these fine and others are failing.
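For context on why the failing index looks arbitrary: the outputs are read sequentially out of the message body, so a corrupted byte anywhere in the stream shows up at whichever item happens to contain it, and different corruption means a different failing index. A rough standalone sketch of that read-and-verify-sorted shape (placeholder item type and error, not the actual grin reader):

#[derive(Debug)]
enum ReadError {
    CorruptedData,
}

// Placeholder "item": an 8-byte big-endian id. The real reader deserializes
// full Output structs, but the failure mode has the same shape.
fn read_item(buf: &mut &[u8]) -> Result<u64, ReadError> {
    if buf.len() < 8 {
        return Err(ReadError::CorruptedData);
    }
    let (head, rest) = buf.split_at(8);
    *buf = rest;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(head);
    Ok(u64::from_be_bytes(bytes))
}

fn read_and_verify_sorted(buf: &mut &[u8], count: usize) -> Result<Vec<u64>, ReadError> {
    let mut items = Vec::with_capacity(count);
    for i in 0..count {
        // A flipped or missing byte makes this read fail at item i, matching
        // the "failed to read item N" lines above.
        let item = read_item(buf).map_err(|e| {
            eprintln!("read_and_verify_sorted: failed to read item {}, {:?}", i, e);
            ReadError::CorruptedData
        })?;
        // Items are expected in sorted order; out-of-order also counts as corrupt.
        if items.last().map_or(false, |prev| *prev > item) {
            return Err(ReadError::CorruptedData);
        }
        items.push(item);
    }
    Ok(items)
}

fn main() {
    // Three well-formed items, then ask for a fourth that isn't there.
    let bytes: Vec<u8> = [1u64, 2, 3].iter().flat_map(|n| n.to_be_bytes()).collect();
    let mut stream = &bytes[..];
    println!("{:?}", read_and_verify_sorted(&mut stream, 4));
}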

antiochp commented 6 years ago

This is a subsequent tx. Again deserialization of the outputs failed at an arbitrary output in the vec -

Aug 21 10:30:20.555 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:30:20.757 ERRO read_and_verify_sorted: failed to read item 101, CorruptedData
Aug 21 10:30:20.757 ERRO tx body: read: outputs: CorruptedData
Aug 21 10:30:23.558 INFO Client 195.201.38.140:13414 corrupted, ban (CorruptedData).
Aug 21 10:30:32.660 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:30:33.509 ERRO read_and_verify_sorted: failed to read item 101, CorruptedData
Aug 21 10:30:33.509 ERRO tx body: read: outputs: CorruptedData
Aug 21 10:30:33.560 INFO Client 206.189.88.220:13414 corrupted, ban (CorruptedData).
Aug 21 10:30:33.836 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:30:34.587 ERRO read_and_verify_sorted: failed to read item 151, CorruptedData
Aug 21 10:30:34.587 ERRO tx body: read: outputs: CorruptedData

We see this same tx from various peers, and it fails at a different output the third time we see it...

antiochp commented 6 years ago

Aug 21 10:57:01.195 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:57:01.499 ERRO read_and_verify_sorted: failed to read item 86, CorruptedData
Aug 21 10:57:01.499 ERRO tx body: read: outputs: CorruptedData
Aug 21 10:57:03.849 INFO Client 167.99.87.54:13414 corrupted, ban (CorruptedData).
Aug 21 10:57:08.261 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:57:10.721 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:57:12.570 ERRO read_and_verify_sorted: failed to read item 86, CorruptedData
Aug 21 10:57:12.570 ERRO tx body: read: outputs: CorruptedData
Aug 21 10:57:12.726 DEBG Received tx 4c23be57, inputs: 176, outputs: 176, kernels: 1, going to process.
Aug 21 10:57:13.850 INFO Client 87.117.38.161:13414 corrupted, ban (CorruptedData).
Aug 21 10:57:15.476 DEBG pool [txpool]: add_to_pool: 4c23be57, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }, inputs: 176, outputs: 176, kernels: 1 (at block 039a7cad)

Same tx failing deserialization multiple times from different peers, then being successfully processed from a different peer...

antiochp commented 6 years ago

So I think I reproduced this locally with a couple of nodes on 127.0.0.1 talking to each other. Same build, same env, same OS (obviously).

Sent a tx with 500 inputs and 1002 outputs.

Node1 -

Aug 23 22:25:04.313 DEBG pool [txpool]: add_to_pool: e654f6cf, TxSource { debug_name: "push-api", identifier: "?.?.?.?" }, inputs: 500, outputs: 1002, kernels: 1 (at block 08ecaa4d)
Aug 23 22:25:04.324 DEBG Send tx e654f6cf to 127.0.0.1:14414

Node2 -

Aug 23 22:25:04.326 DEBG handle_payload: received tx: msg_len: 735604
Aug 23 22:25:04.374 INFO Client 127.0.0.1:13414 corrupted, ban (CorruptedData).

yeastplume commented 6 years ago

I'm going to try and create an integration test in servers now to try and reproduce this in a single process. Should be very doable now.

yeastplume commented 6 years ago

@antiochp was your local test that reproduced it using dandelion?

yeastplume commented 6 years ago

I'm throwing in a PR https://github.com/mimblewimble/grin/pull/1415 with a test that reproduces the original issue very reliably.

In servers/tests/simulnet.rs there's a new test (which you need to un-#[ignore] to run) called replicate_tx_fluff_failure. It starts a few wallets and servers, mines a bit, sends a tx from one to the other, then checks that the recipient got it.

If you change num_change_outputs in the api.issue_send_tx call to a lowish number (up to about 300 on my machine, anyhow), the test will pass. If you increase it (try 500), the transaction never gets mined and the test fails... this happens with 100% consistency.

Also note it doesn't matter whether the transaction is fluffed or not (sending true or false into api.post_tx() doesn't make a difference).

Now, I haven't seen anything about corrupted transactions in the log... this may be something local, or there may be a few causes here. Would appreciate it if others could play with this test as well.

Also recommend adding:

logger.log_to_file = true;
logger.file_log_level = LogLevel::Debug;

into util/src/logger.rs:init_test_logger, because this gets spammy.

yeastplume commented 6 years ago

No... it just seems like the issue I was having in this test earlier was that I was not running the test long enough to give the transaction pool enough time to validate. I've updated the test to create 1000 outputs, stem the transaction and run for a few minutes... on my machine, on a release build, the test completes without issue consistently.

@quentinlesceller @antiochp Would you mind running the test as is on your machines to see if it completes as well? This test should pretty closely replicate what happens in the real world, the only difference being that everything is running under the same parent process (and things taking a bit longer as a result).

quentinlesceller commented 6 years ago

The test completes on my side. Prior to that, I was not able to replicate the issue on my local machine with several nodes, so I suspect it works fine locally. I observed the issue with one Mac node sending a large-outputs tx to an Ubuntu node; it fails 99% of the time. Not sure if this is OS related though...

quentinlesceller commented 5 years ago

Closing that one since it's probably not relevant anymore. Will reopen if needed.