Closed jaspervdm closed 5 years ago
I think I have experienced this too. Will take a look.
Just out of interest (unrelated to this issue) how are you constructing a tx with many outputs like this? Modified tx creation code in your local wallet?
Asking because there are many non-standard things it would be nice to be able to script up (large txs being one of them, timelocked txs being another) that don't really fit cleanly into the "regular" wallet command line tool. And I'm thinking a bit about what the best way of doing this might be.
One (bad) option would be to add a "create n outputs" flag to grin wallet send, but this feels wrong and the flags would quickly become unmanageable.
As an experiment I recreated the tx building process in Python, using libsecp256k1-zkp.
@jaspervdm After extensive testing I don't see any problem with the current propagation. Testing locally with 3 nodes, the transaction seems to get relayed properly.
Node 1 -> Node 2 -> Node 3
Node 1:
Aug 08 10:30:33.589 DEBG pool: add_to_pool: b10f5544, kernels - 1, stem? false
Aug 08 10:30:44.639 DEBG pool [txpool]: add_to_pool: b10f5544, TxSource { debug_name: "push-api", identifier: "?.?.?.?" }
Aug 08 10:30:44.787 DEBG Send tx b10f5544 to 127.0.0.1:13414
Node 2:
Aug 08 10:30:44.793 DEBG Received tx b10f5544 from TxSource { debug_name: "p2p", identifier: "?.?.?.?" }, going to process.
Aug 08 10:30:44.794 DEBG pool: add_to_pool: b10f5544, kernels - 1, stem? false
Aug 08 10:30:52.996 DEBG handle_payload: CompactBlock: 0416aea1
Aug 08 10:30:52.996 DEBG Received compact_block 0416aea1 at 42773 from 127.0.0.1:23414, going to process.
Aug 08 10:30:52.996 DEBG pipe: process_block 0416aea1 at 42773 with 0 inputs, 1 outputs, 1 kernels
Aug 08 10:30:53.372 DEBG pipe: chain head 0416aea1 @ 42773
Aug 08 10:30:53.373 DEBG adapter: block_accepted: 0416aea1
Aug 08 10:30:55.697 DEBG pool [txpool]: add_to_pool: b10f5544, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }
Aug 08 10:30:55.852 DEBG Send tx b10f5544 to 127.0.0.1:23414
Aug 08 10:30:55.853 DEBG Not sending tx b10f5544 to 127.0.0.1:33414 (already seen)
Node 3:
Aug 08 10:30:56.002 DEBG Received tx b10f5544 from TxSource { debug_name: "p2p", identifier: "?.?.?.?" }, going to process.
Aug 08 10:30:56.008 DEBG pool: add_to_pool: b10f5544, kernels - 1, stem? false
Aug 08 10:30:59.223 DEBG handle_payload: CompactBlock: 05c86ec0
Aug 08 10:30:59.223 DEBG Received compact_block 05c86ec0 at 42776 from 198.245.50.26:13414, going to process.
Aug 08 10:30:59.224 DEBG pipe: process_block 05c86ec0 at 42776 with 0 inputs, 1 outputs, 1 kernels
Aug 08 10:30:59.618 DEBG pipe: chain head 05c86ec0 @ 42776
Aug 08 10:30:59.619 DEBG adapter: block_accepted: 05c86ec0
Aug 08 10:31:06.715 DEBG pool [txpool]: add_to_pool: b10f5544, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }
Aug 08 10:31:06.855 DEBG Send tx b10f5544 to 109.74.202.16:13414
Aug 08 10:31:06.856 DEBG Send tx b10f5544 to 198.245.50.26:13414
Aug 08 10:31:06.857 DEBG Send tx b10f5544 to 167.99.228.15:13414
Aug 08 10:31:06.858 DEBG Not sending tx b10f5544 to 127.0.0.1:13414 (already seen)
Node 1 -> Node 2 -> Node 3 and Node 1 -> Node 3 are similar, except that Node 3 has:
Aug 08 10:43:27.189 DEBG pool [txpool]: add_to_pool: b10f5544, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }
Aug 08 10:43:27.194 DEBG Transaction b10f5544 rejected: DuplicateCommitment
Maybe it's something related to the mining process. Continuing the investigation.
There is indeed a broadcasting error which is not reproducible on a local network but happens with a remote node. When node 3 sends the transaction to remote node 4, this is what happens:
Aug 08 15:10:06.019 INFO Client X.X.X.X:23414 corrupted, ban (CorruptedData).
Is this related to #1327 by any chance?
We have a serialization/deserialization rule around the max number of inputs/outputs, but there is no corresponding tx validation rule to match. This would trigger the CorruptedData that we're seeing here if we exceeded 500 outputs.
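To make that asymmetry concrete, here's a minimal Rust sketch (hypothetical names and an assumed cap of 500, not the actual grin code): a tx can pass local pool validation yet fail to deserialize on a peer, because only the read side enforces the limit.

```rust
// Hypothetical sketch: a deserializer-side cap with no matching
// validation rule means a tx can be valid locally but unreadable remotely.
const MAX_OUTPUTS: usize = 500; // assumed cap, mirrors the 500 mentioned above

#[derive(Debug, PartialEq)]
enum SerError {
    CorruptedData,
}

// Read side: reject any outputs vec longer than the cap.
fn read_outputs(count: usize) -> Result<Vec<u64>, SerError> {
    if count > MAX_OUTPUTS {
        return Err(SerError::CorruptedData);
    }
    Ok((0..count as u64).collect())
}

// Validation side: no corresponding rule, so an oversized tx validates fine.
fn validate_tx(outputs: &[u64]) -> bool {
    !outputs.is_empty() // no output-count check here
}

fn main() {
    let tx: Vec<u64> = (0..501).collect();
    assert!(validate_tx(&tx)); // passes local validation...
    // ...but a peer deserializing it sees CorruptedData
    assert_eq!(read_outputs(tx.len()), Err(SerError::CorruptedData));
    println!("ok");
}
```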
@antiochp It's indeed a deserialization error triggered here. But in that case we have 1 input and 332 outputs. https://github.com/mimblewimble/grin/blob/master/p2p/src/protocol.rs#L86 Which makes it extremely weird that it works on a local net and not with a remote node.
Okay, after going pretty deep, I've found that there is some kind of corruption during transport. For this transaction the error happens exactly here: https://github.com/mimblewimble/grin/blob/328d832bd6e3ac33cf4c160c4d78085dafb20adf/core/src/core/transaction.rs#L777
I managed to get the hex dump of the corrupted transaction.
Original transaction: original.txt
Sent transaction: sent.txt
Received transaction: received.txt
There is clearly a difference there; not sure why it happens though.
First difference is from index 264925 to 265024.
Sent:
42a14c572c2bb1a7e3d4f0ba528f1ca80c69f2200ca721caeac52df18ab996a66536b86c1bd8535a3bd89a4fe07fb5b2332
Received:
42a14c572c2bb11ec50f000000000003a2a878c537c95f587dbfea7f0300addc9b8ebddfd997ca5289ba771158f82e71163
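The divergence point in dumps like these can be located mechanically. A small standalone helper (not grin code) that reports the index of the first differing character of two hex strings:

```rust
// Standalone helper (not part of grin): find the index of the first
// character where two hex dumps diverge, or None if they are identical.
fn first_diff(a: &str, b: &str) -> Option<usize> {
    a.bytes()
        .zip(b.bytes())
        .position(|(x, y)| x != y)
        // if one dump is a prefix of the other, the diff is at the shorter length
        .or_else(|| (a.len() != b.len()).then(|| a.len().min(b.len())))
}

fn main() {
    // the prefixes of the Sent/Received dumps above share 14 characters
    let sent = "42a14c572c2bb1a7e3d4f0ba";
    let recv = "42a14c572c2bb11ec50f0000";
    assert_eq!(first_diff(sent, recv), Some(14));
    println!("ok");
}
```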
Done some digging as well.
We can now reliably and reproducibly generate txs that cause issues via the "multiple change outputs" arg -o 175 for grin wallet send.
Added some additional logging locally.
This is after creating a tx with 6 inputs and 176 outputs. The logs suggest it started reading the vec of outputs and then failed (seemingly arbitrarily) on the 109th one. It's not an ordering/sorting issue; we're not successfully deserializing all 176 outputs.
Aug 21 10:21:31.049 DEBG handle_payload: received tx: msg_len: 126566
Aug 21 10:21:31.225 ERRO read_and_verify_sorted: failed to read item 109, CorruptedData
Aug 21 10:21:31.225 ERRO tx body: read: outputs: CorruptedData
Edit: it is interesting because this tx was actually successfully processed by a mining node and was included in the next block -
https://grinscan.net/block/61047
But for whatever reason my node failed to deserialize it. So some nodes are handling these fine and others are failing.
This is a subsequent tx. Again deserialization of the outputs failed at an arbitrary output in the vec -
Aug 21 10:30:20.555 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:30:20.757 ERRO read_and_verify_sorted: failed to read item 101, CorruptedData
Aug 21 10:30:20.757 ERRO tx body: read: outputs: CorruptedData
Aug 21 10:30:23.558 INFO Client 195.201.38.140:13414 corrupted, ban (CorruptedData).
Aug 21 10:30:32.660 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:30:33.509 ERRO read_and_verify_sorted: failed to read item 101, CorruptedData
Aug 21 10:30:33.509 ERRO tx body: read: outputs: CorruptedData
Aug 21 10:30:33.560 INFO Client 206.189.88.220:13414 corrupted, ban (CorruptedData).
Aug 21 10:30:33.836 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:30:34.587 ERRO read_and_verify_sorted: failed to read item 151, CorruptedData
Aug 21 10:30:34.587 ERRO tx body: read: outputs: CorruptedData
We see this same tx from various peers, and it fails at a different output the third time we see it...
Aug 21 10:57:01.195 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:57:01.499 ERRO read_and_verify_sorted: failed to read item 86, CorruptedData
Aug 21 10:57:01.499 ERRO tx body: read: outputs: CorruptedData
Aug 21 10:57:03.849 INFO Client 167.99.87.54:13414 corrupted, ban (CorruptedData).
Aug 21 10:57:08.261 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:57:10.721 DEBG handle_payload: received tx: msg_len: 132346
Aug 21 10:57:12.570 ERRO read_and_verify_sorted: failed to read item 86, CorruptedData
Aug 21 10:57:12.570 ERRO tx body: read: outputs: CorruptedData
Aug 21 10:57:12.726 DEBG Received tx 4c23be57, inputs: 176, outputs: 176, kernels: 1, going to process.
Aug 21 10:57:13.850 INFO Client 87.117.38.161:13414 corrupted, ban (CorruptedData).
Aug 21 10:57:15.476 DEBG pool [txpool]: add_to_pool: 4c23be57, TxSource { debug_name: "p2p", identifier: "?.?.?.?" }, inputs: 176, outputs: 176, kernels: 1 (at block 039a7cad)
Same tx failing deserialization multiple times from different peers then being successfully processed from a different peer...
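One way the failing index could vary per delivery: if the reader walks a length-prefixed vec and the byte stream is truncated or mangled in transit, the read fails at whichever item the bytes run out on, so the index tracks the transport, not the tx content. A standalone sketch of that failure mode (hypothetical, not the real read_and_verify_sorted):

```rust
use std::io::{Cursor, Read};

// Sketch: read `count` fixed-size items from a stream. A short stream
// fails at whichever item the bytes run out on, so the failing index
// depends on how many bytes actually arrived, not on the item itself.
const ITEM_SIZE: usize = 4; // toy size; real outputs are much larger

fn read_items(buf: &[u8], count: usize) -> Result<Vec<[u8; ITEM_SIZE]>, usize> {
    let mut cur = Cursor::new(buf);
    let mut items = Vec::with_capacity(count);
    for i in 0..count {
        let mut item = [0u8; ITEM_SIZE];
        // a short read here is what surfaces as "failed to read item i"
        cur.read_exact(&mut item).map_err(|_| i)?;
        items.push(item);
    }
    Ok(items)
}

fn main() {
    let full = vec![0u8; 176 * ITEM_SIZE];
    assert!(read_items(&full, 176).is_ok());
    // deliver only part of the payload: the failure index tracks the
    // truncation point, mirroring the varying indices in the logs
    assert_eq!(read_items(&full[..109 * ITEM_SIZE + 1], 176).unwrap_err(), 109);
    assert_eq!(read_items(&full[..101 * ITEM_SIZE], 176).unwrap_err(), 101);
    println!("ok");
}
```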
So I think I reproduced this locally with a couple of nodes on 127.0.0.1 talking to each other. Same build, same env, same OS (obviously).
Sent a tx with 500 inputs and 1002 outputs.
Node1 -
Aug 23 22:25:04.313 DEBG pool [txpool]: add_to_pool: e654f6cf, TxSource { debug_name: "push-api", identifier: "?.?.?.?" }, inputs: 500, outputs: 1002, kernels: 1 (at block 08ecaa4d)
Aug 23 22:25:04.324 DEBG Send tx e654f6cf to 127.0.0.1:14414
Node2 -
Aug 23 22:25:04.326 DEBG handle_payload: received tx: msg_len: 735604
Aug 23 22:25:04.374 INFO Client 127.0.0.1:13414 corrupted, ban (CorruptedData).
I'm going to try and create an integration test in servers now to reproduce this in a single process. Should be very doable now.
@antiochp was your local test that reproduced it using dandelion?
I'm throwing in a PR https://github.com/mimblewimble/grin/pull/1415 with a test that reproduces the original issue very reliably.
In servers/tests/simulnet.rs there's a new test (that you need to un-#[ignore] to run) called replicate_tx_fluff_failure, which starts a few wallets and servers, mines a bit, sends a tx from one to the other, then checks the recipient got it.
If you change num_change_outputs in the api.issue_send_tx call to a lowish number (up to about 300 on my machine, anyhow), the test will pass. If you increase it (try 500), the transaction never gets mined and the test fails... this happens with 100% consistency.
Also note it doesn't matter whether the transaction is fluffed or not; sending true or false into api.post_tx() doesn't make a difference.
Now I haven't seen anything about corrupted transactions in the log... this may be something local, or there may be a few causes here. Would appreciate it if others could play with this test as well.
Also recommend adding:
logger.log_to_file = true;
logger.file_log_level = LogLevel::Debug;
into util/src/logger.rs:init_test_logger, because this gets spammy.
No... it just seems like the issue I was having in this test earlier was that I was not running the test long enough to give the transaction pool enough time to validate. I've updated the test to create 1000 outputs, stem the transaction and run for a few minutes... on my machine, on a release build, the test completes without issue consistently.
@quentinlesceller @antiochp Would you mind running the test as is on your machines to see if it completes as well? This test should pretty closely replicate what happens in the real world, the only difference being that everything is running under the same parent process (and things taking a bit longer as a result).
The test completes on my side. Prior to that I was not able to replicate the issue on my local machine with several nodes, so I suspect it works locally. I observed the issue with one Mac node sending a large-output tx to an Ubuntu node; it fails 99% of the time. Not sure if this is OS-related though...
Closing that one since it's probably not relevant anymore. Will reopen if needed.
I have been experimenting with large transactions for a bit, and I noticed that a transaction with a large number of outputs doesn't seem to fluff successfully. For reference I first tried a normal 1->2 transaction, which was mined quickly after submitting. Here is what the logs look like:
You can see that after sending the tx to the other nodes, we receive the same tx multiple times, as expected. Now compare this to when I send a tx with 332 outputs:
The transaction verifies locally, but after sending it to a bunch of nodes, none of them ever send it back. I think this is why it never gets mined; for some reason the other nodes don't accept the block. The only way I have found to get the tx into a block is to mine it myself.
I asked @ignopeverell and he suspected the transaction might hit some size limit, but he said everything below 2MB should be fine. The actual transaction is <500kB in size.
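Back-of-envelope arithmetic (assumed per-output sizes, not the exact grin wire format) is consistent with the <500kB figure and stays far below 2MB:

```rust
// Rough size estimate per output (assumed: 33-byte Pedersen commitment
// + ~675-byte bulletproof + small header). Not the exact grin
// serialization; just an order-of-magnitude check.
const BYTES_PER_OUTPUT: usize = 33 + 675 + 9;

fn approx_tx_size(n_outputs: usize) -> usize {
    // ignore input/kernel overhead; the outputs dominate for large txs
    n_outputs * BYTES_PER_OUTPUT
}

fn main() {
    let size = approx_tx_size(332);
    assert!(size < 500 * 1024); // consistent with the <500kB observed above
    println!("332 outputs ~ {} bytes", size);
}
```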
The issue can be reproduced by POSTing the not-yet-mined transaction (which can be found here) to your own node at /v1/pool/push?fluff