Fuzzing and hardening - Githubissues

koivunej commented 3 years ago

This builds upon #690. Some found fixes have already landed with #695. It adds on top of any fixes in earlier already:

feature strict to interledger-packet for accepting only varuints which have no leading zeroes
use of that interledger-packet/strict in btp, stream at least
var_octet_bytes can now only be usize wide, which should probably be specced lower to match reality
- handling u64::MAX length payloads would really need a streaming approach
the u64::MAX case became apparent in ccp which preallocated the amount of items thus in fuzzing created huge allocations; now ccp and others mostly Vec::new which does have a chance of perf hit
fuzzing in 4 crates

interledger-packet/strict is useful for roundtrip fuzzing where input is first parsed and then converted back into bytes. When interledger-packet/strict is not enabled any additional leading zeros are accepted for varuints. Accepting them but not outputting them when converting the packet back to bytes creates a lot of false panics when fuzzing. strict shouldn't be enabled in any production usage because it did look like the java implementation for example didn't have such strict checks on parsed BigIntegers.

Most of this work came out of #680 where at updating interledger-stream I ran into the bad length check. Fixing the length check right away failed a lot of existing test cases, which got me to do differential fuzzing between two versions. Differential fuzzing of course runs into all of the previously found issues with roundtrip fuzzing. ~I will probably drop the differential fuzzing before making this ready for review.~ Dropped already.

Per crate fuzzing targets and status:

packet: address is good, packet roundtrip added late, immediate roundtrip failure(s), #[ignore] tests
btp: most fuzzed, looking quite good
stream: roundtrip added late, immediate roundtrip failure(s), #[ignore] tests
ccp: roundtripping, might be good

~85% of added lines are caused by 4 added Cargo.lock files.~

koivunej commented 3 years ago

Build failed with the latest checkpoint-y commit but that is ok as work continues. I've removed the btp "differential fuzzing" in 68260fe. The timestamp parsing was found quite lenient while fuzzing the btp, so that'll be the next thing but I'm ready to call this good to go with the ccp huge preallocations fixed. I'll probably complete this work by creating a few issues on:

more fuzzing
fixing the preallocation somehow in ccp and others which currently call Vec::new()

koivunej commented 3 years ago

Hardening doesn't seem to great with test-md failures. Lets see if I can decipher these.

koivunej commented 3 years ago

It would appear that the "xrp-settlement" failed. From the artifact it'd appear that everything went smooth for alice but in node-bob-settlement-engine.log:

$ less node-bob-settlement-engine.log
2021-01-29T09:58:02.213Z settlement-xrp Generated new XRP testnet account: address=rnGBkZRibgCZ6p3rVHHNVj2VmudXe3nedk secret=snscPxz9TQoDXMSN3fBekxaMk1Xt8
2021-01-29T09:58:02.723Z settlement-core Started settlement engine server
2021-01-29T09:58:12.944Z settlement-xrp Received incoming XRP payment: xrp=0.0005 account=5f3024b6-38d7-46aa-b32a-79cda329e30a txHash=19A0316F6C859266EDE0E2E4981A2AFEEB3DEE30622FF75EBF8F5AB4E1464182
2021-01-29T09:58:12.947Z settlement-core Notifying connector to credit settlement: amountToCredit=0.0005 account=5f3024b6-38d7-46aa-b32a-79cda329e30a settlementId=19A0316F6C859266EDE0E2E4981A2AFEEB3DEE30622FF75EBF8F5AB4E1464182
2021-01-29T09:58:12.955Z settlement-core Error: Connector failed to process settlement: amountToCredit=0.0005 account=5f3024b6-38d7-46aa-b32a-79cda329e30a settlementId=19A0316F6C859266EDE0E2E4981A2AFEEB3DEE30622FF75EBF8F5AB4E1464182
2021-01-29T09:58:12.955Z settlement-core Connector credited incoming settlement: leftover=0.0005 credited=0 amountToCredit=0.0005 account=5f3024b6-38d7-46aa-b32a-79cda329e30a settlementId=19A0316F6C859266EDE0E2E4981A2AFEEB3DEE30622FF75EBF8F5AB4E1464182
[ioredis] Unhandled error event: Error: connect ECONNREFUSED 127.0.0.1:6382
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16)
[ioredis] Unhandled error event: Error: connect ECONNREFUSED 127.0.0.1:6382
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16)
[ioredis] Unhandled error event: Error: connect ECONNREFUSED 127.0.0.1:6382
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16)

127.0.0.1:6382 is the redis assigned to bob's settlement engine
- in scripts the redises are shut down first so it explains these ECONNREFUSED

The whole execution log from the log failure:

Testing "xrp-settlement" on source mode. [4/4]
rm: cannot remove '/home/runner/.interledger/bin/*': No such file or directory
Stopping Interledger nodes...
Building interledger.rs... (This may take a couple of minutes)
    Finished dev [unoptimized + debuginfo] target(s) in 0.18s

Starting Redis instances...done
OK
OK
OK
OK

Starting settlement engines...

Starting Interledger nodes...

Failed to spin up settlement engine. Check out your configuration and log files.
Error running markdown file: ./README.md (parsed bash script /tmp/tmp.IqLUsVs1xo)
Waiting for Interledger.rs nodes to start up..............Some tests failed. [3/4]

I do wonder how come the line preceding Waiting for Interledger.rs nodes to start up is after Failed to spin up settlement engine. Also the failed to spin up settlement engine comes right before exit 1. Cannot see any subshell magic in the lib or run-md-test.sh. Maybe the parser? Nope. The parser seems to just leave the necessary parts in and it is executed with bash -O expand_aliases "$TMP_SCRIPT" where $TMP_SCRIPT is tempfile.

Could be that the formatting issues are because of the strange output where \n is used to prefix messages? Itching to get rid of these and just echo and variable substition these.

Need to get these running locally. I guess I have to spin up a vm. The scripts don't really look like they cleanup after themselves... So this could even be interference from previous test but probably not as these have been executing well.

koivunej commented 3 years ago

Trying to bisect this with another branch.

koivunej commented 3 years ago

Ok I guess the test is just flaky? Good three times (2 here, 1 in first step of bisection) in row, next one will be the third. Separating the CI stuff to a new PR.

koivunej commented 3 years ago

Did the hopefully the final rebasing, made sure that the strict is not enabled for the default cargo build -p ilp-node --all-features builds -- enabling it would be in the spriti of OER notes (rfc 0030) but if someone is trying it against other implementations those might not be compatible.

koivunej commented 3 years ago

test failure which #701 would fix. Lets see what's the test-md failure. It was the usual (see https://github.com/interledger-rs/interledger-rs/pull/693#issuecomment-808343941), now we should have packet capture. But the failure doesn't repeat today...

koivunej commented 3 years ago

I'll merge this today evening or tomorrow after re-running it to see if there's some time-specific issue which could reproduce.

koivunej commented 3 years ago

Dropped the Cargo.lock files in the latest push, I cannot see any point in keeping them.

koivunej commented 3 years ago

Test failure is "normal flaky", lets hope for a test-md failure instead. "Normal flaky" error was:

thread 'three_nodes::three_nodes' panicked at 'error binding to 127.0.0.1:40675: error creating server listener: Address already in use (os error 98)', /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/warp-0.2.5/src/server.rs:213:27

This is because the tests do not use :0 but instead hope racing to listen and close a port. This failure happens surprisingly rarely.

koivunej commented 3 years ago

Now two passes after the "normal flaky." Lets try a third.

koivunej commented 3 years ago

Issue has been created, seems that we'll need to find another PR to trigger the occasional test-md failure.

interledger / interledger-rs

Fuzzing and hardening #693