Segmentation fault: 11 after connecting outbound channel

JasonCoombs commented 5 years ago

I mentioned this here https://github.com/libbitcoin/libbitcoin-node/issues/383#issuecomment-433738651

The segmentation faults appear to result from some errant or unexpected behavior of particular peers.

19:33:36.127991 INFO [node] 0x700000433000 Header #1442161 [00000000000000ae54fdc39cba6ebf6cc06ac267f447e9ea4c57e40296ed0368] from [176.9.89.217:18333] (16382, 4).
19:33:36.771122 INFO [node] Block #1442161 (00) [00000000000000ae54fdc39cba6ebf6cc06ac267f447e9ea4c57e40296ed0368] 000.000 00.05% 0
19:33:36.874183 INFO [blockchain] 0x700000433000 Organized blocks [1442161-1442161]
19:33:36.907941 INFO [node] 0x700000433000 Header #1442162 [000000000000011b754146ef49781b84564d68cb955817da3f834b12cfe1f912] from [176.9.89.217:18333] (16382, 4).
19:33:37.140944 INFO [node] Block #1442162 (00) [000000000000011b754146ef49781b84564d68cb955817da3f834b12cfe1f912] 000.000 00.05% 0
19:33:37.143022 INFO [blockchain] 0x700000433000 Organized blocks [1442162-1442162]
19:42:12.358115 INFO [node] 0x700000433000 Header #1442163 [00000000000000c8135693d41f41a7da0f40df02540898c6fd1cfe88a964595c] from [176.9.89.217:18333] (16382, 4).
19:42:12.631068 INFO [node] Block #1442163 (00) [00000000000000c8135693d41f41a7da0f40df02540898c6fd1cfe88a964595c] 000.000 00.05% 0
19:42:12.640404 INFO [blockchain] 0x700000433000 Organized blocks [1442163-1442163]
19:48:17.806944 INFO [node] 0x700000433000 Header #1442164 [00000000000001245fb8f6be64c1ce3fe08d2f19048f3f6e6157d1f8303b8c84] from [176.9.89.217:18333] (16382, 4).
19:48:18.354831 INFO [node] Block #1442164 (00) [00000000000001245fb8f6be64c1ce3fe08d2f19048f3f6e6157d1f8303b8c84] 000.000 00.05% 0
19:48:18.386059 INFO [blockchain] 0x700000433000 Organized blocks [1442164-1442164]
20:07:45.554496 INFO [node] 0x700000433000 Header #1442165 [0000000065db2222c4fce30025fd010d20cf682f2c6166975f14c6f3f20d913e] from [176.9.89.217:18333] (16382, 4).
20:07:46.179416 INFO [node] Block #1442165 (00) [0000000065db2222c4fce30025fd010d20cf682f2c6166975f14c6f3f20d913e] 000.000 00.05% 0
20:07:46.237499 INFO [blockchain] 0x700000433000 Organized blocks [1442165-1442165]
20:14:57.552525 INFO [network] Connected outbound channel [52.243.61.218:18333] (1)
Segmentation fault: 11

JasonCoombs commented 5 years ago

Thinking it would be useful to be able to activate a mode in which ONLY blacklisted peers are allowed, for testing and debugging purposes. Does the Bitcoin community operate any known-malicious peers to help detect bugs in new implementations and for regression testing?

JasonCoombs commented 5 years ago

My current list, some of these peers might have caused segmentation faults:

# IP address to disallow as a peer, multiple entries allowed.
blacklist = 62.182.124.202
blacklist = 51.15.113.51
blacklist = 213.207.92.204
blacklist = 107.167.186.147
blacklist = 108.44.193.95
blacklist = 81.169.162.150
blacklist = 142.93.239.160
blacklist = 5.102.147.62
blacklist = 178.128.85.239
blacklist = 144.76.71.57
blacklist = 167.99.219.177
blacklist = 95.216.23.174
# segmentation fault during block sync
blacklist = 52.243.61.218
blacklist = 195.201.110.218
blacklist = 13.124.0.147

JasonCoombs commented 5 years ago

Do we have a way to determine the software and version number of particular peers?

Like a method for nmap to use to discover software version with a pattern signature.

https://nmap.org/book/man-version-detection.html https://nmap.org/book/vscan.html

Shall we add support in libbitcoin for remote version detection?

JasonCoombs commented 5 years ago

Using https://nmap.org/nsedoc/scripts/bitcoin-info.html with https://nmap.org/nsedoc/lib/bitcoin.html

nmap already supports bitcoin user agent remote discovery, but it isn't working for Testnet nodes.

$ nmap -n -p 8333 --script bitcoin-info 75.40.139.202
Starting Nmap 7.70 ( https://nmap.org ) at 2018-11-05 13:51 HST
Nmap scan report for 75.40.139.202
Host is up (0.12s latency).

PORT     STATE SERVICE
8333/tcp open  bitcoin
| bitcoin-info:
|   Timestamp: 2018-11-05T23:51:05
|   Network: main
|   Version: 0.7.0
|   Node Id: 72c8d9b0b72850b8
|   Lastblock: 548938
|_  User Agent: /Satoshi:0.17.99/

Nmap done: 1 IP address (1 host up) scanned in 0.77 seconds

evoskuil commented 5 years ago

If there is a fault it’s not with the peer, it’s with libbitcoin.

JasonCoombs commented 5 years ago

it will help to know which software the peer is running, so the error condition might be forced when debugging.

evoskuil commented 5 years ago

I see from the console output that the node in question is modified. I assume this pertains to logging only, but it may still be problematic. I synced and validated all of mainnet and have encountered no stability problems.

evoskuil commented 5 years ago

Shall we add support in libbitcoin for remote version detection?

This is native to the Bitcoin p2p protocol, and already logged.

evoskuil commented 5 years ago

What makes you think this is related to peer behavior?

JasonCoombs commented 5 years ago

There's clearly something different about the way certain peers communicate that causes a crash immediately after connecting to them. Blacklist the peer and the problem goes away, until a similar peer is encountered.

evoskuil commented 5 years ago

That doesn't imply correlation with the peer. It could be entirely unrelated and the same might result, since the peer may or may not even be selected again. Maybe if you are seeing it repeatedly happen after the same peer connects, but otherwise I wouldn't make any such assumption.

A better way to test this theory is to simply connect only to the presumed problem peer(s). You can do this using manual connections (though that does introduce a small distinction in the code path, though the same is true of adding peers to the blacklist).

JasonCoombs commented 5 years ago

I tried using nmap service discovery with the Nmap Scripting Engine on my local libbitcoin-node but it isn't reporting any details. compare to https://github.com/libbitcoin/libbitcoin-node/issues/392#issuecomment-436049297

$ nmap -n -p 8333 --script bitcoin-info 127.0.0.1
Starting Nmap 7.70 ( https://nmap.org ) at 2018-11-05 13:57 HST
Nmap scan report for 127.0.0.1
Host is up (0.00020s latency).

PORT     STATE SERVICE
8333/tcp open  bitcoin

Nmap done: 1 IP address (1 host up) scanned in 10.31 seconds

JasonCoombs commented 5 years ago

That doesn't imply correlation with the peer.

I started adding peers to my blacklist when I did see a correlation, it was the same IP each time it crashed. Without the blacklist there's no way my Testnet node could have sync'ed current.

JasonCoombs commented 5 years ago

A better way to test this theory is to simply connect only to the problem peer(s). You can do this using manual connections

can you point me to the documentation for initiating a manual connection? thanks.

evoskuil commented 5 years ago

That’s a feature not a bug.

JasonCoombs commented 5 years ago

That’s a feature not a bug.

you mean the inability of nmap to remote-discover that we are libbitcoin

evoskuil commented 5 years ago

It could simply be reorganization from weaker chains. This was failing until I fixed it a couple days ago, so make sure you have latest.

JasonCoombs commented 5 years ago

Shall we add support in libbitcoin for remote version detection?

This is native to the Bitcoin p2p protocol, and already logged.

I meant user agent, not protocol version. do we not want to be discoverable by nmap for some reason?

evoskuil commented 5 years ago

User agent is part of the P2P version message.

Why on earth would we want to add code to advertise information to port scanners.

JasonCoombs commented 5 years ago

because security by obscurity is not security

evoskuil commented 5 years ago

You missed the point. This is just dead code, which has a cost, and, as all code, does translate into increased attack surface.

JasonCoombs commented 5 years ago

if the user agent is already supposed to be transmitted as part of the P2P version message, do you have a clue why nmap isn't able to process ours?

evoskuil commented 5 years ago

Sounds like a fun research project.

JasonCoombs commented 5 years ago

oh, interesting, it worked when nmap was the first and only inbound or outbound connection:

PORT     STATE SERVICE REASON
8333/tcp open  bitcoin syn-ack
| bitcoin-info:
|   Timestamp: 2018-11-06T00:12:39
|   Network: main
|   Version: 0.7.0
|   Node Id: e8fad242281281e2
|   Lastblock: 42561
|_  User Agent: /libbitcoin:4.0.0/
Final times for host: srtt: 149 rttvar: 3772  to: 100000

evoskuil commented 5 years ago

Then it's likely the failure to respond quickly that gets it terminated by the batch session.

evoskuil commented 5 years ago

A port scanner is not a peer, and we have no interest in allowing it to hold the port while there is a more responsive peer.

evoskuil commented 5 years ago

Have you determined that you have current code in all repos?

JasonCoombs commented 5 years ago

A port scanner is not a peer, and we have no interest in allowing it to hold the port while there is a more responsive peer.

there's no way for us to know that nmap is the peer at the point that the peer first establishes a new inbound connection, and even if we did want to prevent nmap from scanning our nodes in my view we should not do so because it's not well-behaved network application design. people need to be able to use network admin tools to admin networks. plus, security through obscurity is not even something that Microsoft believes is valid any longer, so being hidden on the network is self-deception.

JasonCoombs commented 5 years ago

Have you determined that you have current code in all repos?

I've been carefully reproducing any problems that I encounter in my development branch using vanilla copies of everything current in master from all of the repos. the build problem was caused by changes I've made in my development branch, where I do NOT have the latest code from all repos.

When in doubt my second step is always to reproduce with the vanilla libbitcoin code. First step is to bother you again expecting you to fix it for me. ;-)

evoskuil commented 5 years ago

Isolating (and even fixing) problems on your own when you may have that option is much better than generating conversation and work for others like myself who try to remain responsive.

JasonCoombs commented 5 years ago

I realize that my style is cluttering up your GitHub. But there's nobody else here. Maybe being a little LESS responsive will encourage more people to participate. Every other open source project in the entire world, every one of them, usually has work in progress and open issues dating back months or years. Open issues show other people how they might be helpful. You do want other people to help, even if it seems like a waste of your time right now.

evoskuil commented 5 years ago

Discussions are on slack. There are few issues here because we work hard to manage them. A large number of open issues makes it impossible to evaluate the state of a project. If the issues are meaningful, or the discussion helpful, then they are of course welcome. But this isn't a discussion board. Please use the slack channel for casual discussion about passing thoughts and reserve this space for concise useful information that should be archived.

open issues dating back months or years.

Not here. I try very hard to limit open issues to one page in the browser, in each repo. Similarly pull requests are not allowed to linger for long periods of time.

evoskuil commented 5 years ago

Please reopen if this is happening with unmodified code. I am not seeing it with current code.

JasonCoombs commented 5 years ago

Thanks, the problem was here:

https://github.com/libbitcoin/libbitcoin-blockchain/blob/master/src/validate/validate_header.cpp#L106

and here:

https://github.com/libbitcoin/libbitcoin-blockchain/blob/master/src/populate/populate_header.cpp#L48

where my debugging of a null pointer previously needed to be removed now that (hopefully) there are no longer validate header and populate header events being triggered with either missing metadata or null branch pointer. continuing to test this.

only certain peers appear to send headers when blocks are requested, so blacklisting had the effect of avoiding causing either validate header or populate header from being invoked. I managed to sync the entire Testnet with v4 without once encountering unsolicited headers apparently because most peers give us headers only when we ask for them.

JasonCoombs commented 5 years ago

Discussions are on slack.

I don't agree with that policy. Slack is not searchable the way GitHub is. Don't discuss things here if you don't want to, but other people might appreciate more technical detail here where it is more accessible, especially developers new to the project who need to see step by step analysis of how the pieces all fit together. My participation here will diminish soon, so there's no need to tell me to leave.

JasonCoombs commented 5 years ago

After additional testing and analysis of my easily-reproducible segmentation faults using my custom debugging output on my development branch, as well as these Issues: https://github.com/libbitcoin/libbitcoin-node/issues/396 and https://github.com/libbitcoin/libbitcoin-node/issues/397 it is my opinion that there is in fact an unresolved segmentation fault problem.

I believe the segmentation fault is occurring only when an ASIO callback function is triggered in an unexpected order sequence, BEFORE the network request or database/blockchain read has occurred through which the metadata state of a newly-received header is supposed to be populated.

Somehow, missing metadata for the header is resulting in a null pointer during one of these calls:

https://github.com/libbitcoin/libbitcoin-blockchain/blob/master/src/validate/validate_header.cpp#L106

or

https://github.com/libbitcoin/libbitcoin-blockchain/blob/master/src/populate/populate_header.cpp#L48

either on those very lines (attempting to dereference a null pointer) or subsequent to them when the handler is invoked (calling a null pointer, if for some reason the handler function pointer is invalid).

this problem is most likely a result of incorrect use of ASIO, where we improperly presume that ASIO callback functions will be dispatched in a FIFO queue that doesn't exist. the ASIO documentation says that we cannot rely on ASIO to dispatch functions in any particular order, and we are therefore supposed to use a mechanism known as a "strand" which is described here:

https://www.boost.org/doc/libs/1_68_0/doc/html/boost_asio/overview/core/strands.html http://www.crazygaze.com/blog/2016/03/17/how-strands-work-and-why-you-should-use-them/ https://stackoverflow.com/questions/39097644/how-strands-guarantee-correct-execution-of-pending-events-in-boost-asio

the resubscriber utility attempts to use a strand:

https://github.com/libbitcoin/libbitcoin/blob/master/include/bitcoin/bitcoin/impl/utility/resubscriber.ipp#L122

but it seems likely that this is not being used in every place that we have result handlers queued. do we have multiple strands, one per I/O threadpool, and if so then how are we currently enforcing order?

I am having trouble reproducing the segmentation fault without my extensive debug logging calls inserted in the processing, and I believe that my debug logging is changing the timing in such a way that the race condition that exists somewhere more easily reveals itself by dispatching handlers in an order that leaves something uninitialized and attempting to invoke or access a null pointer is the result.

evoskuil commented 5 years ago

While there certainly may be problems, the fact that, based on your comments above you really don’t understand the design, and that it only reproduces in your customized version, I’m inclined to assume it’s your code vs. goin off to chase phantoms.

libbitcoin / libbitcoin-node

Segmentation fault: 11 after connecting outbound channel #392