transmission / transmission

Official Transmission BitTorrent client repository
https://transmissionbt.com

failure to consume BT messages just after handshake #6909

Closed reardonia closed 3 weeks ago

reardonia commented 3 months ago

What is the issue?

This appears to be long-standing behavior, certainly present in Tr 4.0.x and mainline.

In the normal event processing loop, data that has been received but not yet processed is discarded when a peer drops the connection. This is unfortunately true even at the very beginning of a message exchange, just after handshake. At the end of the handshake (after the 68-byte core payload is exchanged), clients start exchanging regular BT messages, most often beginning with the Ltep exchange. The Ltep exchange most importantly establishes whether we and/or the peer are seeds. The problem is that the connection is dropped as soon as either side sees that the other is a seed (aka "have all" pieces). For Tr4, that means the Ltep messages are never processed, because the cascade of io::try_read -> peer_socket::try_read -> bufferWriter::add_socket just returns an error and the peer is torn down without further processing of the messages left in the read buffer. I've noted that qBittorrent is one of the peers that closes as early as it can, but it presumes that we have read its Ltep messages; we have not.

The overriding question is: should Transmission always process the BT message stream completely on (peer) socket close?

There is a significant downside to the current behavior. The client never learns that the peer is a seed, so it just keeps retrying the peer endlessly, every second. This leads to an enormous amount of chatter and housekeeping of swarms even when those swarms are all seeds. (I will file a separate bug on the swarm_is_all_seeds logic, which is currently failing during the announce-scrape gap.)
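The discard-on-disconnect behavior described above can be reduced to a small sketch. All types and names below are hypothetical (this is not Transmission's actual API): on a socket error, the current path drops the read buffer outright, while a draining path would first parse whatever complete messages are already buffered.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a peer connection; `inbuf` holds bytes that
// arrived with, or just after, the handshake (e.g. fext-have-all, ltep).
struct PeerConn
{
    std::string inbuf;
    std::size_t msgs_processed = 0;
};

// Current pattern (as reported): a socket error tears the peer down
// immediately, so anything still sitting in inbuf is never parsed.
std::size_t on_peer_error_current(PeerConn& c)
{
    c.inbuf.clear(); // buffered messages are discarded unprocessed
    return c.msgs_processed;
}

// Proposed pattern: drain complete messages from inbuf before teardown.
// (Stubbed parser: one byte == one complete message.)
std::size_t on_peer_error_proposed(PeerConn& c)
{
    while (!c.inbuf.empty())
    {
        c.inbuf.erase(0, 1);
        ++c.msgs_processed;
    }
    return c.msgs_processed;
}
```

With three buffered messages, the current path processes none of them; the draining path processes all three before the peer is torn down.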

Which application of Transmission?

transmission-daemon

Which version of Transmission?

Tr4: 4.0.x and mainline

tearfur commented 3 months ago

I'm confused by your claims:

  1. The Ltep exchange most importantly establishes whether we and/or peer are seeds.

    It doesn't. AFAIK, this is done either by the bitfield message from BEP-3, or the Have All/Have None messages from BEP-6. If I missed something, can you quote which mechanism from BEP-10 achieves this?

    Either way, I don't think this is important here.

  2. For Tr4, that means the Ltep messages are never processed, because the cascade of io::try_read -> peer_socket::try_read -> bufferWriter::add_socket just return an error and the peer is torn down without further processing of messages that are left in the read buffer.

    This doesn't make sense to me. For this to happen, it would mean that the peer closing the socket (by sending a TCP FIN, for example) would cause the OS of our client to free its read buffers. I can't imagine that any well-established OS is written like that.

    I can't quote any OS source code for now, but it'd make more sense for the read buffers to be freed only after our client calls close().

tearfur commented 3 months ago

Expanding on point (2) from my previous comment.

Maybe you are thinking that tearing down a peer might cause us to lose the unprocessed data stored in tr_peerIo::inbuf_. The truth is, any data that is left unprocessed after a tr_peerIo::try_read() call is either an incomplete or a malformed BT message, so it is meaningless to us anyway.

This is ensured by calling can_read_wrapper() inside try_read():

https://github.com/transmission/transmission/blob/d42d0f3f3f50c01cda1d9f7180d1f4a3ef8aba63/libtransmission/peer-io.cc#L451-L467

If this answers your questions, please close the issue.

reardonia commented 3 months ago

I'm confused by your claims:

  1. The Ltep exchange most importantly establishes whether we and/or peer are seeds.

    It doesn't. AFAIK, this is done either by the bitfield message from BEP-3, or the Have All/Have None messages from BEP-6. If I missed something, can you quote which mechanism from BEP-10 achieves this? Either way, I don't think this is important here.

Sorry, I lumped the "fast" extension messages from BEP 6 in with the LTEP messages in BEP 10. The problem is that those fast-extension messages are not consumed.

reardonia commented 3 months ago

Expanding on point (2) from my previous comment.

Maybe you are thinking that tearing down a peer might cause us to lose the unprocessed data stored in tr_peerIo::inbuf_. The truth is, any data that is left unprocessed after a tr_peerIo::try_read() call is either an incomplete or a malformed BT message, so it is meaningless to us anyway.

This is ensured by calling can_read_wrapper() inside try_read():

https://github.com/transmission/transmission/blob/d42d0f3f3f50c01cda1d9f7180d1f4a3ef8aba63/libtransmission/peer-io.cc#L451-L467

If this answers your questions, please close the issue.

I think you are pointing right at the problem: "if (error) { do NOT call can_read_wrapper }"

As you said, there is data in inbuf_ that came in at the end of the handshake, but it will never be consumed. Maybe I'm misunderstanding the way peerMsgs::can_read works? As far as I can see, the only place it is ever called from is can_read_wrapper.
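The early return at issue can be reduced to a toy model (invented names; the real code is in peer-io.cc): try_read only hands inbuf to can_read_wrapper on the no-error path, so bytes that arrived alongside the final handshake flight are stranded when the peer disconnects.

```cpp
#include <cassert>
#include <string>

// Hypothetical reduction of the control flow under discussion;
// not Transmission's actual implementation.
struct Io
{
    std::string inbuf;
    int n_parsed = 0;

    // Stub parser: consumes everything currently buffered.
    void can_read_wrapper()
    {
        n_parsed += static_cast<int>(inbuf.size());
        inbuf.clear();
    }

    // socket_error models e.g. ENOTCONN from the underlying read.
    void try_read(bool socket_error, std::string const& newly_read)
    {
        inbuf += newly_read;
        if (socket_error)
        {
            return; // <-- the line at issue: inbuf still holds unparsed
                    //     messages, but the peer is purged without parsing
        }
        can_read_wrapper();
    }
};
```

A successful read parses everything; an erroring read leaves the freshly buffered bytes unparsed, which matches the `have:144` leftover shown in the trace below.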

reardonia commented 3 months ago

fwiw, here is a trace with additional "have=" info about contents of inbuf_, illustrating the problem:

[2024-06-10 21:31:32.283] trc 32.219.###.###:62989 libevent says this peer socket is ready for reading (peer-io.cc:486)
[2024-06-10 21:31:32.283] trc 32.219.###.###:62989 enabling ready-to-read polling (peer-io.cc:511)
[2024-06-10 21:31:32.283] trc handshake 32.219.###.###:62989 handling can_read; state is [awaiting vc] (handshake.cc:600)
[2024-06-10 21:31:32.283] trc handshake 32.219.###.###:62989 found ENCRYPT(VC)! (handshake.cc:165)
[2024-06-10 21:31:32.283] trc handshake 32.219.###.###:62989 len(PadB) is 273 (handshake.cc:168)
[2024-06-10 21:31:32.283] trc handshake 32.219.###.###:62989 crypto select is 2 (handshake.cc:194)
[2024-06-10 21:31:32.283] trc handshake 32.219.###.###:62989 len(PadD) is 14 (handshake.cc:203)
[2024-06-10 21:31:32.283] trc handshake 32.219.###.###:62989 PadD: need 14, got 226 (handshake.cc:216)
[2024-06-10 21:31:32.283] trc handshake 32.219.###.###:62989 read_handshake: need 48, got 212 (handshake.cc:242)
[2024-06-10 21:31:32.283] trc handshake 32.219.###.###:62989 read_peer_id: need 20, got 164 (handshake.cc:332)
[2024-06-10 21:31:32.283] trc handshake 32.219.###.###:62989 peer-id is 'µTorrent 2.2.1' ... isIncoming is false (handshake.cc:342)
[2024-06-10 21:31:32.283] trc handshake 32.219.###.###:62989 144 more bytes remain after handshake (handshake.cc:352)
[2024-06-10 21:31:32.283] trc TORRENTNAME create_bit_torrent_peer 32.219.###.###:62989 (peer-mgr.cc:1275)
[2024-06-10 21:31:32.283] trc TORRENTNAME 32.219.###.###:62989 [µTorrent 2.2.1]: sending an ltep handshake (peer-msgs.cc:1090)
[2024-06-10 21:31:32.283] trc TORRENTNAME 32.219.###.###:62989 [µTorrent 2.2.1]: sending 'ltep' 0 [] (peer-msgs.cc:859)
[2024-06-10 21:31:32.283] trc TORRENTNAME 32.219.###.###:62989 [µTorrent 2.2.1]: sending 'fext-have-all' (peer-msgs.cc:859)
[2024-06-10 21:31:32.283] trc 32.219.###.###:62989 enabling ready-to-write polling (peer-io.cc:523)
[2024-06-10 21:31:32.283] trc 32.219.###.###:62989 libevent says this peer socket is ready for writing (peer-io.cc:359)
[2024-06-10 21:31:32.283] trc 32.219.###.###:62989 libevent says this peer socket is ready for reading (peer-io.cc:486)
[2024-06-10 21:31:32.283] trc 32.219.###.###:62989 try_read err: n_read:0 errno:107 (Transport endpoint is not connected) have:144 (peer-io.cc:468)
[2024-06-10 21:31:32.283] dbg TORRENTNAME setting 32.219.###.###:62989 OUTGOING do_purge flag because we got [(107) Transport endpoint is not connected] (peer-mgr.cc:816)

reardonia commented 3 months ago

I'm confused by your claims:

  1. The Ltep exchange most importantly establishes whether we and/or peer are seeds.

    It doesn't. AFAIK, this is done either by the bitfield message from BEP-3, or the Have All/Have None messages from BEP-6. If I missed something, can you quote which mechanism from BEP-10 achieves this? Either way, I don't think this is important here.

Sorry, I lumped the "fast" extension messages from BEP 6 in with the LTEP messages in BEP 10. The problem is that those fast-extension messages are not consumed.

Actually, I think I was correct in the initial assertion: BEP 21 includes the LTEP "upload_only" message, which designates the sender as a seed (or partial seed).

reardonia commented 3 months ago

The handling of seeds seems a bit messy, tbh.

(1) If we detect a seed via the "have_all" message, we go down one path, with an observer/emit protocol. This doesn't seem to do anything other than set the bitfield hint. Seems like a waste of time? No "seed" state is ever changed. A set_have_all() is done, but we don't use that to check is_seed... mostly.
(2) If we detect a seed via "upload_only", we just set ADDED_F_SEED_FLAG in the pex object. It doesn't look like we do that in response to "have_all".
(3) Then there is peerIo::is_seed, which has a getter but no setter; is_seed is set on instantiation and stuck.
(4) The abstract class tr_peer has no local seed state and instead uses the has().has_all() method against the bitfield object.

Seems like they should all be handled the same way, with an explicit seed/upload_only status separate from the bitfield, maintained consistently across object lifetimes. Currently, none of the seed discovery is propagated back to the connectable_pool (aka the atom list), except in the case of an incoming connection that gave us a full LTEP ipv4+port update. Transmission may be the only client that actually sends all that; qBT certainly doesn't.
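As a sketch of that consolidation (all types and names here are hypothetical, not proposed API): one seed flag, settable from any discovery path (fext have-all, LTEP upload_only), and propagated to the pool entry for that address.

```cpp
#include <cassert>

// Hypothetical stand-in for a connectable_pool entry (the "atom").
struct PeerInfo
{
    bool seed = false;
    void set_seed() { seed = true; }
};

// Hypothetical peer object: every seed-discovery path funnels into one
// is_seed() predicate and one propagation point back to the pool entry.
struct Peer
{
    PeerInfo* info = nullptr;
    bool have_all = false;
    bool upload_only = false;

    [[nodiscard]] bool is_seed() const { return have_all || upload_only; }

    void on_fext_have_all() { have_all = true; propagate(); }
    void on_ltep_upload_only() { upload_only = true; propagate(); }

private:
    void propagate()
    {
        if (info != nullptr && is_seed())
        {
            info->set_seed(); // pool learns the seed status immediately
        }
    }
};
```

Either discovery path then marks the atom as a seed, so the reconnect scheduler could skip it without needing the swarm_is_all_seeds special case.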

In addition, we have the problem (related to #6910 ) where a seed is never removed from the candidate atom list. Instead, we just keep reconnecting to it every 2 seconds...which I guess is better than every 1 second.

AFAICT, the only way for a peer to be skipped in the candidate atom list is mark_peer_as_seed, via the swarm_is_all_seeds logic that applies to private trackers.

Is it worth breaking this out to another issue?

reardonia commented 3 months ago

This is a hack, but it gives an idea towards solving the propagation of seed status to connectable_pool:

in peer-mgr:

[[nodiscard]] bool shouldPeerBeClosed(tr_swarm const* s, tr_peerMsgs const* peer, size_t peer_count, time_t const now)
{
    if (peer->is_seed())
    {
        tr_logAddTraceSwarm(s, fmt::format("peer {} is_seed so lets drop? torrent is_done {}", peer->display_name(), s->tor->is_done()));

        auto* const p_i = s->get_existing_peer_info(peer->socket_address());
        if (p_i != nullptr)
        {
            p_i->set_seed();
        }
    }

    // ... rest of the existing function unchanged
tearfur commented 3 months ago

I think you are pointing right at the problem: "if (error) { do NOT call can_read_wrapper }"

I think I see what's wrong now. The code you pointed out here serves to reveal the problem, but it is not the root cause. Please try out my PR.

tearfur commented 3 months ago

Is it worth breaking this out to another issue?

Yes please, I think some of your points are valid. I will respond in detail over there once the issue is up.