josecelano opened 2 months ago
It seems there are a couple of IPs making a lot of requests:
From: 2024-11-15T11:50:26.058130Z (initial record time in logs) To: 2024-11-15T11:59:14.013546Z (final record time in logs)
This is the number of records containing the error per IP:
11249 ***.181.47.215
10321 ***.38.196.30
1499 ***.223.68.238
951 ***.119.120.243
943 ***.0.0.0
935 ***.101.195.3
834 ***.65.200.220
630 ***.203.102.228
574 ***.223.5.76
535 ***.155.138.252
460 ***.151.164.47
428 ***.69.249.56
393 ***.222.100.18
386 ***.29.114.32
363 ***.0.5.10
347 ***.234.37.79
308 ***.89.28.234
301 ***.229.191.13
288 ***.97.50.183
276 ***.158.172.143
274 ***.97.50.172
258 ***.107.158.197
247 ***.229.106.137
241 ***.226.159.156
214 ***.149.119.171
211 ***.166.14.69
206 ***.178.10.168
206 ***.98.131.32
200 ***.163.118.85
195 ***.74.11.23
192 ***.147.21.36
191 ***.141.206.245
181 ***.18.34.133
177 ***.164.14.112
170 ***.44.119.149
151 ***.80.9.163
149 ***.233.109
144 ***.95.40.108
138 ***.80.9.243
136 ***.81.185.211
134 ***.111.17.112
127 ***.52.86.157
127 ***.80.9.208
127 ***.203.250.69
125 ***.182.100.130
121 ***.72.33.10
121 ***.172.5.182
120 ***.80.9.65
119 ***.88.128.13
119 ***.90.177.100
118 ***.87.174.105
114 ***.141.100.59
112 ***.219.30.138
112 ***.225.218.245
110 ***.140.134.59
105 ***.49.128.219
104 ***.19.60.153
100 ***.18.134.24
I've tried with another UDP tracker client and it works:
I've added a new UDP tracker to the demo on a new port, 6868. It's working using our tracker client and also qBittorrent:
$ cargo run --bin udp_tracker_client announce 144.126.245.19:6868 9c38422213e30bff212b30c360d26f9a02136422
Compiling torrust-tracker-client v3.0.0-develop (/home/josecelano/Documents/git/committer/me/github/torrust/torrust-tracker/console/tracker-client)
Finished `dev` profile [optimized + debuginfo] target(s) in 7.78s
Running `/home/josecelano/Documents/git/committer/me/github/torrust/torrust-tracker/target/debug/udp_tracker_client announce '144.126.245.19:6868' 9c38422213e30bff212b30c360d26f9a02136422`
sending connection request...
connection_id: ConnectionId(I64(-9193186782767578663))
sending announce request...
{
  "AnnounceIpv4": {
    "transaction_id": -888840697,
    "announce_interval": 300,
    "leechers": 0,
    "seeders": 1,
    "peers": []
  }
}
Either the clients are not sending the connection ID correctly, or there is a bug when the load is high.
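For reference, this is what a compliant client is expected to do per BEP 15: send a connect request, read the connection_id from the response, and echo that same connection_id in every announce while it is still valid. A minimal sketch using only the Rust standard library (the tracker address is the demo endpoint above; the info-hash and peer ID are placeholders):

```rust
use std::net::UdpSocket;

// Minimal BEP 15 exchange: connect first, then reuse the returned
// connection_id in the announce. Tracker address, info-hash and peer ID
// are placeholders for illustration only.
fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    socket.connect("144.126.245.19:6868")?; // demo tracker from above

    // Connect request: protocol_id (magic constant), action = 0, transaction_id.
    let mut connect = Vec::with_capacity(16);
    connect.extend_from_slice(&0x0417_2710_1980_u64.to_be_bytes());
    connect.extend_from_slice(&0_u32.to_be_bytes()); // action: connect
    connect.extend_from_slice(&0x1234_5678_u32.to_be_bytes()); // transaction_id
    socket.send(&connect)?;

    // Connect response: action, transaction_id, connection_id (last 8 bytes).
    let mut buf = [0_u8; 16];
    socket.recv(&mut buf)?;
    let connection_id: [u8; 8] = buf[8..16].try_into().unwrap();

    // The announce request MUST echo the connection_id it just received.
    let mut announce = Vec::with_capacity(98);
    announce.extend_from_slice(&connection_id);
    announce.extend_from_slice(&1_u32.to_be_bytes()); // action: announce
    announce.extend_from_slice(&0x1234_5679_u32.to_be_bytes()); // transaction_id
    announce.extend_from_slice(&[0_u8; 20]); // info_hash (placeholder)
    announce.extend_from_slice(b"-XX0000-000000000000"); // peer_id (placeholder)
    announce.extend_from_slice(&0_i64.to_be_bytes()); // downloaded
    announce.extend_from_slice(&0_i64.to_be_bytes()); // left
    announce.extend_from_slice(&0_i64.to_be_bytes()); // uploaded
    announce.extend_from_slice(&0_u32.to_be_bytes()); // event: none
    announce.extend_from_slice(&0_u32.to_be_bytes()); // IP: default (sender address)
    announce.extend_from_slice(&0_u32.to_be_bytes()); // key
    announce.extend_from_slice(&(-1_i32).to_be_bytes()); // num_want: default
    announce.extend_from_slice(&6881_u16.to_be_bytes()); // port
    socket.send(&announce)?; // (response handling omitted)

    Ok(())
}
```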
I'm going to enable the trace level to check if the clients are sending the connection ID they get in the connect request.
cc @da2ce7
I think there is another possible cause: if announce requests take longer than 2 minutes to be processed, the connection ID has already expired.
Let's move to symmetrically encrypted connection IDs so that we can return an error response saying when the ID expired (or that it is simply invalid).
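This is not the draft implementation, just a sketch of the idea: pack the issue time into the 64-bit connection ID, encrypt it with a server-side secret, and on every announce decrypt it and compare the recovered issue time with the validity window, so the server can report "expired" separately from "invalid". The toy permutation below stands in for a real 64-bit block cipher, and all names are illustrative:

```rust
// Sketch only: an encrypted connection ID that carries its own issue time.
// `toy_permute`/`toy_unpermute` are NON-cryptographic stand-ins for a real
// 64-bit block cipher keyed with a server-side secret.

const SECRET: u64 = 0x0123_4567_89ab_cdef; // would come from server config

fn toy_permute(x: u64) -> u64 {
    (x ^ SECRET).rotate_left(17)
}

fn toy_unpermute(x: u64) -> u64 {
    x.rotate_right(17) ^ SECRET
}

/// Issue a connection ID embedding its issue time (epoch seconds).
fn issue_connection_id(issue_time: u32, client_fingerprint: u32) -> u64 {
    let packed = ((issue_time as u64) << 32) | client_fingerprint as u64;
    toy_permute(packed)
}

/// Check a connection ID against the current time and the cookie lifetime.
fn check_connection_id(id: u64, now: u32, lifetime: u32, client_fingerprint: u32) -> Result<(), String> {
    let packed = toy_unpermute(id);
    let issue_time = (packed >> 32) as u32;
    let fingerprint = packed as u32;

    if fingerprint != client_fingerprint {
        return Err("invalid connection id".into());
    }
    if issue_time > now {
        return Err("connection id produces a future value".into());
    }
    if now - issue_time > lifetime {
        return Err(format!("connection id expired {}s ago", now - issue_time - lifetime));
    }
    Ok(())
}

fn main() {
    let now = 1_732_017_334; // example epoch seconds, taken from the logs below
    let id = issue_connection_id(now, 0xdead_beef);
    assert!(check_connection_id(id, now + 30, 120, 0xdead_beef).is_ok());
    assert!(check_connection_id(id, now + 300, 120, 0xdead_beef).is_err());
}
```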
@josecelano I have created a draft (unfinished for now) implementation of the encrypted connection ID:
Hi @da2ce7, cool, I will take a look tomorrow. We have to check the performance; I think the hashing for the connection ID was one of the things that took a large percentage of the total request-processing time.
Hi @da2ce7, I have updated the tracker demo after merging your PR. I've checked the logs, and we are still getting many errors. However, now we know the exact reason:
tracker | 2024-11-19T11:55:34.607629Z ERROR process_request:handle_packet{cookie_time_values=CookieTimeValues { issue_time: 1732017334.607587, valid_range: 1732017213.607587..1732017335.607587 }}:handle_request{cookie_time_values=CookieTimeValues { issue_time: 1732017334.607587, valid_range: 1732017213.607587..1732017335.607587 }}:handle_announce{remote_addr=x.x.243.90:55801 announce_request=AnnounceRequest { connection_id: ConnectionId(I64(5089751780245633787)), action_placeholder: AnnounceActionPlaceholder(I32(1)), transaction_id: TransactionId(I32(86974521)), info_hash: InfoHash([3, 193, 2, 218, 90, 252, 215, 97, 138, 105, 135, 38, 246, 146, 158, 204, 101, 126, 251, 181]), peer_id: PeerId([45, 113, 66, 52, 54, 51, 65, 45, 40, 53, 55, 107, 45, 88, 126, 116, 102, 121, 90, 121]), bytes_downloaded: NumberOfBytes(I64(2441216)), bytes_left: NumberOfBytes(I64(5933010514)), bytes_uploaded: NumberOfBytes(I64(0)), event: AnnounceEventBytes(I32(0)), ip_address: Ipv4AddrBytes([0, 0, 0, 0]), key: PeerKey(I32(-1872933448)), peers_wanted: NumberOfPeers(I32(200)), port: Port(U16(6279)) } cookie_valid_range=1732017213.607587..1732017335.607587}: torrust_tracker::servers::udp::handlers: error=connection id produces a future value
tracker | 2024-11-19T11:55:34.608274Z ERROR process_request:handle_packet{cookie_time_values=CookieTimeValues { issue_time: 1732017334.6081936, valid_range: 1732017213.6081936..1732017335.6081936 }}:handle_request{cookie_time_values=CookieTimeValues { issue_time: 1732017334.6081936, valid_range: 1732017213.6081936..1732017335.6081936 }}:handle_announce{remote_addr=x.x.243.90:55801 announce_request=AnnounceRequest { connection_id: ConnectionId(I64(5089751780245633787)), action_placeholder: AnnounceActionPlaceholder(I32(1)), transaction_id: TransactionId(I32(-1123371308)), info_hash: InfoHash([177, 53, 173, 166, 50, 153, 156, 70, 195, 247, 241, 9, 154, 50, 139, 168, 220, 166, 79, 5]), peer_id: PeerId([45, 113, 66, 52, 54, 51, 65, 45, 67, 51, 76, 126, 119, 117, 73, 90, 49, 105, 69, 99]), bytes_downloaded: NumberOfBytes(I64(859106175)), bytes_left: NumberOfBytes(I64(285212672)), bytes_uploaded: NumberOfBytes(I64(450582314)), event: AnnounceEventBytes(I32(0)), ip_address: Ipv4AddrBytes([0, 0, 0, 0]), key: PeerKey(I32(1502953212)), peers_wanted: NumberOfPeers(I32(200)), port: Port(U16(6279)) } cookie_valid_range=1732017213.6081936..1732017335.6081936}: torrust_tracker::servers::udp::handlers: error=connection id produces a future value
It would be convenient to include the unencrypted issue time in the logs to know the exact value. Anyway, I'm going to continue with my debugging plan. I want to check if these failing requests have a previous connect request from the same IP. If not, this could just be spam or a DDoS attack.
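A rough sketch of that check over the trace logs, assuming the connect handler shows up in the log lines as handle_connect and carries the same remote_addr field as handle_announce (both of those names are assumptions here):

```rust
use std::collections::HashSet;
use std::fs;

// Extract the IP (without port) that follows a `remote_addr=` field.
fn remote_ip(line: &str) -> Option<&str> {
    let start = line.find("remote_addr=")? + "remote_addr=".len();
    let rest = &line[start..];
    let end = rest.find(':')?;
    Some(&rest[..end])
}

fn main() -> std::io::Result<()> {
    let logs = fs::read_to_string("./tracker_logs.txt")?;

    let mut connected: HashSet<&str> = HashSet::new();
    let mut failing: HashSet<&str> = HashSet::new();

    for line in logs.lines() {
        if let Some(ip) = remote_ip(line) {
            if line.contains("handle_connect") {
                connected.insert(ip);
            }
            if line.contains("error=connection id") {
                failing.insert(ip);
            }
        }
    }

    // IPs that fail the connection-id check but never appear in a connect span.
    for ip in failing.difference(&connected) {
        println!("{ip}");
    }

    Ok(())
}
```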
@josecelano I will try and make another PR that makes the log messages nicer for this case.
Hi @da2ce7, I'm getting many announce request errors from my BitTorrent client. I think it is because the demo tracker is not responding in time. The server CPU has stayed under 50% over the last 14 days, so it does not look CPU-bound. I suppose the problem is that we log a lot of request errors for this issue, and writing those errors to the log file makes the UDP tracker slower.
I think we should either:
Could it be that some clients don't make the connect request, or that they make it incorrectly?
Here is what the AI says.
Given your requirements, a Counting Bloom Filter (CBF) or a similar probabilistic data structure can be adapted to handle this scenario effectively. Here's how you could implement a solution that addresses your concerns:
Hierarchical Counting Bloom Filters:
Individual IP Layer: Use one CBF to track individual IP addresses. This will allow for fine-grained detection of misbehaving IPs.
Subnet Layer: Implement another level of CBFs where each filter covers a subnet range. This allows for detecting patterns of misbehavior across a subnet without penalizing neighboring IPs inappropriately.
Individual IP CBF:
Subnet CBF:
Granularity: This approach gives you both detailed control over individual IPs and the ability to detect broader patterns of misbehavior within subnets.
Performance: CBFs are efficient in terms of memory usage and speed, allowing for quick lookups and updates even with large datasets.
Flexibility: You can adjust the thresholds for individual IPs and subnets separately, allowing for different levels of tolerance based on your policy or observed behavior patterns.
False Positives: While CBFs have a small chance of false positives, by using multiple levels (individual and subnet), you can mitigate the impact. For example, if an IP is flagged at both levels, it's more likely to be a true positive.
Configuration: You need to decide on the size of the CBFs, the number of hash functions, and the error thresholds for both individual IPs and subnets. This requires some experimentation or simulation to find the right balance between false positives, memory usage, and effectiveness.
Complexity: Managing two layers of CBFs introduces additional complexity in terms of implementation and maintenance.
False Positives at Subnet Level: If a subnet contains both misbehaving and well-behaving IPs, the well-behaving ones might suffer from the actions taken against the subnet.
Decide on the Subnet Size: Determine what constitutes a subnet for your purposes (e.g., /24, /16, etc.).
Initialize CBFs:
Error Handling Logic:
Action Protocol:
Decay Mechanism: Implement a decay or aging process for counts to ensure that past behavior doesn't indefinitely affect current interactions unless the behavior persists.
By employing this hierarchical approach with Counting Bloom Filters, you can effectively manage IP-based errors at different levels of granularity, protecting your network's performance while minimizing the impact on innocent IPs.
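To make that suggestion concrete, here is a minimal, self-contained sketch of the two-layer idea (standard library only; sizes, hash counts, and thresholds are invented for illustration): one counting filter keyed by the full IPv4 address and one keyed by its /24 prefix, with an IP flagged only when both layers exceed a threshold. A real deployment would also need the decay mechanism mentioned above, e.g. periodically halving all counters.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::Ipv4Addr;

/// A tiny counting Bloom filter over u64 keys: `hashes` hash functions
/// indexing into a table of saturating u8 counters.
struct CountingBloom {
    counters: Vec<u8>,
    hashes: u64,
}

impl CountingBloom {
    fn new(size: usize, hashes: u64) -> Self {
        Self { counters: vec![0; size], hashes }
    }

    fn index(&self, key: u64, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        key.hash(&mut h);
        (h.finish() as usize) % self.counters.len()
    }

    fn increment(&mut self, key: u64) {
        for seed in 0..self.hashes {
            let i = self.index(key, seed);
            self.counters[i] = self.counters[i].saturating_add(1);
        }
    }

    /// Upper-bound estimate of how many times `key` was counted
    /// (collisions can only inflate the counters, never lose counts).
    fn estimate(&self, key: u64) -> u8 {
        (0..self.hashes).map(|s| self.counters[self.index(key, s)]).min().unwrap_or(0)
    }
}

fn main() {
    // Two layers: one keyed by the full IPv4 address, one by its /24 prefix.
    let mut per_ip = CountingBloom::new(1 << 16, 4);
    let mut per_subnet = CountingBloom::new(1 << 12, 4);

    let offender: Ipv4Addr = "198.51.100.23".parse().unwrap();
    let ip_key = u64::from(u32::from(offender));
    let subnet_key = ip_key >> 8; // /24 prefix

    // On every connection-id error, bump both layers.
    for _ in 0..50 {
        per_ip.increment(ip_key);
        per_subnet.increment(subnet_key);
    }

    // Thresholds are policy choices; an IP is flagged only when both its own
    // counter and its subnet counter pass the threshold.
    let flagged = per_ip.estimate(ip_key) >= 20 && per_subnet.estimate(subnet_key) >= 20;
    println!("flagged = {flagged}");
}
```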
Hi @da2ce7, I've downloaded 3 minutes of tracker logs at the trace level.
Socket addresses ordered by number of log records (first 10):
$ grep -oP "\d+\.\d+\.\d+\.\d+:\d+" ./tracker_logs.txt | sort | uniq -c | sort -nr | head
84398 0.0.0.0:6969
4632 *.142.69.160:53234
3311 *.248.77.85:38473
2581 *.241.177.250:43822
2276 *.182.117.236:43124
2216 *.86.246.61:1024
1895 *.118.235.66:6881
1846 *.38.196.30:6816
1845 *.181.47.215:6817
1818 *.181.47.215:6812
IP addresses ordered by number of log records (first 10):
$ grep -oP "\d+\.\d+\.\d+\.\d+" ./tracker_logs.txt | sort | uniq -c | sort -nr | head
85291 0.0.0.0
13435 *.38.196.30
12789 *.181.47.215
5052 *.65.200.220
4869 *.119.120.243
4785 *.79.3.85
4632 *.142.69.160
3895 *.248.77.85
3734 *.165.243.54
3403 *.236.91.11
Socket addresses for a single IP (the second IP by number of log records), ordered by number of log records (first 10):
$ grep -oP "*.38.196.30:.\d+" ./tracker_logs.txt | sort | uniq -c | sort -nr | head
1846 *.38.196.30:6816
1755 *.38.196.30:6812
1728 *.38.196.30:6814
1710 *.38.196.30:6811
1689 *.38.196.30:6817
1602 *.38.196.30:6813
1566 *.38.196.30:6818
1539 *.38.196.30:6815
I've also checked one socket address. The client is making the connect request, but it's not using the received ID in the announce request. It's always using this connection ID:
ConnectionId(I64(-2325979768337461225))
The 0.0.0.0 IP in the lists above is due to these log records:
tracker | 2024-11-20T10:04:09.991917Z TRACE run_udp_server_main{cookie_lifetime=120s}: UDP TRACKER: Udp::run_udp_server (wait for request) local_addr="udp://0.0.0.0:6969"
tracker | 2024-11-20T10:04:09.991942Z TRACE run_udp_server_main{cookie_lifetime=120s}: UDP TRACKER: Udp::run_udp_server::loop (in) local_addr="udp://0.0.0.0:6969"
I guess this is a bad client implementation. I can email you the logs if you want to check other things.
@josecelano I will try and make another PR that makes the log messages nicer for this case.
I've been using the demo tracker logs, and there are a lot of errors verifying the connection ID: