Closed. jcsp closed this 2 years ago.
@graphcareful did you end up finding something here? I recall a discussion on Slack about this.
I've been taking a look at the nightly build runs since @ajfabbri's vtools/785 commit, but I haven't seen anything. My guess is it was a rogue connection; 785 might help us track down which machine made that request.
Closing as this hasn't been seen since; will assume a rogue connection was the cause of the spurious error logs.
More instances:
FAIL test: ClusterConfigTest.test_restart (1/2 runs)
failure at 2022-07-31T11:04:20.402Z:
I can buy that this is a symptom of a rogue client, but we still need to get to the bottom of it
Agreed, I'm seeing IP addresses that start with 198.x.x.x in the logs
report.txt: <BadLogLines nodes=ip-172-31-62-254(1) example="ERROR 2022-07-31 06:41:46,763 [shard 0] rpc - server.cc:126 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.143:34111 - std::runtime_error (Unexpected EOF for client ID)">
However, when taking a look at the associated cluster.json in the results root, I do not see any IPs that match (a quick membership check is sketched after the snippet). Can we confirm then that this client is not a portion of our deployed infra for CDT?
➜ ducktape cat cluster.json
{
  "nodes": [
    {
      "externally_routable_ip": "35.86.201.197",
      "ssh_config": {
        "host": "ip-172-31-59-139",
        "hostname": "172.31.59.139",
        "identityfile": "/home/ubuntu/.ssh/id_rsa",
        "port": 22,
        "user": "root",
        "password": null
      }
    },
    {
      "externally_routable_ip": "44.242.173.148",
      "ssh_config": {
        "host": "ip-172-31-51-160",
        "hostname": "172.31.51.160",
        "identityfile": "/home/ubuntu/.ssh/id_rsa",
        "port": 22,
        "user": "root",
        "password": null
      }
    },
    {
      "externally_routable_ip": "35.87.222.235",
      "ssh_config": {
        "host": "ip-172-31-55-1",
        "hostname": "172.31.55.1",
        "identityfile": "/home/ubuntu/.ssh/id_rsa",
        "port": 22,
        "user": "root",
        "password": null
      }
    },
    ....
etc
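As a quick check (a hypothetical snippet, not part of the test suite), the suspect address can be compared against every IP recorded in cluster.json:

# Hypothetical helper: check whether a suspect peer IP belongs to any CDT node
# listed in cluster.json (field names taken from the snippet above).
import json

def ip_in_cluster(suspect: str, path: str = "cluster.json") -> bool:
    with open(path) as f:
        nodes = json.load(f)["nodes"]
    known = {n["externally_routable_ip"] for n in nodes}
    known |= {n["ssh_config"]["hostname"] for n in nodes}
    return suspect in known

print(ip_in_cluster("198.235.24.143"))  # expected: False, per the comment above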
Can we confirm then that this client is not a portion of our deployed infra for CDT?
No idea -- it might be interesting if you try to spin up a clustered ducktape environment and e.g. poke around in netstat to see if you can sniff out any traffic from funny-looking IPs, and also check that the nodes aren't listening to the outside world: they're not supposed to be (terraform controls the security group rules for these instances), but it's worth checking.
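Along those lines, a rough sketch of what that poking around could look like on a node (assuming Linux ss is available; none of this is existing tooling), flagging listeners bound to all interfaces and established connections from non-private peers:

# Rough sketch: flag sockets listening on all interfaces and established
# connections whose peer is not a private (in-VPC) address.
import ipaddress
import subprocess

def ss(args):
    return subprocess.run(["ss"] + args, capture_output=True,
                          text=True, check=True).stdout.splitlines()

# 1) Anything listening on 0.0.0.0 / [::] is reachable from outside if the
#    security group allows it (e.g. the kafka port 9092).
for line in ss(["-ltn"])[1:]:
    local = line.split()[3]
    if local.startswith(("0.0.0.0:", "*:", "[::]:")):
        print("listening on all interfaces:", local)

# 2) Established connections from non-private peers are almost certainly
#    external scanners or stray clients.
for line in ss(["-tn"])[1:]:
    cols = line.split()
    if cols and cols[0] == "ESTAB":
        peer_ip = cols[4].rsplit(":", 1)[0].strip("[]")
        if not ipaddress.ip_address(peer_ip).is_private:
            print("external peer:", cols[4])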
Slight variation here:
FAIL test: TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=delete (1/29 runs)
failure at 2022-08-07T08:32:31.679Z: <BadLogLines nodes=ip-172-31-56-167(1) example="ERROR 2022-08-07 06:32:58,874 [shard 0] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 162.142.125.211:36738 - std::runtime_error (Invalid client_id size -27453)">
in job https://buildkite.com/redpanda/vtools/builds/3133#01827642-8678-41b8-a243-c532aa5f80ce
And same variation here:
failure at 2022-08-08T07:13:23.597Z: <BadLogLines nodes=ip-172-31-60-231(1) example="ERROR 2022-08-08 03:41:32,002 [shard 0] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.3:60085 - std::runtime_error (Unexpected EOF for client ID)">
in job https://buildkite.com/redpanda/vtools/builds/3139#01827b68-ebc9-4ff8-867f-d9461479bafe
And here:
FAIL test: NodeOperationFuzzyTest.test_node_operations.enable_failures=True.num_to_upgrade=3 (1/2 runs)
failure at 2022-08-08T07:13:23.597Z: <BadLogLines nodes=ip-172-31-48-220(1) example="ERROR 2022-08-08 04:26:36,936 [shard 2] rpc - server.cc:116 - kafka rpc protocol - Error[applying protocol] remote address: 205.210.31.141:52343 - std::runtime_error (Unexpected EOF for client ID)">
in job https://buildkite.com/redpanda/vtools/builds/3139#01827b68-ebc9-4ff8-867f-d9461479bafe
https://buildkite.com/redpanda/vtools/builds/3156#018285b5-c1c2-4484-b724-4dcd4e03d7a3
Module: rptest.tests.partition_move_interruption_test
Class: PartitionMoveInterruption
Method: test_cancelling_partition_move
Arguments:
{
"recovery": "restart_recovery",
"replication_factor": 1,
"unclean_abort": false
}
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-58-85(1) example="ERROR 2022-08-10 05:11:29,018 [shard 3] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 162.142.125.222:54538 - std::runtime_error (Invalid client_id size -29814)">
And
Module: rptest.tests.partition_movement_test
Class: SIPartitionMovementTest
Method: test_cross_shard
Arguments:
{
"num_to_upgrade": 2
}
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-52-45(1) example="ERROR 2022-08-10 05:02:05,921 [shard 0] rpc - server.cc:116 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.142:57045 - std::runtime_error (Unexpected EOF for client ID)">
https://buildkite.com/redpanda/vtools/builds/3183#01828adc-074b-45e5-b224-1245d61997dc
Module: rptest.scale_tests.ht_partition_movement_test
Class: HighThroughputPartitionMovementTest
Method: test_interrupting_partition_movement_under_load
Arguments:
{
"replication_factor": 3
}
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-63-199(1) example="ERROR 2022-08-11 05:40:47,472 [shard 0] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.141:43031 - std::runtime_error (Unexpected EOF for client ID)">
https://buildkite.com/redpanda/vtools/builds/3220#01829002-7fe5-4775-aba3-8fa06d20b3d3
Module: rptest.scale_tests.ht_partition_movement_test
Class: HighThroughputPartitionMovementTest
Method: test_interrupting_partition_movement_under_load
Arguments:
{
"replication_factor": 1
}
After doing an IP address lookup on one of the client IPs:
IP Address: 162.142.125.222
Hostname: scanner-25.ch1.censys-scanner.com
Location: Michigan, USA
Looks like it is a port-scanning tool. I will add some additional logging in redpanda to confirm that the data parsed before the failure is inconsistent with the Kafka protocol.
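For reference, the lookup above can be reproduced with the Python standard library (hypothetical snippet; the PTR record is what identifies the scanner):

# Reverse-DNS the suspect client IP.
import socket

hostname, _aliases, _ips = socket.gethostbyaddr("162.142.125.222")
print(hostname)  # reported above as scanner-25.ch1.censys-scanner.com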
I think devprod (@ivotron) should chime in here. I see the CDT installation code makes all of the nodes publicly accessible from the outside internet (advertise_public_ips: true), which is probably useful for integration with buildkite and our tooling, but it isn't necessary for all ports, like 9092, to be accessible. I think we should only expose what is necessary to start jobs and obtain results from buildkite, probably only 22, and have the other ports reachable only from other CDT nodes.
I think we should only expose what is necessary to start jobs and obtain results from buildkite, probably only 22, and have the other ports reachable only from other CDT nodes.
I agree. I think we've ended up in an in-between state where we're going to the trouble of running ducktape remotely in an EC2 instance in order to avoid relying on public access to all the nodes, but nothing was actually put in place to prevent the nodes being publicly accessible.
In fact, 9092 is explicitly allowed in from cluster.tf (most other ports are not)
ingress {
  from_port   = 9092
  to_port     = 9092
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}
Maybe we just remove that cidr_blocks from the 9092 ingress rule.
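As a quick audit, something like the following sketch (assuming boto3 credentials and the right region; none of this exists in vtools) would list every security-group ingress rule that is open to the whole internet, so we can see which ports besides 22 are exposed:

# Sketch: enumerate security-group ingress rules open to 0.0.0.0/0.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # region is an assumption
for sg in ec2.describe_security_groups()["SecurityGroups"]:
    for rule in sg.get("IpPermissions", []):
        if any(r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", [])):
            print(sg["GroupId"], sg.get("GroupName"),
                  rule.get("IpProtocol"), rule.get("FromPort"), rule.get("ToPort"))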
Seen again here
FAIL test: HighThroughputPartitionMovementTest.test_interrupting_partition_movement_under_load.replication_factor=3 (1/1 runs)
failure at 2022-08-17T07:38:31.224Z:
Maybe we just remove that cidr_blocks from the 9092 ingress rule.
Will this affect our ability to run the ducktape runner from our laptops?
Yes, if that was working at all currently: I think most of the other ports were already not exposed to the open internet, so I'd be surprised if all of the tests passed when running ducktape from your laptop.
I don't love that we run ducktape from a dedicated VM, but properly supporting runs from a remote machine will require a bit of thought, possibly defaulting to mTLS for Kafka connections and/or refactoring all the places where tests currently expect to speak to port 9092 on the nodes under test directly (rather than using a producer/consumer service).
Another instance: https://buildkite.com/redpanda/vtools/builds/3339#0182c381-f5d0-413b-a501-4d7eae01ff51
FAIL test: TestMirrorMakerService.test_consumer_group_mirroring.source_type=kafka (1/17 runs)
failure at 2022-08-22T07:24:31.179Z: <BadLogLines nodes=ip-172-31-54-137(1) example="ERROR 2022-08-22 03:41:15,042 [shard 3] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 205.210.31.2:63489 - std::runtime_error (Unexpected EOF for client ID, client_id_size: 11825, header: {key:12064, version:18516, correlation_id:1414541105, client_id:nullopt, number_of_tagged_fields: 0, tags_size_bytes:0})">
in job https://buildkite.com/redpanda/vtools/builds/3339#0182c381-f5d0-413b-a501-4d7eae01ff51
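Incidentally, the header values in that log line decode cleanly as an HTTP probe: laying the ASCII bytes of "GET / HTTP/1.1" over the Kafka request framing (int32 size, then int16 api_key, int16 api_version, int32 correlation_id, int16 client_id length, all big-endian) yields exactly the numbers logged above, and the parser is then left waiting for an 11825-byte client ID that never arrives. A small illustration (not redpanda code, just the same byte layout):

# Parse the first bytes of an HTTP request line with the Kafka request framing.
import struct

probe = b"GET / HTTP/1.1\r\nHost: example\r\n\r\n"
_size, api_key, api_version, correlation_id, client_id_len = struct.unpack_from(
    ">ihhih", probe)
print(api_key, api_version, correlation_id, client_id_len)
# -> 12064 18516 1414541105 11825, matching the logged header; the connection
#    ends long before 11825 bytes of "client ID" arrive, hence the
#    "Unexpected EOF for client ID" error.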
@graphcareful this is still assigned to you: are you working on closing the ports?
@jcsp I've actually been trying to get in touch with the author(s) of ducky.py or CDT to get more information. I'll ask around on the devprod channels.
Two more occurrences: https://buildkite.com/redpanda/vtools/builds/3386#0182cdd0-bf0c-491d-8f68-f302f435c994
Leaving an update here: I've been running a few tests from my working PR, https://github.com/redpanda-data/vtools/pull/900
The last run is here: https://buildkite.com/redpanda/vtools/builds/3408#0182d532-5135-4032-9a61-4fa5640f4a1a
With the last batch of changes it looks like all tests are passing, with the exception of the ones that use the kgo-verifier.
FAIL: TimeoutError('Timed out waiting for status endpoint KgoVerifierProducer-0-140154481953040 to be available')
[DEBUG - 2022-08-25 15:55:05,324 - kgo_verifier_services - is_ready - lineno:226]: Status endpoint KgoVerifierProducer-0-140154681963376 not ready: HTTPConnectionPool(host='ip-172-31-60-73', port=8080): Max retries exceeded with url: /status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f78421b3070>: Failed to establish a new connection: [Errno 111] Connection refused'))
I'm thinking maybe kgo-verifier isn't ready because it has been blocked due to the closure of 9092 somehow?
EDIT: After seeing some of the recent modifications to the kgo-verifier class and how the dependency itself was recently updated, I think I just needed to rebase my vtools branch against dev. Re-running new test now.
Sorry, those "Status endpoint unavailable" ones are my fault, I forgot to do this: https://github.com/redpanda-data/vtools/pull/910
Looks like it worked:
SESSION REPORT (ALL TESTS)
| ducktape version: 0.8.8
| session_id: 2022-08-26--001
| run time: 74 minutes 53.254 seconds
| tests run: 506
| passed: 451
| failed: 0
| ignored: 0
| opassed: 55
| ofailed: 0
Manually closing this issue as resolved due to the merge of https://github.com/redpanda-data/vtools/pull/900; please re-open if the issue is still observed.
Bunch of tests affected by this in the last 24h, all running in clustered ducktape:
FAIL test: OpenBenchmarkTest.test_multiple_topics_omb (1/2 runs) failure at 2022-07-07T10:51:23.073Z:
in job https://buildkite.com/redpanda/vtools/builds/2806#0181d77e-2fe1-4594-876a-062eb75b50c0
FAIL test: ManyClientsTest.test_many_clients (1/2 runs) failure at 2022-07-06T10:53:03.664Z:
in job https://buildkite.com/redpanda/vtools/builds/2780#0181d253-0b91-4bf5-9c2b-2a1c91031e77
FAIL test: TopicRecreateTest.test_topic_recreation_while_producing.workload=ACKS_ALL.cleanup_policy=compact (1/58 runs) failure at 2022-07-06T10:53:03.664Z:
in job https://buildkite.com/redpanda/vtools/builds/2780#0181d253-0b91-4bf5-9c2b-2a1c91031e77
FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=True (1/58 runs) failure at 2022-07-06T10:53:03.664Z:
in job https://buildkite.com/redpanda/vtools/builds/2780#0181d253-0b91-4bf5-9c2b-2a1c91031e77
It's not obvious why we're seeing more of these all of a sudden -- the code that generates that exception hasn't changed recently. Previously I saw only 3 instances of this in the entire last 30 days.