redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

"Unexpected EOF for client ID" errors in logs (multiple tests) #5380

Closed jcsp closed 2 years ago

jcsp commented 2 years ago

Bunch of tests affected by this in the last 24h, all running in clustered ducktape:

FAIL test: OpenBenchmarkTest.test_multiple_topics_omb (1/2 runs) failure at 2022-07-07T10:51:23.073Z: in job https://buildkite.com/redpanda/vtools/builds/2806#0181d77e-2fe1-4594-876a-062eb75b50c0

FAIL test: ManyClientsTest.test_many_clients (1/2 runs) failure at 2022-07-06T10:53:03.664Z: in job https://buildkite.com/redpanda/vtools/builds/2780#0181d253-0b91-4bf5-9c2b-2a1c91031e77

FAIL test: TopicRecreateTest.test_topic_recreation_while_producing.workload=ACKS_ALL.cleanup_policy=compact (1/58 runs) failure at 2022-07-06T10:53:03.664Z: in job https://buildkite.com/redpanda/vtools/builds/2780#0181d253-0b91-4bf5-9c2b-2a1c91031e77

FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=True (1/58 runs) failure at 2022-07-06T10:53:03.664Z: in job https://buildkite.com/redpanda/vtools/builds/2780#0181d253-0b91-4bf5-9c2b-2a1c91031e77

It's not obvious why we're seeing more of these all of a sudden -- the code that generates that exception hasn't changed recently. Previously I see only 3 instances of this in the entire last 30 days.

dotnwat commented 2 years ago

@graphcareful did you end up finding something here? i recall a discussion on slack about this.

graphcareful commented 2 years ago

I've been taking a look at the nightly build runs since @ajfabbri's vtools/785 commit, but I haven't seen anything. My guess is that it was a rogue connection. 785 might help us track down which machine made that request.

graphcareful commented 2 years ago

Closing as this hasn't been seen since; will assume a rogue connection was the cause of the spurious error logs.

jcsp commented 2 years ago

More instances:

FAIL test: ClusterConfigTest.test_restart (1/2 runs) failure at 2022-07-31T11:04:20.402Z: in job https://buildkite.com/redpanda/vtools/builds/3073#018252da-dd82-4dac-bad9-ea37369ded6e

FAIL test: FranzGoVerifiableWithSiTest.test_si_with_timeboxed.segment_size=20971520 (1/2 runs) failure at 2022-07-31T11:04:20.402Z: in job https://buildkite.com/redpanda/vtools/builds/3073#018252da-dd82-4dac-bad9-ea37369ded6e

I can buy that this is a symptom of a rogue client, but we still need to get to the bottom of it

graphcareful commented 2 years ago

Agreed. I'm seeing IP addresses starting with 198.x.x.x in the logs:

report.txt:    <BadLogLines nodes=ip-172-31-62-254(1) example="ERROR 2022-07-31 06:41:46,763 [shard 0] rpc - server.cc:126 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.143:34111 - std::runtime_error (Unexpected EOF for client ID)">

However when taking a look at the associated cluster.json in the results root, I do not see any IPs that match. Can we confirm then that this client is not a portion of our deployed infra for CDT?

➜  ducktape cat cluster.json                                 
{                                                            
    "nodes": [                                               
        {                                                    
            "externally_routable_ip": "35.86.201.197",
            "ssh_config": {                                  
                "host": "ip-172-31-59-139",
                "hostname": "172.31.59.139",                 
                "identityfile": "/home/ubuntu/.ssh/id_rsa",
                "port": 22,                                  
                "user": "root",                              
                "password": null                           
            }                                                
        },                                                   
        {                                                    
            "externally_routable_ip": "44.242.173.148",
            "ssh_config": {                                  
                "host": "ip-172-31-51-160",
                "hostname": "172.31.51.160",                 
                "identityfile": "/home/ubuntu/.ssh/id_rsa",
                "port": 22,                                  
                "user": "root",                              
                "password": null                           
            }                                                
        },                                                   
        {                                                    
            "externally_routable_ip": "35.87.222.235",
            "ssh_config": {                                  
                "host": "ip-172-31-55-1",
                "hostname": "172.31.55.1",                   
                "identityfile": "/home/ubuntu/.ssh/id_rsa",
                "port": 22,                                  
                "user": "root",                              
                "password": null                           
            }                                                
        },                                                   
....
etc
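The cross-check can be done mechanically: collect every address the cluster knows about (public and private) and test the suspect IP against that set. A minimal sketch, using a trimmed-down copy of the cluster.json above (the two node entries are taken from the dump; everything else is illustrative):

```python
import json

# Trimmed-down cluster.json in the shape shown above.
cluster = json.loads("""
{
  "nodes": [
    {"externally_routable_ip": "35.86.201.197",
     "ssh_config": {"host": "ip-172-31-59-139", "hostname": "172.31.59.139"}},
    {"externally_routable_ip": "44.242.173.148",
     "ssh_config": {"host": "ip-172-31-51-160", "hostname": "172.31.51.160"}}
  ]
}
""")

# Gather every address the cluster knows about, public and private.
known_ips = set()
for node in cluster["nodes"]:
    known_ips.add(node["externally_routable_ip"])
    known_ips.add(node["ssh_config"]["hostname"])

suspect = "198.235.24.143"
print(suspect in known_ips)  # False: the client is not one of our nodes
```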

jcsp commented 2 years ago

Can we confirm then that this client is not a portion of our deployed infra for CDT?

No idea -- it might be interesting to try spinning up a clustered ducktape environment and e.g. poke around in netstat to see if you can sniff out any traffic from funny-looking IPs. Also check that the nodes aren't listening to the outside world: they're not supposed to be (terraform controls the security group rules for these instances), but it's worth checking.

BenPope commented 2 years ago

https://ci-artifacts.dev.vectorized.cloud/vtools/01827b68-ebc9-4ff8-867f-d9461479bafe/vbuild/ducktape/results/2022-08-08--001/TopicRecreateTest/test_topic_recreation_while_producing/workload=ACKS_1.cleanup_policy=delete/188/

BenPope commented 2 years ago

Slight variation here:

FAIL test: TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=delete (1/29 runs)
  failure at 2022-08-07T08:32:31.679Z: <BadLogLines nodes=ip-172-31-56-167(1) example="ERROR 2022-08-07 06:32:58,874 [shard 0] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 162.142.125.211:36738 - std::runtime_error (Invalid client_id size -27453)">

in job https://buildkite.com/redpanda/vtools/builds/3133#01827642-8678-41b8-a243-c532aa5f80ce

And same variation here:

  failure at 2022-08-08T07:13:23.597Z: <BadLogLines nodes=ip-172-31-60-231(1) example="ERROR 2022-08-08 03:41:32,002 [shard 0] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.3:60085 - std::runtime_error (Unexpected EOF for client ID)">

in job https://buildkite.com/redpanda/vtools/builds/3139#01827b68-ebc9-4ff8-867f-d9461479bafe

And here:

FAIL test: NodeOperationFuzzyTest.test_node_operations.enable_failures=True.num_to_upgrade=3 (1/2 runs)
  failure at 2022-08-08T07:13:23.597Z: <BadLogLines nodes=ip-172-31-48-220(1) example="ERROR 2022-08-08 04:26:36,936 [shard 2] rpc - server.cc:116 - kafka rpc protocol - Error[applying protocol] remote address: 205.210.31.141:52343 - std::runtime_error (Unexpected EOF for client ID)">

in job https://buildkite.com/redpanda/vtools/builds/3139#01827b68-ebc9-4ff8-867f-d9461479bafe

dotnwat commented 2 years ago

https://buildkite.com/redpanda/vtools/builds/3139#01827b68-ebc9-4ff8-867f-d9461479bafe

BenPope commented 2 years ago

https://buildkite.com/redpanda/vtools/builds/3156#018285b5-c1c2-4484-b724-4dcd4e03d7a3

BenPope commented 2 years ago

https://buildkite.com/redpanda/vtools/builds/3156#018285b5-c1c2-4484-b724-4dcd4e03d7a3

Module: rptest.tests.partition_move_interruption_test
Class:  PartitionMoveInterruption
Method: test_cancelling_partition_move
Arguments:
{
  "recovery": "restart_recovery",
  "replication_factor": 1,
  "unclean_abort": false
}
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-58-85(1) example="ERROR 2022-08-10 05:11:29,018 [shard 3] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 162.142.125.222:54538 - std::runtime_error (Invalid client_id size -29814)">

And

Module: rptest.tests.partition_movement_test
Class:  SIPartitionMovementTest
Method: test_cross_shard
Arguments:
{
  "num_to_upgrade": 2
}
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-52-45(1) example="ERROR 2022-08-10 05:02:05,921 [shard 0] rpc - server.cc:116 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.142:57045 - std::runtime_error (Unexpected EOF for client ID)">

BenPope commented 2 years ago

https://buildkite.com/redpanda/vtools/builds/3183#01828adc-074b-45e5-b224-1245d61997dc

Module: rptest.scale_tests.ht_partition_movement_test
Class:  HighThroughputPartitionMovementTest
Method: test_interrupting_partition_movement_under_load
Arguments:
{
  "replication_factor": 3
}
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-63-199(1) example="ERROR 2022-08-11 05:40:47,472 [shard 0] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.141:43031 - std::runtime_error (Unexpected EOF for client ID)">

BenPope commented 2 years ago

https://buildkite.com/redpanda/vtools/builds/3220#01829002-7fe5-4775-aba3-8fa06d20b3d3

Module: rptest.scale_tests.ht_partition_movement_test
Class:  HighThroughputPartitionMovementTest
Method: test_interrupting_partition_movement_under_load
Arguments:
{
  "replication_factor": 1
}

graphcareful commented 2 years ago

After doing an IP address lookup on one of the client IPs:

IP Address: 162.142.125.222
Hostname: scanner-25.ch1.censys-scanner.com
Location: Michigan, USA

Looks like it is a port scanning tool. I will add some additional logging in redpanda to confirm that the data parsed before the failure is inconsistent with the Kafka protocol.

I think devprod (@ivotron) should chime in here. I see that the CDT installation code makes all of the nodes publicly accessible from the outside internet (advertise_public_ips: true). That is probably useful for integration with buildkite and our tooling, but it isn't necessary for all ports to be accessible, like 9092. I think we should only really expose what is necessary to start and obtain results from buildkite, probably only 22, and have the other ports only reachable from other CDT nodes.

jcsp commented 2 years ago

I think we should only really expose what is necessary to start and obtain results from buildkite, probably only 22 and have the other ports only reachable from other CDT nodes.

I agree. I think we've ended up in an in-between state where we're going to the trouble of running ducktape remotely in an EC2 instance in order to avoid relying on public access to all the nodes, but nothing was actually put in place to prevent the nodes being publicly accessible.

In fact, ingress on port 9092 from anywhere is explicitly allowed in cluster.tf (most other ports are not):

  ingress {
    from_port   = 9092
    to_port     = 9092
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

Maybe we just remove that cidr_blocks from the 9092 ingress rule.
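One way to do that (a sketch against the block above, not tested against the actual cluster.tf) is to replace the world-open CIDR with a self-reference, so 9092 is reachable only from instances in the same security group:

  ingress {
    from_port = 9092
    to_port   = 9092
    protocol  = "tcp"
    self      = true   # only peers in this security group, not 0.0.0.0/0
  }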

andrwng commented 2 years ago

Seen again here

FAIL test: HighThroughputPartitionMovementTest.test_interrupting_partition_movement_under_load.replication_factor=3 (1/1 runs) failure at 2022-08-17T07:38:31.224Z: in job https://buildkite.com/redpanda/vtools/builds/3271#0182a9c3-a7aa-4850-a6b7-65bee8152d80

dotnwat commented 2 years ago

Maybe we just remove that cidr_blocks from the 9092 ingress rule.

Will this affect our ability to run the ducktape runner from our laptops?

jcsp commented 2 years ago

Maybe we just remove that cidr_blocks from the 9092 ingress rule.

Will this affect our ability to run the ducktape runner from our laptops?

Yes, assuming that was working at all currently. I think most of the other opened ports were already not exposed to the open internet, so I'd be surprised if all of the tests worked when running ducktape from your laptop.

I don't love that we run ducktape from a dedicated VM, but I think really supporting runs from a remote machine will require a bit of thought, possibly defaulting to mTLS for kafka connections and/or refactoring all the places where tests currently expect to speak to port 9092 on the nodes under test directly (rather than using a producer/consumer service).

bharathv commented 2 years ago

Another instance: https://buildkite.com/redpanda/vtools/builds/3339#0182c381-f5d0-413b-a501-4d7eae01ff51

FAIL test: TestMirrorMakerService.test_consumer_group_mirroring.source_type=kafka (1/17 runs)
  failure at 2022-08-22T07:24:31.179Z: <BadLogLines nodes=ip-172-31-54-137(1) example="ERROR 2022-08-22 03:41:15,042 [shard 3] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 205.210.31.2:63489 - std::runtime_error (Unexpected EOF for client ID, client_id_size: 11825, header: {key:12064, version:18516, correlation_id:1414541105, client_id:nullopt, number_of_tagged_fields: 0, tags_size_bytes:0})">
      in job https://buildkite.com/redpanda/vtools/builds/3339#0182c381-f5d0-413b-a501-4d7eae01ff51
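The garbage header values in that log line are consistent with a plain HTTP request being thrown at port 9092. A short sketch (the payload text is an assumption, but any `GET / HTTP/1.1` request starts with the same 14 bytes) that interprets the bytes of an HTTP request as a Kafka request header reproduces the logged numbers exactly:

```python
import struct

# Bytes a scanner might send to port 9092: a plain HTTP request, not Kafka.
payload = b"GET / HTTP/1.1\r\nHost: scanner\r\n\r\n"

# Kafka wire format (big-endian): int32 size, int16 api_key, int16 api_version,
# int32 correlation_id, int16 client_id length, then client_id bytes.
size, api_key, api_version, correlation_id, client_id_size = struct.unpack(
    ">ihhih", payload[:14]
)

# The server then tries to read an 11825-byte client ID, hits the end of the
# 33-byte payload, and raises "Unexpected EOF for client ID".
print(api_key, api_version, correlation_id, client_id_size)
# 12064 18516 1414541105 11825 -- exactly the values in the log line above
```

So the "key:12064, version:18516, correlation_id:1414541105, client_id_size: 11825" header is just the ASCII of `/ HTTP/1.1` reinterpreted as Kafka fields, which supports the port-scanner theory.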

jcsp commented 2 years ago

@graphcareful this is still assigned to you: are you working on closing the ports?

graphcareful commented 2 years ago

@jcsp I've actually been trying to get in touch with the author(s) of ducky.py or CDT to get more information. I'll ask around on the devprod channels.

bharathv commented 2 years ago

Two more occurrences: https://buildkite.com/redpanda/vtools/builds/3386#0182cdd0-bf0c-491d-8f68-f302f435c994

graphcareful commented 2 years ago

Leaving an update here: I've been running a few tests from my working PR, https://github.com/redpanda-data/vtools/pull/900

The last run is here: https://buildkite.com/redpanda/vtools/builds/3408#0182d532-5135-4032-9a61-4fa5640f4a1a

With the last batch of changes it looks like all tests are passing, with the exception of the ones that use the kgo verifier.

FAIL: TimeoutError('Timed out waiting for status endpoint KgoVerifierProducer-0-140154481953040 to be available')

[DEBUG - 2022-08-25 15:55:05,324 - kgo_verifier_services - is_ready - lineno:226]: Status endpoint KgoVerifierProducer-0-140154681963376 not ready: HTTPConnectionPool(host='ip-172-31-60-73', port=8080): Max retries exceeded with url: /status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f78421b3070>: Failed to establish a new connection: [Errno 111] Connection refused'))

I'm thinking maybe kgo-verifier isn't ready because it was somehow blocked by the closure of 9092?

EDIT: After seeing some of the recent modifications to the kgo-verifier class, and how the dependency itself was recently updated, I think I just needed to rebase my vtools branch against dev. Re-running the test now.

jcsp commented 2 years ago

Sorry, those "Status endpoint unavailable" ones are my fault, I forgot to do this: https://github.com/redpanda-data/vtools/pull/910

graphcareful commented 2 years ago

Looks like it worked:

SESSION REPORT (ALL TESTS)
  | ducktape version: 0.8.8
  | session_id:       2022-08-26--001
  | run time:         74 minutes 53.254 seconds
  | tests run:        506
  | passed:           451
  | failed:           0
  | ignored:          0
  | opassed:          55
  | ofailed:          0

graphcareful commented 2 years ago

Manually closing this issue as resolved due to the merge of https://github.com/redpanda-data/vtools/pull/900. Please re-open if the issue is still observed.