richajaindce opened 1 year ago
Was able to reproduce it on 1M records:
2023-06-26T19:08:12.360008Z INFO ipa::cli::verbosity: Logging setup at level info
2023-06-26T19:51:30.811996Z INFO ipa::cli::playbook::ipa: Starting query after finishing encryption
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FailedHttpRequest { status: 500, reason: "query execution failed: Infrastructure error: Expected to receive RecordId(999424) but hit end of stream" }', /home/ec2-user/workspace/ipa/src/cli/playbook/ipa.rs:109:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2023-06-26T19:51:30.901961Z INFO ipa_query: ipa::query::runner::ipa: new
2023-06-26T19:58:15.432130Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-06-26T19:58:15.432225Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-06-26T19:58:15.432232Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-06-26T19:58:15.432246Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-06-26T19:58:15.432288Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-06-26T19:58:15.645819Z INFO ipa_query: ipa::query::runner::ipa: close time.busy=395s time.idle=9.90s
2023-06-26T19:58:15.649718Z ERROR tower_http::trace::on_failure: response failed classification=Status code: 500 Internal Server Error latency=404748 ms
2023-06-26T19:08:06.704294Z INFO ipa::cli::verbosity: Logging setup at level info
2023-06-26T19:08:06.706192Z INFO ipa::net::server: server listening on https://0.0.0.0:443
2023-06-26T19:51:30.901678Z INFO ipa_query: ipa::query::runner::ipa: new
2023-06-26T19:58:15.505650Z INFO ipa_query:modulus_conversion{bits=40 parallel=3 gate=protocol/malicious_protocol/mod_conv_match_key}: ipa::protocol::modulus_conversion::convert_shares: new
It is interesting that it fails at record 999,424 out of 1M. Possibly there is a bug where the end of stream is propagated before flushing all of the pending data? The instance in #727 is also very close to end-of-stream (99,096 out of 100k).
It seems unlikely that the proximity to the end of the stream is a coincidence, but if so, it could be that the transfer of inputs is bursty and the connection ends up idle long enough for something to time out.
I've been running a 100k query for 6 hours outside of Docker and with no match key encryption. Not a single failure. Starting another 6-10 hour run with match key encryption.
This test has been running over the weekend without a single failure. I likely need Docker to reproduce this failure.
Here is what I know so far:
- Running with host networking (`--network=host`) eliminates the issue.
- There is an abrupt connection close event (conntrack shows it), but I don't know yet what triggers it. Maybe a tcpdump can help here (see the capture sketch below).

Continuing my investigation this week.
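For context, a capture setup along these lines would produce the kind of conntrack log and packet trace used later in this thread; the interface, port, and file paths are assumptions, not the exact commands that were run:

```bash
# follow conntrack events for the helper port, with timestamps, and keep a log
sudo conntrack -E -o extended,timestamp -p tcp --dport 443 | tee ~/workspace/conntrack.log

# keep a raw packet capture of the same traffic so an abrupt close can be inspected afterwards
sudo tcpdump -i eth0 -nn -w /tmp/ipa-443.pcap 'tcp port 443'
```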
2023-07-11T19:08:28.725594Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-11T19:08:28.725649Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-11T19:08:28.725663Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-11T19:08:28.725667Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-11T19:08:28.725671Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
So at 7:08 PM, H1 failed to read the stream from one of its peers; I found it to be H3 because there is a matching conntrack entry on H3.
Conntrack from H3.
cat ~/workspace/conntrack.log | grep 40930
[1689102180.035280] [NEW] ipv4 2 tcp 6 120 SYN_SENT src=172.17.0.2 dst=172.31.49.57 sport=40930 dport=443 [UNREPLIED] src=172.31.49.57 dst=172.31.56.177 sport=443 dport=40930
[1689102180.035437] [UPDATE] ipv4 2 tcp 6 60 SYN_RECV src=172.17.0.2 dst=172.31.49.57 sport=40930 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=40930
[1689102180.035501] [UPDATE] ipv4 2 tcp 6 432000 ESTABLISHED src=172.17.0.2 dst=172.31.49.57 sport=40930 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=40930 [ASSURED]
# 7:08:28.720 PM
[1689102508.720026] [UPDATE] ipv4 2 tcp 6 10 CLOSE src=172.17.0.2 dst=172.31.49.57 sport=40930 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=40930 [ASSURED]
[1689102569.093182] [DESTROY] ipv4 2 tcp 6 src=172.17.0.2 dst=172.31.49.57 sport=40930 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=40930 [ASSURED]
So there is an abrupt close, and I don't have any explanation other than an RST packet.
Set up firewall rules to log invalid packets:
- H1: 172.31.49.57
- H2: 172.31.49.227
- H3: 172.31.56.177
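The exact rules aren't shown in the thread, but logging invalid packets presumably looks something like this (chain and log prefix are assumptions, chosen to match the kernel log line further down):

```bash
# log packets that conntrack classifies as INVALID
sudo iptables -I INPUT -m conntrack --ctstate INVALID -j LOG --log-prefix "INVALID "
sudo iptables -I FORWARD -m conntrack --ctstate INVALID -j LOG --log-prefix "INVALID "

# watch the kernel log for matches
sudo journalctl -kf | grep INVALID
```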
As usual, H1 (172.31.49.57) received a connection reset error:
2023-07-12T00:58:22.383516Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-12T00:58:22.383535Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
A tcpdump on H1 captured the RST packet coming from H3:
00:58:22.383307 IP ip-172-31-56-177.us-west-2.compute.internal.51676 > ip-172-31-49-57.us-west-2.compute.internal.https: Flags [R], seq 4008728879, win 0, length 0
On H3, the firewall registered an invalid packet (coming from H2):
Jul 12 00:58:22 ip-172-31-56-177 kernel: INVALID IN=eth0 OUT= MAC=0e:7a:56:b9:8e:a7:0e:df:af:76:80:0d:08:00 SRC=172.31.49.227 DST=172.31.56.177 LEN=40 TOS=0x00 PREC=0x00 TTL=254 ID=0 DF PROTO=TCP SPT=47458 DPT=443 WINDOW=0 RES=0x00 RST URGP=0
Now looking at port 47458: it was a legitimate connection from H2 to H3 (logs from H3):
[1689120844.040548] [NEW] ipv4 2 tcp 6 120 SYN_SENT src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 [UNREPLIED] src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458
[1689120844.040560] [UPDATE] ipv4 2 tcp 6 60 SYN_RECV src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458
[1689120844.040641] [UPDATE] ipv4 2 tcp 6 432000 ESTABLISHED src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]
[1689123502.327115] [UPDATE] ipv4 2 tcp 6 30 LAST_ACK src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]
[1689123502.327145] [UPDATE] ipv4 2 tcp 6 30 LAST_ACK src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]
[1689123502.327150] [UPDATE] ipv4 2 tcp 6 120 TIME_WAIT src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]
# 12:58:22.339 AM
[1689123502.339533] [UPDATE] ipv4 2 tcp 6 10 CLOSE src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]
[1689123720.709178] [DESTROY] ipv4 2 tcp 6 src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]
Likely this packet was the result of a retransmission and arrived after the connection was closed. Connection state for port 51676 on H3:
[1689120844.237540] [NEW] ipv4 2 tcp 6 120 SYN_SENT src=172.17.0.2 dst=172.31.49.57 sport=51676 dport=443 [UNREPLIED] src=172.31.49.57 dst=172.31.56.177 sport=443 dport=51676
[1689120844.237744] [UPDATE] ipv4 2 tcp 6 60 SYN_RECV src=172.17.0.2 dst=172.31.49.57 sport=51676 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=51676
[1689120844.237759] [UPDATE] ipv4 2 tcp 6 432000 ESTABLISHED src=172.17.0.2 dst=172.31.49.57 sport=51676 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=51676 [ASSURED]
[1689123502.383265] [UPDATE] ipv4 2 tcp 6 10 CLOSE src=172.17.0.2 dst=172.31.49.57 sport=51676 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=51676 [ASSURED]
[1689123942.721182] [DESTROY] ipv4 2 tcp 6 src=172.17.0.2 dst=172.31.49.57 sport=51676 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=51676 [ASSURED]
Again an abrupt close. I still don't understand the root cause here.
Another case of a connection reset; this time it was H2 sending an RST to H3. I don't see invalid packets this time.
H3
[1689276555.617185] [NEW] ipv4 2 tcp 6 120 SYN_SENT src=172.17.0.2 dst=172.31.56.177 sport=47904 dport=443 [UNREPLIED] src=172.31.56.177 dst=172.31.49.227 sport=443 dport=47904
[1689276555.617346] [UPDATE] ipv4 2 tcp 6 60 SYN_RECV src=172.17.0.2 dst=172.31.56.177 sport=47904 dport=443 src=172.31.56.177 dst=172.31.49.227 sport=443 dport=47904
[1689276555.617398] [UPDATE] ipv4 2 tcp 6 432000 ESTABLISHED src=172.17.0.2 dst=172.31.56.177 sport=47904 dport=443 src=172.31.56.177 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]
[1689276902.373562] [UPDATE] ipv4 2 tcp 6 10 CLOSE src=172.17.0.2 dst=172.31.56.177 sport=47904 dport=443 src=172.31.56.177 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]
[1689276919.862102] [DESTROY] ipv4 2 tcp 6 src=172.17.0.2 dst=172.31.56.177 sport=47904 dport=443 src=172.31.56.177 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]
H2
[1689276555.617268] [NEW] ipv4 2 tcp 6 120 SYN_SENT src=172.31.49.227 dst=172.31.56.177 sport=47904 dport=443 [UNREPLIED] src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47904
[1689276555.617310] [UPDATE] ipv4 2 tcp 6 60 SYN_RECV src=172.31.49.227 dst=172.31.56.177 sport=47904 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47904
[1689276555.617435] [UPDATE] ipv4 2 tcp 6 432000 ESTABLISHED src=172.31.49.227 dst=172.31.56.177 sport=47904 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]
[1689276902.373641] [UPDATE] ipv4 2 tcp 6 10 CLOSE src=172.31.49.227 dst=172.31.56.177 sport=47904 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]
[1689277000.693187] [DESTROY] ipv4 2 tcp 6 src=172.31.49.227 dst=172.31.56.177 sport=47904 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]
There is one suspicious entry in the conntrack table right before the connection close on H2:
[1689276789.346607] [NEW] ipv4 2 tcp 6 120 SYN_SENT src=170.187.165.139 dst=172.31.56.177 sport=59085 dport=443 [UNREPLIED] src=172.17.0.2 dst=170.187.165.139 sport=443 dport=59085
[1689276789.346675] [UPDATE] ipv4 2 tcp 6 60 SYN_RECV src=170.187.165.139 dst=172.31.56.177 sport=59085 dport=443 src=172.17.0.2 dst=170.187.165.139 sport=443 dport=59085
[1689276900.381212] [DESTROY] ipv4 2 tcp 6 src=170.187.165.139 dst=172.31.56.177 sport=59085 dport=443 src=172.17.0.2 dst=170.187.165.139 sport=443 dport=59085
We are definitely subject to a SYN flood attack, but that should only impact our throughput; AFAIK it cannot cause a connection reset.
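As a generic check (not something from this investigation), the size of the SYN flood and whether SYN cookies are absorbing it can be gauged roughly like this:

```bash
# half-open connections parked on the helper's HTTPS port
ss -n state syn-recv '( sport = :443 )' | wc -l

# SYN cookies (=1) let the kernel shed a flood without consuming backlog slots
sysctl net.ipv4.tcp_syncookies

# cumulative counters for SYN/listen-queue drops
netstat -s | grep -iE 'syn|listen'
```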
There seem to be issues with IP masquerading in the Docker network.
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FailedHttpRequest { status: 500, reason: "query execution failed: Infrastructure error: An error occurred while receiving data from H1/protocol/malicious_protocol/upgrade/upgrade_2d19/upgrade_bit_triple0: Expected to receive RecordId(999126) but hit end of stream" }', /home/ec2-user/workspace/alex-ipa/src/cli/playbook/ipa.rs:109:10
H2 panicked at 17:10:21
2023-07-17T17:10:21.344325Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
H1 (172.31.49.57) sent a packet with the Docker container IP to H2 (172.31.49.227):
17:10:21.331683 IP (tos 0x0, ttl 254, id 0, offset 0, flags [DF], proto TCP (6), length 40)
ip-172-17-0-2.us-west-2.compute.internal.38664 > ip-172-31-49-227.us-west-2.compute.internal.https: Flags [R], cksum 0x4fb5 (correct), seq 472129838, win 0, length 0
I suspected timeouts on the conntrack side, but 5 days (432,000 seconds) should be more than enough for IPA to finish.
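For reference, the conntrack timeout and Docker's masquerading can be inspected roughly as follows; the container IP 172.17.0.2 comes from the logs above, the rest is a generic sketch:

```bash
# conntrack's idle timeout for ESTABLISHED flows (432000 s = 5 days by default)
sysctl net.netfilter.nf_conntrack_tcp_timeout_established

# Docker's MASQUERADE rule for the default bridge subnet
sudo iptables -t nat -S POSTROUTING | grep MASQUERADE

# flows on the helper port that still show the container's source address
sudo conntrack -L -p tcp --dport 443 | grep 172.17.0.2
```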
Running `target/release/report_collector --network network.toml --input-file /tmp/alex-ipa-events-1000000.txt malicious-ipa --max-breakdown-key 20 --per-user-credit-cap 1 --plaintext-match-keys`
2023-07-17T19:57:11.359132Z INFO ipa::cli::verbosity: Logging setup at level info
2023-07-17T19:57:13.022025Z INFO ipa::cli::playbook::ipa: Starting query after finishing encryption
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FailedHttpRequest { status: 500, reason: "query execution failed: Infrastructure error: An error occurred while receiving data from H3/protocol/binary_validator/malicious_protocol/upgrade/upgrade_2d6: Expected to receive RecordId(999424) but hit end of stream" }', /home/ec2-user/workspace/alex-ipa/src/cli/playbook/ipa.rs:109:10
2023-07-17T20:57:46.324841Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-17T20:57:46.324861Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-17T20:57:46.324864Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
[1689624726.688232] [NEW] ipv4 2 tcp 6 120 SYN_SENT src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 [UNREPLIED] src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588
[1689624726.688245] [UPDATE] ipv4 2 tcp 6 60 SYN_RECV src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588
[1689624726.688357] [UPDATE] ipv4 2 tcp 6 432000 ESTABLISHED src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689624726.953842] [NEW] ipv4 2 tcp 6 120 SYN_SENT src=172.17.0.2 dst=172.31.49.57 sport=39750 dport=443 [UNREPLIED] src=172.31.49.57 dst=172.31.56.177 sport=443 dport=39750
[1689624726.954003] [UPDATE] ipv4 2 tcp 6 60 SYN_RECV src=172.17.0.2 dst=172.31.49.57 sport=39750 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=39750
[1689624726.954007] [UPDATE] ipv4 2 tcp 6 432000 ESTABLISHED src=172.17.0.2 dst=172.31.49.57 sport=39750 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=39750 [ASSURED]
[1689627466.301802] [UPDATE] ipv4 2 tcp 6 120 FIN_WAIT src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689627466.301833] [UPDATE] ipv4 2 tcp 6 60 CLOSE_WAIT src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689627466.301898] [UPDATE] ipv4 2 tcp 6 30 LAST_ACK src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689627466.314260] [UPDATE] ipv4 2 tcp 6 10 CLOSE src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689627466.324625] [UPDATE] ipv4 2 tcp 6 10 CLOSE src=172.17.0.2 dst=172.31.49.57 sport=39750 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=39750 [ASSURED]
[1689627661.925182] [DESTROY] ipv4 2 tcp 6 src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689627698.757177] [DESTROY] ipv4 2 tcp 6 src=172.17.0.2 dst=172.31.49.57 sport=39750 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=39750 [ASSURED]
Again the timeline matches H2 sending a non-masqueraded packet to H3. For some reason H3 kills the connection to H1 right after that.
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
20:57:46.301798 IP (tos 0x0, ttl 254, id 0, offset 0, flags [DF], proto TCP (6), length 40)
ip-172-17-0-2.us-west-2.compute.internal.45588 > ip-172-31-56-177.us-west-2.compute.internal.https: Flags [R], cksum 0xe703 (correct), seq 1255291223, win 0, length 0
20:57:46.301808 IP (tos 0x0, ttl 254, id 0, offset 0, flags [DF], proto TCP (6), length 40)
ip-172-17-0-2.us-west-2.compute.internal.45588 > ip-172-31-56-177.us-west-2.compute.internal.https: Flags [R], cksum 0xd60b (correct), seq 1255295567, win 0, length 0
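A filter along these lines (a sketch; the interface is an assumption, the addresses are the Docker bridge defaults seen in the dumps above) would flag any packet leaving the host with the container IP as source, i.e. traffic that escaped Docker's MASQUERADE rule:

```bash
# catch egress packets that kept the bridge-internal source address
sudo tcpdump -i eth0 -nn 'src host 172.17.0.2 and not dst net 172.17.0.0/16'
```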
A few observations:
- There is a suggestion to do `iptables -I FORWARD -m conntrack --ctstate INVALID -j DROP`, which doesn't actually fix anything, but may conceal the problem.
- `docker network create --subnet <...>` and then specify the network when creating/running the container. Not a fix, but might work around the problem and offer more data.
- The AWS VPC used by our helpers allows connections to port 443 from everywhere. We are seeing spam requests to connect. Removing this rule eliminates the issue.

This seems like a pretty reasonable workaround, no? These spam connections may be the cause of the unexpected RST packets. Then I guess the question would be, is there something weird about the setup that, if changed, would prevent the spam connections from causing unexpected RSTs (e.g. NAT the remote IP address on the spam connections), or would block the extra RSTs from causing problems (likely the `iptables` command above, or something similar).
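Spelled out as commands, the options above might look roughly like this; the subnet, network name, image name, and security group id are placeholders, not values from this setup:

```bash
# 1. Drop packets that conntrack marks INVALID instead of letting them provoke RSTs
sudo iptables -I FORWARD -m conntrack --ctstate INVALID -j DROP

# 2. Put the helper container on a user-defined network instead of the default bridge
docker network create --subnet 172.28.0.0/16 ipa-net
docker run --network ipa-net ipa-helper:latest   # hypothetical image name

# 3. Restrict inbound 443 in the VPC security group to the other helpers only
aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 443 --cidr 172.31.49.227/32   # repeat per peer helper
```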
> There is a suggestion to do `iptables -I FORWARD -m conntrack --ctstate INVALID -j DROP`, which doesn't actually fix anything, but may conceal the problem.
Yeah, I saw these suggestions as well, but I am still seeing connection resets even after I set up the firewall to drop invalid packets. It could still be SNAT, but at this point I don't see duplicated ports (for instance, I suspected the spammer might pick the same source port and somehow collide with an actual helper port, but there are no duplicated entries in conntrack) or any other evidence that would suggest a collision.
> These spam connections may be the cause of the unexpected RST packets.
Do you have a theory for how they might cause it? I confirmed that it does not seem like an RST attack: the RST packet looks legitimate, it originated on the helper's side (I see it on both sides), and the sequence number is valid as well.
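One generic way to double-check that: capture RSTs with absolute sequence numbers on both ends and compare them against what the receiver last acknowledged (interface and port are assumptions):

```bash
# -S prints absolute sequence numbers, so the RST's seq can be compared across hosts
sudo tcpdump -i eth0 -nn -S 'tcp port 443 and tcp[tcpflags] & tcp-rst != 0'
```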
> It is interesting that it fails at record 999,424 out of 1M. Possibly there is a bug where the end of stream is propagated before flushing all of the pending data? The instance in #727 is also very close to end-of-stream (99,096 out of 100k).
> It seems unlikely that the proximity to the end of the stream is a coincidence, but if so, it could be that the transfer of inputs is bursty and the connection ends up idle long enough for something to time out.
It is happening again at the same record, 999,424, even with `--network=host`. This time it happened on H2 (Linode), with H1 (AWS) as the sender.
2023-08-10T06:11:44.931690Z ERROR ipa_query{sz=1000000}: ipa::error: ThreadId(4) "tokio-runtime-worker" panicked at 'called `Result::unwrap()` on an `Err` value: InfraError(ReceiveError { source: H1, step: "12287", inner: EndOfStream { record_id: RecordId(999424) } })', src/protocol/ipa/mod.rs:346:6
Not sure what's wrong, but running IPA with user cap = 1 on real helpers failed for 100k rows and also failed on the AWS cluster for 1M rows.
Error: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FailedHttpRequest { status: 500, reason: "query execution failed: runtime error" }', /home/ec2-user/workspace/ipa/src/cli/playbook/ipa.rs:109:10