private-attribution / ipa

A raw implementation of Interoperable Private Attribution
MIT License

query execution failed: runtime error #728

Open · richajaindce opened this issue 1 year ago

richajaindce commented 1 year ago

Not sure what's wrong, but running IPA with user cap = 1 on real helpers failed for 100k rows, and it also failed on the AWS cluster for 1M rows.

Error: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FailedHttpRequest { status: 500, reason: "query execution failed: runtime error" }', /home/ec2-user/workspace/ipa/src/cli/playbook/ipa.rs:109:10

akoshelev commented 1 year ago

Was able to reproduce it on 1M records.

RC (report collector)

2023-06-26T19:08:12.360008Z  INFO ipa::cli::verbosity: Logging setup at level info
2023-06-26T19:51:30.811996Z  INFO ipa::cli::playbook::ipa: Starting query after finishing encryption
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FailedHttpRequest { status: 500, reason: "query execution failed: Infrastructure error: Expected to receive RecordId(999424) but hit end of stream" }', /home/ec2-user/workspace/ipa/src/cli/playbook/ipa.rs:109:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

H1

2023-06-26T19:51:30.901961Z  INFO ipa_query: ipa::query::runner::ipa: new
2023-06-26T19:58:15.432130Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-06-26T19:58:15.432225Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-06-26T19:58:15.432232Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-06-26T19:58:15.432246Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-06-26T19:58:15.432288Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-06-26T19:58:15.645819Z  INFO ipa_query: ipa::query::runner::ipa: close time.busy=395s time.idle=9.90s
2023-06-26T19:58:15.649718Z ERROR tower_http::trace::on_failure: response failed classification=Status code: 500 Internal Server Error latency=404748 ms

H2

2023-06-26T19:08:06.704294Z  INFO ipa::cli::verbosity: Logging setup at level info
2023-06-26T19:08:06.706192Z  INFO ipa::net::server: server listening on https://0.0.0.0:443
2023-06-26T19:51:30.901678Z  INFO ipa_query: ipa::query::runner::ipa: new
2023-06-26T19:58:15.505650Z  INFO ipa_query:modulus_conversion{bits=40 parallel=3 gate=protocol/malicious_protocol/mod_conv_match_key}: ipa::protocol::modulus_conversion::convert_shares: new

H3

2023-06-26T19:08:06.704294Z  INFO ipa::cli::verbosity: Logging setup at level info
2023-06-26T19:08:06.706192Z  INFO ipa::net::server: server listening on https://0.0.0.0:443
2023-06-26T19:51:30.901678Z  INFO ipa_query: ipa::query::runner::ipa: new
2023-06-26T19:58:15.505650Z  INFO ipa_query:modulus_conversion{bits=40 parallel=3 gate=protocol/malicious_protocol/mod_conv_match_key}: ipa::protocol::modulus_conversion::convert_shares: new

andyleiserson commented 1 year ago

It is interesting that it fails at record 999,424 out of 1M. Possibly there is a bug where the end of stream is propagated before flushing all of the pending data? The instance in #727 is also very close to end-of-stream (99,096 out of 100k).

It seems unlikely that the proximity to the end of the stream is a coincidence, but if it is, it could be that the transfer of inputs is bursty and the connection ends up idle long enough for something to time out.

akoshelev commented 1 year ago

I've been running a 100k query for 6 hours outside of Docker and with no match key encryption. Not a single failure. Starting another 6-10 hour run with match key encryption.

akoshelev commented 1 year ago

This test has been running over the weekend without a single failure. I likely need Docker to reproduce this failure.

akoshelev commented 1 year ago

Here is what I know so far:

There is an abrupt connection close event (conntrack shows it), but I don't know yet what triggers it. Maybe a tcpdump capture can help here.
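Something along these lines should be enough to see which side emits the reset and with what sequence number (a sketch; the interface name is an assumption):

# capture only TCP packets on the helper port that carry the RST flag
tcpdump -i eth0 -nn -tttt 'tcp port 443 and tcp[tcpflags] & tcp-rst != 0'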

akoshelev commented 1 year ago

Continuing my investigation this week

2023-07-11T19:08:28.725594Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-11T19:08:28.725649Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-11T19:08:28.725663Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-11T19:08:28.725667Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-11T19:08:28.725671Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset

so at 7:08 PM, H1 failed to read the stream from H3. I confirmed it was H3 because there is a matching conntrack entry on H3.

Conntrack from H3.

cat ~/workspace/conntrack.log  | grep 40930

[1689102180.035280]     [NEW] ipv4     2 tcp      6 120 SYN_SENT src=172.17.0.2 dst=172.31.49.57 sport=40930 dport=443 [UNREPLIED] src=172.31.49.57 dst=172.31.56.177 sport=443 dport=40930
[1689102180.035437]  [UPDATE] ipv4     2 tcp      6 60 SYN_RECV src=172.17.0.2 dst=172.31.49.57 sport=40930 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=40930
[1689102180.035501]  [UPDATE] ipv4     2 tcp      6 432000 ESTABLISHED src=172.17.0.2 dst=172.31.49.57 sport=40930 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=40930 [ASSURED]
# 7:08:28.720 PM
[1689102508.720026]  [UPDATE] ipv4     2 tcp      6 10 CLOSE src=172.17.0.2 dst=172.31.49.57 sport=40930 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=40930 [ASSURED]
[1689102569.093182] [DESTROY] ipv4     2 tcp      6 src=172.17.0.2 dst=172.31.49.57 sport=40930 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=40930 [ASSURED]

so there is an abrupt close, and I don't have any explanation for it other than an RST packet.
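For context, the conntrack.log referenced above was presumably collected by streaming connection-tracking events, something like the following (the exact options are an assumption, but -o timestamp matches the epoch prefixes in the log):

# stream conntrack events for port 443, with epoch timestamps, into a log file
conntrack -E -p tcp --dport 443 -o timestamp > ~/workspace/conntrack.log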

akoshelev commented 1 year ago

Set up firewall rules to log invalid packets
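The logging rule was presumably along these lines (a sketch; the chain and the prefix are assumptions, chosen to match the "INVALID" kernel log entries below):

# log packets that conntrack classifies as INVALID before any accept/drop decision is made
iptables -I INPUT -m conntrack --ctstate INVALID -j LOG --log-prefix "INVALID "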

As usual, H1 (172.31.49.57) received the connection reset error:

2023-07-12T00:58:22.383516Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-12T00:58:22.383535Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset

tcpdump on H1 caught the RST packet coming from H3:

00:58:22.383307 IP ip-172-31-56-177.us-west-2.compute.internal.51676 > ip-172-31-49-57.us-west-2.compute.internal.https: Flags [R], seq 4008728879, win 0, length 0

on H3, the firewall registered an invalid packet (coming from H2):

Jul 12 00:58:22 ip-172-31-56-177 kernel: INVALID =eth0 OUT= MAC=0e:7a:56:b9:8e:a7:0e:df:af:76:80:0d:08:00 SRC=172.31.49.227 DST=172.31.56.177 LEN=40 TOS=0x00 PREC=0x00 TTL=254 ID=0 DF PROTO=TCP SPT=47458 DPT=443 WINDOW=0 RES=0x00 RST URGP=0

now looking at port 47458, it was a legitimate connection from H2 to H3 (conntrack logs from H3):

[1689120844.040548]     [NEW] ipv4     2 tcp      6 120 SYN_SENT src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 [UNREPLIED] src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458
[1689120844.040560]  [UPDATE] ipv4     2 tcp      6 60 SYN_RECV src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458
[1689120844.040641]  [UPDATE] ipv4     2 tcp      6 432000 ESTABLISHED src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]
[1689123502.327115]  [UPDATE] ipv4     2 tcp      6 30 LAST_ACK src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]
[1689123502.327145]  [UPDATE] ipv4     2 tcp      6 30 LAST_ACK src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]
[1689123502.327150]  [UPDATE] ipv4     2 tcp      6 120 TIME_WAIT src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]
# 12:58:22.339 AM
[1689123502.339533]  [UPDATE] ipv4     2 tcp      6 10 CLOSE src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]
[1689123720.709178] [DESTROY] ipv4     2 tcp      6 src=172.31.49.227 dst=172.31.56.177 sport=47458 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47458 [ASSURED]

likely this packet was the result of a retransmission and arrived after the connection was closed. Connection state for port 51676 on H3:

[1689120844.237540]     [NEW] ipv4     2 tcp      6 120 SYN_SENT src=172.17.0.2 dst=172.31.49.57 sport=51676 dport=443 [UNREPLIED] src=172.31.49.57 dst=172.31.56.177 sport=443 dport=51676
[1689120844.237744]  [UPDATE] ipv4     2 tcp      6 60 SYN_RECV src=172.17.0.2 dst=172.31.49.57 sport=51676 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=51676
[1689120844.237759]  [UPDATE] ipv4     2 tcp      6 432000 ESTABLISHED src=172.17.0.2 dst=172.31.49.57 sport=51676 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=51676 [ASSURED]
[1689123502.383265]  [UPDATE] ipv4     2 tcp      6 10 CLOSE src=172.17.0.2 dst=172.31.49.57 sport=51676 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=51676 [ASSURED]
[1689123942.721182] [DESTROY] ipv4     2 tcp      6 src=172.17.0.2 dst=172.31.49.57 sport=51676 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=51676 [ASSURED]

again an abrupt close. I still don't understand the root cause here.

akoshelev commented 1 year ago

Another case of a connection reset; this time it was H2 sending an RST to H3. I don't see invalid packets this time.

H3

[1689276555.617185]     [NEW] ipv4     2 tcp      6 120 SYN_SENT src=172.17.0.2 dst=172.31.56.177 sport=47904 dport=443 [UNREPLIED] src=172.31.56.177 dst=172.31.49.227 sport=443 dport=47904
[1689276555.617346]  [UPDATE] ipv4     2 tcp      6 60 SYN_RECV src=172.17.0.2 dst=172.31.56.177 sport=47904 dport=443 src=172.31.56.177 dst=172.31.49.227 sport=443 dport=47904
[1689276555.617398]  [UPDATE] ipv4     2 tcp      6 432000 ESTABLISHED src=172.17.0.2 dst=172.31.56.177 sport=47904 dport=443 src=172.31.56.177 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]
[1689276902.373562]  [UPDATE] ipv4     2 tcp      6 10 CLOSE src=172.17.0.2 dst=172.31.56.177 sport=47904 dport=443 src=172.31.56.177 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]
[1689276919.862102] [DESTROY] ipv4     2 tcp      6 src=172.17.0.2 dst=172.31.56.177 sport=47904 dport=443 src=172.31.56.177 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]

H2

[1689276555.617268]     [NEW] ipv4     2 tcp      6 120 SYN_SENT src=172.31.49.227 dst=172.31.56.177 sport=47904 dport=443 [UNREPLIED] src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47904
[1689276555.617310]  [UPDATE] ipv4     2 tcp      6 60 SYN_RECV src=172.31.49.227 dst=172.31.56.177 sport=47904 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47904
[1689276555.617435]  [UPDATE] ipv4     2 tcp      6 432000 ESTABLISHED src=172.31.49.227 dst=172.31.56.177 sport=47904 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]
[1689276902.373641]  [UPDATE] ipv4     2 tcp      6 10 CLOSE src=172.31.49.227 dst=172.31.56.177 sport=47904 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]
[1689277000.693187] [DESTROY] ipv4     2 tcp      6 src=172.31.49.227 dst=172.31.56.177 sport=47904 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=47904 [ASSURED]

there is one suspicious entry in the conntrack table right before the connection close on H2:

[1689276789.346607]     [NEW] ipv4     2 tcp      6 120 SYN_SENT src=170.187.165.139 dst=172.31.56.177 sport=59085 dport=443 [UNREPLIED] src=172.17.0.2 dst=170.187.165.139 sport=443 dport=59085
[1689276789.346675]  [UPDATE] ipv4     2 tcp      6 60 SYN_RECV src=170.187.165.139 dst=172.31.56.177 sport=59085 dport=443 src=172.17.0.2 dst=170.187.165.139 sport=443 dport=59085
[1689276900.381212] [DESTROY] ipv4     2 tcp      6 src=170.187.165.139 dst=172.31.56.177 sport=59085 dport=443 src=172.17.0.2 dst=170.187.165.139 sport=443 dport=59085

we are definitely subject to a SYN flood, but that should only impact our throughput; AFAIK it cannot cause a connection reset.
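For completeness, a quick way to see how many half-open connections the spammers leave behind (a minimal sketch, not from the original investigation):

# count sockets stuck in SYN-RECV on the helper port
ss -tan state syn-recv '( sport = :443 )' | wc -l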

akoshelev commented 1 year ago

There seem to be issues with IP masquerading in the Docker network.

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FailedHttpRequest { status: 500, reason: "query execution failed: Infrastructure error: An error occurred while receiving data from H1/protocol/malicious_protocol/upgrade/upgrade_2d19/upgrade_bit_triple0: Expected to receive RecordId(999126) but hit end of stream" }', /home/ec2-user/workspace/alex-ipa/src/cli/playbook/ipa.rs:109:10

H2 panicked at 17:10:21

2023-07-17T17:10:21.344325Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset

H1 (172.31.49.57) sent a packet with the Docker container IP as its source to H2 (172.31.49.227):

 17:10:21.331683 IP (tos 0x0, ttl 254, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    ip-172-17-0-2.us-west-2.compute.internal.38664 > ip-172-31-49-227.us-west-2.compute.internal.https: Flags [R], cksum 0x4fb5 (correct), seq 472129838, win 0, length 0
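One known netfilter behaviour that would fit these symptoms: packets that conntrack classifies as INVALID skip the nat table, so an RST belonging to a flow whose conntrack entry is already gone (or in a bad state) leaves the host with the raw container source IP instead of being masqueraded. A quick way to inspect the masquerading setup (a sketch; chain and subnet are the Docker defaults):

# Docker normally installs a MASQUERADE rule for the bridge subnet; check that it exists and is being hit
iptables -t nat -L POSTROUTING -n -v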

akoshelev commented 1 year ago

I suspected timeouts on the conntrack side, but 5 days (432000 seconds for established flows) should be more than enough for IPA to finish.
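The relevant kernel timeouts can be checked directly; the numbers in the conntrack entries (432000 for ESTABLISHED, 10 for CLOSE) come from these sysctls:

# conntrack TCP timeouts, in seconds
sysctl net.netfilter.nf_conntrack_tcp_timeout_established
sysctl net.netfilter.nf_conntrack_tcp_timeout_close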

     Running `target/release/report_collector --network network.toml --input-file /tmp/alex-ipa-events-1000000.txt malicious-ipa --max-breakdown-key 20 --per-user-credit-cap 1 --plaintext-match-keys`
2023-07-17T19:57:11.359132Z  INFO ipa::cli::verbosity: Logging setup at level info
2023-07-17T19:57:13.022025Z  INFO ipa::cli::playbook::ipa: Starting query after finishing encryption
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FailedHttpRequest { status: 500, reason: "query execution failed: Infrastructure error: An error occurred while receiving data from H3/protocol/binary_validator/malicious_protocol/upgrade/upgrade_2d6: Expected to receive RecordId(999424) but hit end of stream" }', /home/ec2-user/workspace/alex-ipa/src/cli/playbook/ipa.rs:109:10
2023-07-17T20:57:46.324841Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-17T20:57:46.324861Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
2023-07-17T20:57:46.324864Z ERROR ipa_query: ipa::helpers::transport::receive: error reading records: error reading a body from connection: connection reset
[1689624726.688232]     [NEW] ipv4     2 tcp      6 120 SYN_SENT src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 [UNREPLIED] src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588
[1689624726.688245]  [UPDATE] ipv4     2 tcp      6 60 SYN_RECV src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588
[1689624726.688357]  [UPDATE] ipv4     2 tcp      6 432000 ESTABLISHED src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689624726.953842]     [NEW] ipv4     2 tcp      6 120 SYN_SENT src=172.17.0.2 dst=172.31.49.57 sport=39750 dport=443 [UNREPLIED] src=172.31.49.57 dst=172.31.56.177 sport=443 dport=39750
[1689624726.954003]  [UPDATE] ipv4     2 tcp      6 60 SYN_RECV src=172.17.0.2 dst=172.31.49.57 sport=39750 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=39750
[1689624726.954007]  [UPDATE] ipv4     2 tcp      6 432000 ESTABLISHED src=172.17.0.2 dst=172.31.49.57 sport=39750 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=39750 [ASSURED]
[1689627466.301802]  [UPDATE] ipv4     2 tcp      6 120 FIN_WAIT src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689627466.301833]  [UPDATE] ipv4     2 tcp      6 60 CLOSE_WAIT src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689627466.301898]  [UPDATE] ipv4     2 tcp      6 30 LAST_ACK src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689627466.314260]  [UPDATE] ipv4     2 tcp      6 10 CLOSE src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689627466.324625]  [UPDATE] ipv4     2 tcp      6 10 CLOSE src=172.17.0.2 dst=172.31.49.57 sport=39750 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=39750 [ASSURED]
[1689627661.925182] [DESTROY] ipv4     2 tcp      6 src=172.31.49.227 dst=172.31.56.177 sport=45588 dport=443 src=172.17.0.2 dst=172.31.49.227 sport=443 dport=45588 [ASSURED]
[1689627698.757177] [DESTROY] ipv4     2 tcp      6 src=172.17.0.2 dst=172.31.49.57 sport=39750 dport=443 src=172.31.49.57 dst=172.31.56.177 sport=443 dport=39750 [ASSURED]

again the timeline matches H2 sending a non-masqueraded packet to H3. For some reason H3 kills its connection to H1 right after it:

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
20:57:46.301798 IP (tos 0x0, ttl 254, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    ip-172-17-0-2.us-west-2.compute.internal.45588 > ip-172-31-56-177.us-west-2.compute.internal.https: Flags [R], cksum 0xe703 (correct), seq 1255291223, win 0, length 0
20:57:46.301808 IP (tos 0x0, ttl 254, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    ip-172-17-0-2.us-west-2.compute.internal.45588 > ip-172-31-56-177.us-west-2.compute.internal.https: Flags [R], cksum 0xd60b (correct), seq 1255295567, win 0, length 0

andyleiserson commented 1 year ago

A few observations:

> AWS VPC used by our helpers allows connections to port 443 from everywhere. We are seeing spam requests to connect. Removing this rule eliminates the issue

This seems like a pretty reasonable workaround, no? These spam connections may be the cause of the unexpected RST packets. Then I guess the question would be, is there something weird about the setup that, if changed, would prevent the spam connections from causing unexpected RSTs (e.g. NAT remote IP address on the spam connections), or would block the extra RSTs from causing problems (likely the iptables command above, or something similar).
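A host-level equivalent of tightening the security group, in case that is easier to test (a sketch; helper IPs are taken from the logs above, rule order matters, and the report collector's source address would need to be allowed as well):

# accept TLS only from the other helpers, drop everything else on 443
iptables -I INPUT -p tcp --dport 443 -s 172.31.49.57 -j ACCEPT
iptables -I INPUT -p tcp --dport 443 -s 172.31.49.227 -j ACCEPT
iptables -I INPUT -p tcp --dport 443 -s 172.31.56.177 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j DROP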

akoshelev commented 1 year ago

> There is a suggestion to do `iptables -I FORWARD -m conntrack --ctstate INVALID -j DROP`, which doesn't actually fix anything, but may conceal the problem.

Yeah, I saw these suggestions as well, but I am still seeing connection resets even after I set up the firewall to drop invalid packets. It could still be SNAT, but at this point I don't see duplicated ports (for instance, I suspected the spammer might pick the same source port and somehow collide with an actual helper port - no duplicated entries in conntrack) or any other evidence that would suggest a collision.

> These spam connections may be the cause of the unexpected RST packets.

Do you have a theory for how they might cause it? I confirmed that it does not look like an RST attack - the RST packet looks legit, it originated on the helper's side (I see it on both sides), and the sequence number is valid as well.

richajaindce commented 1 year ago

> It is interesting that it fails at record 999,424 out of 1M. Possibly there is a bug where the end of stream is propagated before flushing all of the pending data? The instance in #727 is also very close to end-of-stream (99,096 out of 100k).
>
> It seems unlikely that the proximity to the end of the stream is a coincidence, but if it is, it could be that the transfer of inputs is bursty and the connection ends up idle long enough for something to time out.

It is happening again at the same record 999424, even with `--network=host`. This time it happened on H2 (Linode) with H1 (AWS) as the sender:

2023-08-10T06:11:44.931690Z ERROR ipa_query{sz=1000000}: ipa::error: ThreadId(4) "tokio-runtime-worker" panicked at 'called Result::unwrap() on an Err value: InfraError(ReceiveError { source: H1, step: "12287", inner: EndOfStream { record_id: RecordId(999424) } })', src/protocol/ipa/mod.rs:346:6