private-attribution / ipa

A raw implementation of Interoperable Private Attribution
MIT License

TLS issues observed on long running queries #1226

Open eriktaubeneck opened 1 month ago

eriktaubeneck commented 1 month ago

While testing across AWS and GCloud, I noticed some errors that appear to be related to TLS, although the queries do complete. At first the query seemed to be stalling, but it now looks like it simply takes considerably longer across clouds (as expected when going over the internet).

As I continued testing, I was able to replicate this on a long run within a single cloud.

Here is a report for an all-AWS cluster running 10M rows (ignore the total time; I ran this on smaller servers).

Given that the query actually completes, this may be a non-issue caused by routine network events such as dropped packets, in which case we can close the issue.

Errors

Helper 1 Error Summary

error_message                                                       count
BadCertificate                                                      1
CertificateUnknown                                                  151
Connection reset by peer (os error 104)                             3
deadline has elapsed                                                1
NoCipherSuitesInCommon                                              11
peer doesn't support any known protocol                             4
received corrupt message of type InvalidContentType                 9
received corrupt message of type MissingData("ClientHelloPayload")  3
SignatureAlgorithmsExtensionRequired                                8
tls handshake eof                                                   26
Tls12NotOffered                                                     4
UnknownCA                                                           1
UnknownIssuer                                                       1

Helper 2 Error Summary

error_message                                                       count
BadCertificate                                                      1
Connection reset by peer (os error 104)                             2
NoCipherSuitesInCommon                                              6
peer doesn't support any known protocol                             2
received corrupt message of type InvalidContentType                 39
received corrupt message of type MissingData("ClientHelloPayload")  4
SignatureAlgorithmsExtensionRequired                                4
tls handshake eof                                                   17
Tls12NotOffered                                                     2

Helper 3 Error Summary

error_message                                                       count
BadCertificate                                                      1
Connection reset by peer (os error 104)                             2
NoCipherSuitesInCommon                                              12
peer doesn't support any known protocol                             6
received corrupt message of type InvalidContentType                 12
SignatureAlgorithmsExtensionRequired                                4
tls handshake eof                                                   38
Tls12NotOffered                                                     6
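
To make the three summaries easier to compare, here is a minimal Python sketch that totals the counts per helper and per error message. The numbers are copied from the tables above; the names `ERRORS`, `totals`, and `by_message` are made up for this illustration and are not part of the IPA codebase.

```python
# Counts copied verbatim from the three error summaries in this issue.
ERRORS = {
    "Helper 1": {
        "BadCertificate": 1,
        "CertificateUnknown": 151,
        "Connection reset by peer (os error 104)": 3,
        "deadline has elapsed": 1,
        "NoCipherSuitesInCommon": 11,
        "peer doesn't support any known protocol": 4,
        "received corrupt message of type InvalidContentType": 9,
        'received corrupt message of type MissingData("ClientHelloPayload")': 3,
        "SignatureAlgorithmsExtensionRequired": 8,
        "tls handshake eof": 26,
        "Tls12NotOffered": 4,
        "UnknownCA": 1,
        "UnknownIssuer": 1,
    },
    "Helper 2": {
        "BadCertificate": 1,
        "Connection reset by peer (os error 104)": 2,
        "NoCipherSuitesInCommon": 6,
        "peer doesn't support any known protocol": 2,
        "received corrupt message of type InvalidContentType": 39,
        'received corrupt message of type MissingData("ClientHelloPayload")': 4,
        "SignatureAlgorithmsExtensionRequired": 4,
        "tls handshake eof": 17,
        "Tls12NotOffered": 2,
    },
    "Helper 3": {
        "BadCertificate": 1,
        "Connection reset by peer (os error 104)": 2,
        "NoCipherSuitesInCommon": 12,
        "peer doesn't support any known protocol": 6,
        "received corrupt message of type InvalidContentType": 12,
        "SignatureAlgorithmsExtensionRequired": 4,
        "tls handshake eof": 38,
        "Tls12NotOffered": 6,
    },
}


def totals(errors):
    """Return the total number of logged errors for each helper."""
    return {helper: sum(counts.values()) for helper, counts in errors.items()}


def by_message(errors):
    """Sum each error message across all helpers."""
    combined = {}
    for counts in errors.values():
        for message, n in counts.items():
            combined[message] = combined.get(message, 0) + n
    return combined


if __name__ == "__main__":
    print(totals(ERRORS))
    # List messages from most to least frequent across the cluster.
    for message, n in sorted(by_message(ERRORS).items(), key=lambda kv: -kv[1]):
        print(f"{n:4d}  {message}")
```

Nearly all of these messages are handshake-level failures, and `CertificateUnknown` on Helper 1 is the single largest bucket.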

Run time stats

Helper 3 Summary - Query Size 10000000

step                        % idle       % busy    % total
shuffle_inputs              0.10%        4.22%     0.32%
compute_prf_for_inputs      0.16%        25.62%    1.49%
histograms_ranges_sortkeys  0.00%        0.06%     0.00%
attribute_cap_aggregate     0.17%        9.18%     0.64%
attribute_cap_aggregate     49.20%       33.96%    48.40%
attribute_cap_aggregate     50.26%       26.91%    49.04%
apply_dp_noise              0.11%        0.06%     0.11%
total                       13h44m14.7s  45m24.2s  14h29m38.9s

Helper 1 Summary - Query Size 10000000

step                        % idle       % busy    % total
shuffle_inputs              0.00%        4.78%     0.25%
compute_prf_for_inputs      0.25%        25.78%    1.56%
histograms_ranges_sortkeys  0.00%        0.06%     0.00%
attribute_cap_aggregate     0.18%        9.14%     0.64%
attribute_cap_aggregate     49.21%       33.58%    48.40%
attribute_cap_aggregate     50.26%       26.60%    49.04%
apply_dp_noise              0.11%        0.06%     0.11%
total                       13h44m58.1s  44m40.2s  14h29m38.3s

Helper 2 Summary - Query Size 10000000

step                        % idle       % busy    % total
shuffle_inputs              0.08%        4.65%     0.32%
compute_prf_for_inputs      0.18%        25.72%    1.49%
histograms_ranges_sortkeys  0.00%        0.06%     0.00%
attribute_cap_aggregate     0.16%        9.33%     0.64%
attribute_cap_aggregate     49.20%       33.64%    48.40%
attribute_cap_aggregate     50.26%       26.54%    49.04%
apply_dp_noise              0.11%        0.05%     0.11%
total                       13h44m50.3s  44m50.1s  14h29m40.4s
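
The `total` rows can be turned into a busy-versus-idle ratio with a small Python sketch. The duration strings are copied from the three summaries; the helper names `parse_duration` and `busy_fraction` are invented for this illustration, and the assumed format (optional `h`, `m`, `s` parts) is inferred only from the values shown here.

```python
import re


def parse_duration(text):
    """Convert a string such as '13h44m14.7s' or '45m24.2s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?", text)
    if not text or not match:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


# (idle, busy, total) taken from the "total" row of each helper summary.
TOTALS = {
    "Helper 1": ("13h44m58.1s", "44m40.2s", "14h29m38.3s"),
    "Helper 2": ("13h44m50.3s", "44m50.1s", "14h29m40.4s"),
    "Helper 3": ("13h44m14.7s", "45m24.2s", "14h29m38.9s"),
}


def busy_fraction(idle, busy, total):
    """Return busy time as a fraction of total wall-clock time."""
    return parse_duration(busy) / parse_duration(total)


if __name__ == "__main__":
    for helper, row in TOTALS.items():
        print(f"{helper}: busy {busy_fraction(*row):.1%} of total")
```

On these numbers each helper was busy for roughly 5% of the wall-clock time and idle for the rest.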

eriktaubeneck commented 1 month ago

Looks like some of these happen occasionally (though not that often) in draft queries as well.

akoshelev commented 4 weeks ago

I think this may be because the ports are open to the public, and these errors come from crawlers that scan all known ports and try to establish a TCP/TLS connection. You can confirm that by closing the ports on the firewall (VPC for AWS) to all connections except those from the known IPs.
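
Before changing firewall rules, the same hypothesis could be checked offline if the helper logs record the remote address next to each TLS error. The sketch below assumes they do, which the summaries in this issue do not show. The allowlist entries, sample events, and function names are all hypothetical placeholders.

```python
import ipaddress
from collections import Counter

# Hypothetical allowlist: replace with the real helper addresses or CIDR ranges.
KNOWN_PEERS = [
    ipaddress.ip_network("10.0.0.0/24"),
    ipaddress.ip_network("203.0.113.7/32"),
]


def is_known_peer(address, allowlist=KNOWN_PEERS):
    """Return True if the address falls inside any allowlisted network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in allowlist)


def split_errors(events, allowlist=KNOWN_PEERS):
    """Tally error messages separately for known and unknown peers.

    `events` is an iterable of (source_ip, error_message) pairs.
    """
    known, unknown = Counter(), Counter()
    for source_ip, message in events:
        bucket = known if is_known_peer(source_ip, allowlist) else unknown
        bucket[message] += 1
    return known, unknown


if __name__ == "__main__":
    # Made-up sample events using documentation-reserved addresses.
    sample = [
        ("10.0.0.5", "tls handshake eof"),
        ("198.51.100.9", "NoCipherSuitesInCommon"),
        ("198.51.100.9", "Tls12NotOffered"),
    ]
    known, unknown = split_errors(sample)
    print("known peers:", dict(known))
    print("unknown peers:", dict(unknown))
```

If nearly all errors land in the unknown bucket, that would support the crawler explanation. Errors from known peers would point back at the helpers themselves.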