zepumph opened this issue 1 year ago.
https://stackoverflow.com/questions/22665232/what-can-cause-chrome-to-give-an-neterr-failed-on-cached-content-against-a-ser helped me reproduce this while recording Chrome network traffic. It was important to reproduce with a link like this (from CT) and this Chrome recording link:
chrome://net-export (view the log with https://netlog-viewer.appspot.com/#import)
Here is the log viewer output for the failed request:
That is not very helpful to me, but I guess it is good to note.
@mattpen, could we please work to brainstorm what may be happening here? I can't figure out whether this request is even reaching nginx.
It looks like the 404 is correct to me. Did a snapshot get deleted or renamed somehow?
```
[mape5853@sparky ct-snapshots]$ cd 1684429403510
-bash: cd: 1684429403510: No such file or directory
[mape5853@sparky ct-snapshots]$ pwd
/data/share/phet/continuous-testing/ct-main/ct-snapshots
[mape5853@sparky ct-snapshots]$
```
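For the "did a snapshot get deleted?" question, a small Node sketch like this could check whether the snapshot a CT URL points at still exists on disk. It assumes, based only on the shell session and URLs in this thread, that each snapshot is a directory named by its timestamp directly under the ct-snapshots root. `snapshotIdFromUrl` and `snapshotExists` are hypothetical helpers, not part of aqua.

```javascript
// Sketch: does the snapshot a CT URL points at still exist on disk?
// Assumption: each snapshot is a directory named by its timestamp, directly
// under the ct-snapshots root (inferred from the shell session and URLs here).
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: pull the numeric segment after /ct-snapshots/ out of a URL.
function snapshotIdFromUrl(urlString) {
  const { pathname } = new URL(urlString);
  const match = pathname.match(/\/ct-snapshots\/(\d+)\//);
  return match ? match[1] : null;
}

// Hypothetical helper: true only if <snapshotsRoot>/<id> exists and is a directory.
function snapshotExists(snapshotsRoot, urlString) {
  const id = snapshotIdFromUrl(urlString);
  if (id === null) {
    return false;
  }
  const dir = path.join(snapshotsRoot, id);
  return fs.existsSync(dir) && fs.statSync(dir).isDirectory();
}
```

For example, `snapshotExists('/data/share/phet/continuous-testing/ct-main/ct-snapshots', url)` run on sparky at the time of the session above would have returned false for any URL under the 1684429403510 snapshot.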
@mattpen, @chrisklus, and @zepumph had a meeting. In it, we looked through the nginx error logs and found a fairly consistent error:
```
SSL_read() failed (SSL: error:0A000126:SSL routines::unexpected eof while reading) while processing HTTP/2 connection
```
After some searching, we updated to nginx 1.22. That error went away, but it didn't solve our problem.
We didn't make any more progress with Chrome's net-export tool, as there really isn't much else to see there.
The most useful thing we found is that this is only a problem in Chrome, not Firefox or Safari, so we would like to switch bayes to Firefox as a workaround and see if that silences these errors.
Thanks @mattpen and @chrisklus for all your time!
When I tried to run Firefox on bayes (RHEL 7), I ran into an error that pointed me to https://access.redhat.com/solutions/2853631 and https://bugzilla.redhat.com/show_bug.cgi?id=1369859.
So until we figure out the RHEL 8 upgrade, I'm just going to turn off the bayes clients.
All chrome clients have been turned off on bayes. Unassigning until https://github.com/phetsims/special-ops/issues/242 is done.
Interestingly enough, it looks like this has to do with using https://sparky.colorado.edu as the URL: on sparky, I swapped its local clients from 127.0.0.1 to that URL and ran into this again. So perhaps this has something to do with things outside of our node server. I'm not sure how to proceed, but would like to touch base with @mattpen one more time.
Does it happen if you use 128.138.93.172? If not, it may be a problem with the DNS server or queries. If it does happen with this IP, I'm really confused, because after the DNS lookup the request should follow essentially the same path as localhost or 127.0.0.1.
```
[mape5853@sparky ~]$ nslookup sparky.colorado.edu
Server: 128.138.240.1
Address: 128.138.240.1#53
Name: sparky.colorado.edu
Address: 128.138.93.172
[mape5853@sparky ~]$ tracepath 128.138.93.172
1: sparky.colorado.edu 0.082ms reached
Resume: pmtu 65535 hops 1 back 1
[mape5853@sparky ~]$ tracepath 127.0.0.1
1: localhost 0.091ms reached
Resume: pmtu 65535 hops 1 back 1
```
I'll try it out!
When I use https://128.138.93.172/ as the server name for clients on bayes (testing sparky), I get this error:
```
Fatal error: Hostname/IP does not match certificate's altnames: IP: 128.138.93.172 is not in the cert's list
```
This is not an issue when using clients directly from sparky.
I am also able to ping the IP address. Thoughts?
~Might need to add 128.138.93.172 as a hostname in the nginx config?~
> Hostname/IP does not match certificate's altnames:
I think we might need to generate a new SSL cert that has the IP as an alternate name. Maybe we should just use HTTP instead of SSL. I wonder if there is a way we can allow HTTP from a specific set of IPs instead of the internet in general. Or maybe it doesn't even matter: the PhET sims are accessible over HTTP and we have no plan to change that, e.g. http://phet.colorado.edu/sims/html/friction/latest/friction_all.html
I am blocked from using HTTP because, for testing, I need to be able to postMessage to https://phet-io.colorado.edu over in https://github.com/phetsims/phet-io/issues/1944.
It looks like http://phet-io.colorado.edu/sims/phet-io-test-sim/2.12.0/ redirects to https, so perhaps I'm still blocked on this?
CT has not been testing since yesterday because of the altnames issue:
```
166|ct-nod | 2023-06-27T14:36:46: (node:4105856) Warning: Accessing non-existent property 'padLevels' of module exports inside circular dependency
166|ct-nod | (Use `node --trace-warnings ...` to show where the warning was created)
166|ct-nod | You have triggered an unhandledRejection, you may have forgotten to catch a Promise rejection:
166|ct-nod | Error [ERR_TLS_CERT_ALTNAME_INVALID]: Hostname/IP does not match certificate's altnames: IP: 128.138.93.172 is not in the cert's list:
166|ct-nod |     at new NodeError (node:internal/errors:399:5)
166|ct-nod |     at Object.checkServerIdentity (node:tls:337:12)
166|ct-nod |     at TLSSocket.onConnectSecure (node:_tls_wrap:1550:27)
166|ct-nod |     at TLSSocket.emit (node:events:513:28)
166|ct-nod |     at TLSSocket._finishInit (node:_tls_wrap:959:8)
166|ct-nod |     at ssl.onhandshakedone (node:_tls_wrap:743:12)
```
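The ct-node stack trace goes through Node's `tls.checkServerIdentity`, which compares the host being connected to against the certificate's subject alternative names. A minimal sketch that reproduces the error offline by calling it directly; the two certificate objects are hypothetical stand-ins holding only the `subject` and `subjectaltname` fields the check reads, not sparky's actual cert.

```javascript
// Sketch: reproduce ERR_TLS_CERT_ALTNAME_INVALID without a network connection.
// The cert objects below are hypothetical stand-ins, not sparky's real cert.
const tls = require('node:tls');

// A cert whose only SAN is the DNS name.
const dnsOnlyCert = {
  subject: { CN: 'sparky.colorado.edu' },
  subjectaltname: 'DNS:sparky.colorado.edu',
};

// The same cert with the IP added as an "IP Address" SAN.
const dnsAndIpCert = {
  subject: { CN: 'sparky.colorado.edu' },
  subjectaltname: 'DNS:sparky.colorado.edu, IP Address:128.138.93.172',
};

// tls.checkServerIdentity returns undefined on a match and an Error otherwise.
function describeMatch(host, cert) {
  const err = tls.checkServerIdentity(host, cert);
  return err === undefined ? 'ok' : `${err.code}: ${err.message}`;
}

console.log(describeMatch('sparky.colorado.edu', dnsOnlyCert));
console.log(describeMatch('128.138.93.172', dnsOnlyCert));
console.log(describeMatch('128.138.93.172', dnsAndIpCert));
```

The second call fails with the same message as the log: when the host is an IP address, Node only accepts a matching `IP Address:` SAN, and a `DNS:` entry for sparky.colorado.edu does not count.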
How do we regenerate the SSL cert? Want to pair on this?
The alternative, I think, would be to reach out to OIT about the general DNS issue (which I believe we both barely understand) and see if we get traction on that. Do you have a preference?
I changed the default :80 server block in nginx.conf so it no longer redirects HTTP to HTTPS:
```nginx
server {
    listen 80;
    listen [::]:80;
    server_name sparky.colorado.edu;
    include /etc/nginx/default.d/*.conf;
    #return 301 https://sparky.colorado.edu$request_uri;
}
```
I confirmed that this is no longer enforcing redirects from http://128.138.93.172 to https://sparky.colorado.edu.
Before:
```
$ curl -I 'http://128.138.93.172/continuous-testing/ct-snapshots/1687902130737/build-a-fraction/build-a-fraction_en.html?continuousTest=%7B%22test%22%3A%5B%22build-a-fraction%22%2C%22xss-fuzz%22%5D%2C%22snapshotName%22%3A%22snapshot-1687902130737%22%2C%22timestamp%22%3A1687902383452%7D&brand=phet&ea&fuzz&stringTest=xss'
HTTP/1.1 301 Moved Permanently
Server: nginx/1.22.1
Date: Tue, 27 Jun 2023 22:05:11 GMT
Content-Type: text/html
Content-Length: 169
Connection: keep-alive
Location: https://sparky.colorado.edu/continuous-testing/ct-snapshots/1687902130737/build-a-fraction/build-a-fraction_en.html?continuousTest=%7B%22test%22%3A%5B%22build-a-fraction%22%2C%22xss-fuzz%22%5D%2C%22snapshotName%22%3A%22snapshot-1687902130737%22%2C%22timestamp%22%3A1687902383452%7D&brand=phet&ea&fuzz&stringTest=xss
```
After:
```
$ curl -I 'http://128.138.93.172/continuous-testing/ct-snapshots/1687902130737/build-a-fraction/build-a-fraction_en.html?continuousTest=%7B%22test%22%3A%5B%22build-a-fraction%22%2C%22xss-fuzz%22%5D%2C%22snapshotName%22%3A%22snapshot-1687902130737%22%2C%22timestamp%22%3A1687902383452%7D&brand=phet&ea&fuzz&stringTest=xss'
HTTP/1.1 200 OK
Server: nginx/1.22.1
Date: Tue, 27 Jun 2023 22:07:30 GMT
Content-Type: text/html
Content-Length: 4896
Last-Modified: Tue, 27 Jun 2023 21:42:28 GMT
Connection: keep-alive
ETag: "649b57c4-1320"
Accept-Ranges: bytes
```
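The same before/after check could be scripted from Node, e.g. to confirm from a CT client machine that the :80 block is no longer redirecting. A minimal sketch using the built-in `http` module; `headCheck` and `isHttpsRedirect` are hypothetical helper names, not existing CT code.

```javascript
// Sketch: the curl -I check done from Node with the built-in http module.
// Reports the status code and Location header so an http->https redirect is visible.
const http = require('node:http');

// Hypothetical helper: send a HEAD request and resolve with status + Location.
function headCheck(url) {
  return new Promise((resolve, reject) => {
    const req = http.request(url, { method: 'HEAD' }, (res) => {
      res.resume(); // drain the (empty) response so the socket is released
      resolve({ statusCode: res.statusCode, location: res.headers.location ?? null });
    });
    req.on('error', reject);
    req.end();
  });
}

// Hypothetical helper: a 3xx pointing at https:// means the :80 block still escalates.
function isHttpsRedirect({ statusCode, location }) {
  return statusCode >= 300 && statusCode < 400 &&
    location !== null && location.startsWith('https://');
}
```

For the BEFORE request this would report a 301 with an https Location (a redirect), and for the AFTER request a 200 with no Location.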
@zepumph -- I hope this might be helpful. If it's not and you'd like to revert, the previous version of the conf is saved in /etc/nginx/nginx.conf.bk_06272023
I don't understand the comment about http->https redirects for phet-io.colorado.edu. Are you getting the net err failed problems for that domain too? If so, this is likely a browser issue and not a webserver or DNS server problem.
If we need to add the IP as an alt name to the SSL cert for sparky, we can request a new cert from OIT and then install it. My notes that briefly describe how to do this are here: https://github.com/phetsims/website#updating-an-ssl-cert . I'm not sure they will be useful to anyone other than Matt Pennington; OIT's instructions for requesting certs are here: https://oit.colorado.edu/services/web-content-applications/ssl-certificates and nginx's instructions for installing certs are here: http://nginx.org/en/docs/http/configuring_https_servers.html. But I'd be happy to update the cert.
After discussing with @mattpen:
Thanks @mattpen for working with me so much on this.
A new SSL cert with the IP included as a SAN (Subject Alternative Name) has been requested from OIT.
Quoting https://github.com/phetsims/aqua/issues/185#issuecomment-1608381570:

> I am blocked by http because for testing, I need to be able to postMessage to https://phet-io.colorado.edu/
In https://github.com/phetsims/phet-io/issues/1944#issuecomment-1612160549 this was solved by https://github.com/phetsims/aqua/commit/f838cc4649b0aba4ab0986804f1be043aefed28a. Sorry for not relating that piece of the puzzle over here. I believe that I am fully unblocked by using http, but I still think we should continue investigating where and how the ERR_FAILED errors are occurring so that this doesn't bite us worse at some point in the future.
Let me know when the static IP is ready for testing on https.
It looks like IP addresses are not eligible to be added as SANs in the Sectigo tool provided by CU, so we won't be able to test https://{{ip address}} unless we self-sign a cert, which is likely to raise additional errors.
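Since an IP SAN isn't available, one alternative (not something tried in this thread) for Node-side requests, like the ct-node process in the log above, would be to keep the CA-signed cert, connect by IP, and verify the certificate against sparky.colorado.edu instead of the IP. A sketch, assuming Node's `https.request`, which passes TLS options such as `servername` and `checkServerIdentity` through to `tls.connect`; `EXPECTED_HOST` and the helper names are hypothetical. This would not help the browsers loading sims, which do their own certificate checks.

```javascript
// Sketch (untested against sparky): connect by IP, but verify the server's
// certificate against a DNS name that is actually in its SANs.
// EXPECTED_HOST and the helper names are hypothetical.
const https = require('node:https');
const tls = require('node:tls');

const EXPECTED_HOST = 'sparky.colorado.edu';

// Used as the checkServerIdentity TLS option: ignore the host Node passes in
// and run Node's standard check against EXPECTED_HOST instead.
function checkAgainstExpectedHost(_host, cert) {
  return tls.checkServerIdentity(EXPECTED_HOST, cert);
}

// Request options that address the server by IP but send EXPECTED_HOST as the
// TLS SNI name (so the server can pick the cert for that name) and as the
// HTTP Host header.
function requestOptionsForIp(ip, requestPath) {
  return {
    host: ip,
    port: 443,
    path: requestPath,
    method: 'HEAD',
    servername: EXPECTED_HOST,
    headers: { Host: EXPECTED_HOST },
    checkServerIdentity: checkAgainstExpectedHost,
  };
}

// Resolves with the status code of a HEAD request made this way.
function headViaIp(ip, requestPath) {
  return new Promise((resolve, reject) => {
    const req = https.request(requestOptionsForIp(ip, requestPath), (res) => {
      res.resume(); // discard any body so the socket is released
      resolve(res.statusCode);
    });
    req.on('error', reject);
    req.end();
  });
}
```

Usage would be something like `headViaIp('128.138.93.172', '/continuous-testing/').then(console.log)` from a machine that can reach sparky. The certificate chain is still validated normally; only the hostname comparison is pointed at the DNS name.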
I believe that once we have upgraded bayes over in https://github.com/phetsims/special-ops/issues/242 this issue is worth another look. It would be nice to have bayes clients for CT back in here.
About half of the errors on CT are this one right now.