jhuckaby / Cronicle

A simple, distributed task scheduler and runner with a web based UI.
http://cronicle.net

get_job_status fails if it connects to backup master #665

Open kranskydog opened 11 months ago

kranskydog commented 11 months ago

Summary

The get_job_status API fails if it happens to hit the backup master in a multi-server config. This does not happen with other API calls.

Steps to reproduce the problem

1) Create a multi-server setup with primary and backup masters.
2) Call the get_job_status API on the primary master: works.
3) Call the get_job_status API on the backup master: fails with the "protocol violation" error shown in the attached screenshot.

Your Setup

- VirtualBox
- Cronicle 0.9.38
- 2 master servers (primary, backup), accessed via round-robin DNS (Oracle SCAN IPs) to a virtual hostname
- conf/config.json has "web_direct_connect": true
- 2 other worker servers
- Can connect to the web console via the virtual hostname and everything works as expected
- Can use other APIs against both master nodes and they work correctly (see attached screenshots)

Operating system and version?

```
[root@orcl01 ~]# cat /etc/oracle-release
Oracle Linux Server release 7.9
[root@orcl01 ~]# uname -a
Linux orcl01.example.com 5.4.17-2136.324.5.3.el7uek.x86_64 #2 SMP Tue Oct 10 12:44:19 PDT 2023 x86_64 x86_64 x86_64 GNU/Linux
```

Node.js version?

v16.20.2

Cronicle software version?

0.9.38

Are you using a multi-server setup, or just a single server?

Multi

Are you using the filesystem as back-end storage, or S3/Couchbase?

filesystem (cluster)

Can you reproduce the crash consistently?

yes

Log Excerpts

Can't see anything specific

jhuckaby commented 11 months ago

Okay, so, here is the thing. The get_job_status API is actually working as designed. This API only works on the master node. If you hit a backup node, it returns an HTTP 302 redirect over to the master. This is explained in the docs here:

https://github.com/jhuckaby/Cronicle/blob/master/docs/APIReference.md#redirects

I cannot explain why you are seeing that weird "protocol violation" error, or where it is even coming from. Some kind of proxy server you have in the middle which isn't expecting an HTTP 302? Dunno.

Anyway, here is the thing. The get_history API, which you cite as an example of something working correctly, is actually not 😝. That API fails to check whether the current server is master before running, which is a bug.

I will fix that.
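For clients that can't rely on their HTTP library following redirects automatically (curl needs -L, and some proxies mishandle a 302), the documented redirect behavior can be handled explicitly. Here is a minimal client-side sketch of that flow; the helper name and example URLs are hypothetical, not Cronicle code:

```python
# Sketch of the documented behavior: a backup node answers get_job_status
# with HTTP 302 and a Location header pointing at the master, and the
# client should retry the request there.
def resolve_master_url(status: int, headers: dict, url: str) -> str:
    """Return the URL to retry: the redirect target if the server
    redirected us, otherwise the original URL."""
    if status in (301, 302, 307, 308) and "Location" in headers:
        return headers["Location"]
    return url

# Hypothetical backup and master endpoints for illustration only.
backup = "http://orcl02.example.com:3012/api/app/get_job_status/v1?id=example"
master = "http://orcl01.example.com:3012/api/app/get_job_status/v1?id=example"
print(resolve_master_url(302, {"Location": master}, backup))
```

With curl, passing -L achieves the same thing by following the Location header automatically.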

kranskydog commented 11 months ago

Hmmmm

```
[apache@apchop01 ~]$ wget "http://orcl02.example.com:3012/api/app/get_event/v1/?api_key=a44e89551e0232b8e7aab002147c357e&id=elo3mvp8h02"
--2023-11-02 10:49:07--  http://orcl02.example.com:3012/api/app/get_event/v1/?api_key=a44e89551e0232b8e7aab002147c357e&id=elo3mvp8h02
Resolving orcl02.example.com (orcl02.example.com)... 192.168.56.55
Connecting to orcl02.example.com (orcl02.example.com)|192.168.56.55|:3012... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://::ffff:192.168.56.50:3012/api/app/get_event/v1/?api_key=a44e89551e0232b8e7aab002147c357e&id=elo3mvp8h02 [following]
http://::ffff:192.168.56.50:3012/api/app/get_event/v1/?api_key=a44e89551e0232b8e7aab002147c357e&id=elo3mvp8h02: Invalid host name.
```

IPV6?

kranskydog commented 11 months ago

```
[apache@apchop01 ~]$ curl -v -L "http://orcl02.example.com:3012/api/app/get_event/v1/?api_key=a44e89551e0232b8e7aab002147c357e&id=elo3mvp8h02"
```

jhuckaby commented 11 months ago

Okay, that is really bizarre. Your backup server thinks that the master server's IP address is ::ffff:192.168.56.50. I've never seen that before.

What does your server data look like? Try:

```
/opt/cronicle/bin/storage-cli.js list_get global/servers
```

Are the IPs munged in there as well? I'm still trying to fathom how this could possibly have happened.

kranskydog commented 11 months ago

```
[root@orcl02 cronicle]# /opt/cronicle/bin/storage-cli.js list_get global/servers
Got 4 items.
Items from list: global/servers:
[
  { "hostname": "orcl02.example.com", "ip": "192.168.56.55" },
  { "hostname": "orcl01.example.com", "ip": "192.168.56.50" },
  { "hostname": "orclxe.example.com", "ip": "192.168.56.25" },
  { "hostname": "apchop01.example.com", "ip": "192.168.56.30" }
]
```

jhuckaby commented 11 months ago

Okay thanks, all normal there. I'll have to dig into this when I have some time. That is really a weird bug.

kranskydog commented 11 months ago

OTOH

```
[root@orcl02 cronicle]# netstat -anp | grep Cronicle
tcp6       0      0 :::3012                 :::*                    LISTEN      772/Cronicle Server
tcp6       0      0 192.168.56.55:3012      192.168.56.50:27976     ESTABLISHED 772/Cronicle Server
udp        0      0 0.0.0.0:3014            0.0.0.0:*                           772/Cronicle Server
```

So, it seems that because Cronicle is bound to an IPv6 wildcard address, any connection it accepts appears to come from an IPv6 address, so it thinks everything needs to be an IPv6 address: https://nodejs.org/dist/latest-v4.x/docs/api/http.html#http_server_listen_port_hostname_backlog_callback
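To illustrate, `::ffff:192.168.56.50` is an IPv4-mapped IPv6 address, which is how IPv4 peers appear when a server listens on the IPv6 wildcard `::`. A small sketch using Python's standard `ipaddress` module (Python is used here just to demonstrate the mapping, it is not part of Cronicle):

```python
import ipaddress

# "::ffff:192.168.56.50" is an IPv4-mapped IPv6 address (RFC 4291):
# this is the form in which IPv4 clients appear to a dual-stack socket
# bound to "::", and it is what leaked into the redirect URL above.
addr = ipaddress.ip_address("::ffff:192.168.56.50")
print(addr.ipv4_mapped)  # recovers the plain IPv4 address
```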

kranskydog commented 11 months ago

Setting "server_comm_use_hostnames": true and "web_socket_use_hostnames": true helps.
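For reference, the workaround would look something like this as a fragment of conf/config.json, merged with the settings already there (a sketch, not a complete config):

```json
{
  "server_comm_use_hostnames": true,
  "web_socket_use_hostnames": true
}
```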

kranskydog commented 11 months ago

http://www.tcpipguide.com/free/t_IPv6IPv4AddressEmbedding-2.htm

jhuckaby commented 11 months ago

Okay, thank you for all this info. I'll dig in as soon as I have time.