jhuckaby / Cronicle

A simple, distributed task scheduler and runner with a web based UI.
http://cronicle.net

get_job_status fails if it connects to backup master #665

Open kranskydog opened 11 months ago

kranskydog commented 11 months ago

Summary

The get_job_status API fails if it happens to hit the backup master in a multi-server config. This does not happen with other API calls.

Steps to reproduce the problem

1) Create a multi-server setup with primary and backup masters.
2) Call the get_job_status API on the primary master: works.
3) Call the get_job_status API on the backup master: fails with the "protocol violation" error shown in the attached screenshot.

Your Setup

- VirtualBox
- Cronicle 0.9.38
- 2 master servers (primary, backup), accessed via round-robin DNS (Oracle SCAN IPs) to a virtual hostname
- conf/config.json has "web_direct_connect": true
- 2 other worker servers
- Can connect to the web console via the virtual hostname and everything works as expected
- Can use other APIs against both master nodes and they work correctly (see attached screenshots)

Operating system and version?

```
[root@orcl01 ~]# cat /etc/oracle-release
Oracle Linux Server release 7.9
[root@orcl01 ~]# uname -a
Linux orcl01.example.com 5.4.17-2136.324.5.3.el7uek.x86_64 #2 SMP Tue Oct 10 12:44:19 PDT 2023 x86_64 x86_64 x86_64 GNU/Linux
```

Node.js version?

v16.20.2

Cronicle software version?

0.9.38

Are you using a multi-server setup, or just a single server?

Multi

Are you using the filesystem as back-end storage, or S3/Couchbase?

filesystem (cluster)

Can you reproduce the crash consistently?

yes

Log Excerpts

Can't see anything specific

jhuckaby commented 11 months ago

Okay, so, here is the thing. The get_job_status API is actually working as designed. This API only works on the master node. If you hit a backup node, it returns an HTTP 302 redirect over to the master. This is explained in the docs here:

https://github.com/jhuckaby/Cronicle/blob/master/docs/APIReference.md#redirects

I cannot explain why you are seeing that weird "protocol violation" error, or where it is even coming from. Some kind of proxy server you have in the middle which isn't expecting an HTTP 302? Dunno.

Anyway, here is the thing. The get_history API, which you cite as an example of something working correctly, is actually not 😝. That API fails to check whether the current server is master before running, which is a bug.

I will fix that.
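For clients that can't rely on their HTTP library following redirects automatically (curl needs -L, and some proxies mishandle a 302), the documented redirect behavior can be handled explicitly. Here is a minimal client-side sketch of that flow; the helper name and example URLs are hypothetical, not Cronicle code:

```python
# Sketch of the documented behavior: a backup node answers get_job_status
# with HTTP 302 and a Location header pointing at the master, and the
# client should retry the request there.
def resolve_master_url(status: int, headers: dict, url: str) -> str:
    """Return the URL to retry: the redirect target if the server
    redirected us, otherwise the original URL."""
    if status in (301, 302, 307, 308) and "Location" in headers:
        return headers["Location"]
    return url

# Hypothetical backup and master endpoints for illustration only.
backup = "http://orcl02.example.com:3012/api/app/get_job_status/v1?id=example"
master = "http://orcl01.example.com:3012/api/app/get_job_status/v1?id=example"
print(resolve_master_url(302, {"Location": master}, backup))
```

With curl, passing -L achieves the same thing by following the Location header automatically.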

kranskydog commented 11 months ago

Hmmmm

```
[apache@apchop01 ~]$ wget "http://orcl02.example.com:3012/api/app/get_event/v1/?api_key=a44e89551e0232b8e7aab002147c357e&id=elo3mvp8h02"
--2023-11-02 10:49:07--  http://orcl02.example.com:3012/api/app/get_event/v1/?api_key=a44e89551e0232b8e7aab002147c357e&id=elo3mvp8h02
Resolving orcl02.example.com (orcl02.example.com)... 192.168.56.55
Connecting to orcl02.example.com (orcl02.example.com)|192.168.56.55|:3012... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://::ffff:192.168.56.50:3012/api/app/get_event/v1/?api_key=a44e89551e0232b8e7aab002147c357e&id=elo3mvp8h02 [following]
http://::ffff:192.168.56.50:3012/api/app/get_event/v1/?api_key=a44e89551e0232b8e7aab002147c357e&id=elo3mvp8h02: Invalid host name.
```

IPV6?

kranskydog commented 11 months ago

```
[apache@apchop01 ~]$ curl -v -L "http://orcl02.example.com:3012/api/app/get_event/v1/?api_key=a44e89551e0232b8e7aab002147c357e&id=elo3mvp8h02"
```

jhuckaby commented 11 months ago

Okay, that is really bizarre. Your backup server thinks that the master server's IP address is ::ffff:192.168.56.50. I've never seen that before.

What does your server data look like? Try:

```
/opt/cronicle/bin/storage-cli.js list_get global/servers
```

Are the IPs munged in there as well? I'm still trying to fathom how this could possibly have happened.

kranskydog commented 11 months ago

```
[root@orcl02 cronicle]# /opt/cronicle/bin/storage-cli.js list_get global/servers
Got 4 items.
Items from list: global/servers:
[
  { "hostname": "orcl02.example.com", "ip": "192.168.56.55" },
  { "hostname": "orcl01.example.com", "ip": "192.168.56.50" },
  { "hostname": "orclxe.example.com", "ip": "192.168.56.25" },
  { "hostname": "apchop01.example.com", "ip": "192.168.56.30" }
]
```

jhuckaby commented 11 months ago

Okay thanks, all normal there. I'll have to dig into this when I have some time. That is really a weird bug.

kranskydog commented 11 months ago

OTOH

```
[root@orcl02 cronicle]# netstat -anp | grep Cronicle
tcp6       0      0 :::3012                 :::*                    LISTEN      772/Cronicle Server
tcp6       0      0 192.168.56.55:3012      192.168.56.50:27976     ESTABLISHED 772/Cronicle Server
udp        0      0 0.0.0.0:3014            0.0.0.0:*                           772/Cronicle Server
```

So, it seems that because Cronicle is bound to an IPv6 wildcard address, any connection it accepts appears to come from an IPv6 address, so it thinks everything needs to be an IPv6 address: https://nodejs.org/dist/latest-v4.x/docs/api/http.html#http_server_listen_port_hostname_backlog_callback
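To illustrate, `::ffff:192.168.56.50` is an IPv4-mapped IPv6 address, which is how IPv4 peers appear when a server listens on the IPv6 wildcard `::`. A small sketch using Python's standard `ipaddress` module (Python is used here just to demonstrate the mapping, it is not part of Cronicle):

```python
import ipaddress

# "::ffff:192.168.56.50" is an IPv4-mapped IPv6 address (RFC 4291):
# this is the form in which IPv4 clients appear to a dual-stack socket
# bound to "::", and it is what leaked into the redirect URL above.
addr = ipaddress.ip_address("::ffff:192.168.56.50")
print(addr.ipv4_mapped)  # recovers the plain IPv4 address
```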

kranskydog commented 11 months ago

Setting "server_comm_use_hostnames": true and "web_socket_use_hostnames": true helps.
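For reference, the workaround would look something like this as a fragment of conf/config.json, merged with the settings already there (a sketch, not a complete config):

```json
{
  "server_comm_use_hostnames": true,
  "web_socket_use_hostnames": true
}
```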

kranskydog commented 11 months ago

http://www.tcpipguide.com/free/t_IPv6IPv4AddressEmbedding-2.htm

jhuckaby commented 11 months ago

Okay, thank you for all this info. I'll dig in as soon as I have time.