hobinjk / gateway-prometheus-translator

Translator (export) from the Web Thing API to Prometheus metrics
Mozilla Public License 2.0
12 stars 2 forks source link

RUNNING THE TRANSLATOR AS A SERVICE PROBLEMS #7

Open davidhitchins opened 3 years ago

davidhitchins commented 3 years ago

When the TRANSLATOR is started as a service automatically on boot up it runs (says it is active - see below) but in fact no webpage at port 3600 is produced. Then if you stop the service and restart it manually it works fine! (see further below) The error message is: UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The error seems to have its origin in the required EXPRESS library as there are no ASYNC funtions in tranlator.js that I can see(?) But the MYSTERY is why the error doesn't occur when I start the service manually. Isn't it the same sequence of processes?

Any suggestions or clues would really help. Thanks.

pi@xxxxxxxxxxxxx:~ $ sudo systemctl status mozilla-iot-gateway-prometheus-translator ● mozilla-iot-gateway-prometheus-translator.service - Mozilla IoT Gateway to Prometheus Translator Loaded: loaded (/etc/systemd/system/mozilla-iot-gateway-prometheus-translator.service; enabled; vendor preset: enabled) Active: active (running) since Thu 2020-08-13 10:26:36 AEST; 2min 21s ago Main PID: 561 (translator.sh) Tasks: 12 (limit: 4915) CGroup: /system.slice/mozilla-iot-gateway-prometheus-translator.service ├─561 /bin/bash /home/pi/mozilla-iot/gateway-prometheus-translator/translator.sh └─566 node translator.js

Aug 13 10:26:44 xxxxxxxxxxxxx translator.sh[561]: at TLSSocket.socketErrorListener (_http_client.js:401:9) Aug 13 10:26:44 xxxxxxxxxxxxx translator.sh[561]: at TLSSocket.emit (events.js:198:13) Aug 13 10:26:44 xxxxxxxxxxxxx translator.sh[561]: at errorOrDestroy (internal/streams/destroy.js:107:12) Aug 13 10:26:44 xxxxxxxxxxxxx translator.sh[561]: at onwriteError (_stream_writable.js:436:5) Aug 13 10:26:44 xxxxxxxxxxxxx translator.sh[561]: at onwrite (_stream_writable.js:461:5) Aug 13 10:26:44 xxxxxxxxxxxxx translator.sh[561]: at _destroy (internal/streams/destroy.js:49:7) Aug 13 10:26:44 xxxxxxxxxxxxx translator.sh[561]: at TLSSocket.Socket._destroy (net.js:614:3) Aug 13 10:26:44 xxxxxxxxxxxxx translator.sh[561]: at TLSSocket.destroy (internal/streams/destroy.js:37:8) Aug 13 10:26:44 xxxxxxxxxxxxx translator.sh[561]: (node:566) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function withou Aug 13 10:26:44 xxxxxxxxxxxxx translator.sh[561]: (node:566) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will ter

pi@xxxxxxxxxxxxx:~ $ sudo systemctl stop mozilla-iot-gateway-prometheus-translator pi@xxxxxxxxxxxxx:~ $ sudo systemctl start mozilla-iot-gateway-prometheus-translator pi@xxxxxxxxxxxxx:~ $ sudo systemctl status mozilla-iot-gateway-prometheus-translator ● mozilla-iot-gateway-prometheus-translator.service - Mozilla IoT Gateway to Prometheus Translator Loaded: loaded (/etc/systemd/system/mozilla-iot-gateway-prometheus-translator.service; enabled; vendor preset: enabled) Active: active (running) since Thu 2020-08-13 10:29:53 AEST; 7s ago Main PID: 1222 (translator.sh) Tasks: 12 (limit: 4915) CGroup: /system.slice/mozilla-iot-gateway-prometheus-translator.service ├─1222 /bin/bash /home/pi/mozilla-iot/gateway-prometheus-translator/translator.sh └─1223 node translator.js

Aug 13 10:29:53 xxxxxxxxxxxxx systemd[1]: Started Mozilla IoT Gateway to Prometheus Translator. Aug 13 10:29:55 xxxxxxxxxxxxx translator.sh[1222]: Translator listening to 3060, metrics exposed on /metrics endpoint pi@xxxxxxxxxxxxx:~ $

davidhitchins commented 3 years ago

Still trying to figure this one out. Any ideas please @hobinjk ? thanks

hobinjk commented 3 years ago

Thank you for your interest in the project! I've never encountered this particular bug before. I think it's possible that there's a bit of a race condition during startup. It might be possible to fix this by modifying install.sh to specify network-online.target instead of just network.target: https://github.com/hobinjk/gateway-prometheus-translator/blob/master/install.sh#L9

davidhitchins commented 3 years ago

We're getting there... I tried your suggestion @hobinjk but it made no difference. The weird part is that the service successfully loads and becomes active on startup but does not seem to generate the webpage at port 3600/metrics. I wondered if using PM2 as a startup manager instead of plain systemd might solve the problem. (load balancing, race conditions, etc) It didn't, but one thing it showed was that the metrics WERE supposed to be exposed on port 3600 according to the monitoring window generated by the command pm2 monit. (see attached screenshot). Then I realised that the page at ****:3600/metrics IS IN FACT being generated but that it is a blank page. When the server is stopped you get a ' PAGE NOT FOUND' message, not a blank page. This suggests something is going wrong with extracting the data from the Web of Things API (with the web socket or something like that?) I always get that pesky "unhandled promise rejection' message when the translator is started this way. Which script is producing this error message? It's a subtle one because if you stop then start the service manually this error goes away and all works perfectly (as I said before). What CAN it be? Fullscreen capture 19082020 101827 AM bmp

davidhitchins commented 3 years ago

NEW ISSUE. Is there some way to implement logging to find out why this is happening @hobinjk ?

Today translator.js stopped updating the page at port 3060/metrics but was shown by pm2 to still be up and working.

pi@xxxxxxxxxxxxx:~ $ pm2 list ┌─────┬───────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐ │ id │ name │ namespace │ version │ mode │ pid │ uptime │ ↺ │ status │ cpu │ mem │ user │ watching │ ├─────┼───────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤ │ 0 │ translator │ default │ 0.0.1 │ fork │ 1356 │ 17h │ 2 │ online │ 0% │ 48.4mb │ pi │ disabled │ └─────┴───────────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┴──────────┴──────────┴──────────┴──────────┘ RESTARTED THE SERVICE pi@xxxxxxxxxxxxx:~ $ pm2 restart /home/pi/mozilla-iot/gateway-prometheus-translator/translator.js Use --update-env to update environment variables [PM2] Applying action restartProcessId on app [/home/pi/mozilla-iot/gateway-prometheus-translator/translator.js](ids: 0) [PM2] translator ✓ ┌─────┬───────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐ │ id │ name │ namespace │ version │ mode │ pid │ uptime │ ↺ │ status │ cpu │ mem │ user │ watching │ ├─────┼───────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤ │ 0 │ translator │ default │ 0.0.1 │ fork │ 16925 │ 0s │ 3 │ online │ 0% │ 19.5mb │ pi │ disabled │ └─────┴───────────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┴──────────┴──────────┴──────────┴──────────┘ ON RESTART translator was updating again.

So if the translator script service is still running it doesn't mean that it is doing its job and updating the page. How to debug to find out what is wrong please? I feel that this new issue might have the same cause as the first one above.

hobinjk commented 3 years ago

console.log and related printing statements in translator.js and other files will show up in journalctl (or pm2's equivalent log system)

davidhitchins commented 3 years ago

Ok so I restarted the system with PM2 set to startup translator.js automatically then I read the following errors in the PM2 logs. (see below) Seems that an error occurs in node-fetch.js at line 1393 reject(new FetchError(request to ${request.url} failed, reason: ${err.message}, 'system', err)); But on stopping the service and restarting it again with PM2 manually, no error occurs at all. Any clues what to do please @hobinjk ? Is it something to do with systemd processes not having the necessary SSL permissions? (error 49 access denied)

LOG FROM PM2

0|translat | 2020-08-21T10:15:48: { FetchError: request to https://xxxxxxxxxxxxx.mozilla-iot.org/things failed, reason: write EPROTO 3066018256:error:14094419:SSL routines:ssl3_read_bytes:tlsv1 alert access denied:../deps/openssl/openssl/ssl/record/rec_layer_s3.c:1544:SSL alert number 49 0|translat | 2020-08-21T10:15:48: 0|translat | 2020-08-21T10:15:48: at ClientRequest. (/home/pi/mozilla-iot/gateway-prometheus-translator/node_modules/node-fetch/lib/index.js:1393:11) 0|translat | 2020-08-21T10:15:48: at ClientRequest.emit (events.js:198:13) 0|translat | 2020-08-21T10:15:48: at TLSSocket.socketErrorListener (_http_client.js:401:9) 0|translat | 2020-08-21T10:15:48: at TLSSocket.emit (events.js:198:13) 0|translat | 2020-08-21T10:15:48: at errorOrDestroy (internal/streams/destroy.js:107:12) 0|translat | 2020-08-21T10:15:48: at onwriteError (_stream_writable.js:436:5) 0|translat | 2020-08-21T10:15:48: at onwrite (_stream_writable.js:461:5) 0|translat | 2020-08-21T10:15:48: at _destroy (internal/streams/destroy.js:49:7) 0|translat | 2020-08-21T10:15:48: at TLSSocket.Socket._destroy (net.js:614:3) 0|translat | 2020-08-21T10:15:48: at TLSSocket.destroy (internal/streams/destroy.js:37:8) 0|translat | 2020-08-21T10:15:48: message: 0|translat | 2020-08-21T10:15:48: 'request to https://xxxxxxxxxxxxx.mozilla-iot.org/things failed, reason: write EPROTO 3066018256:error:14094419:SSL routines:ssl3_read_bytes:tlsv1 alert access denied:../deps/openssl/openssl/ssl/record/rec_layer_s3.c:1544:SSL alert number 49\n', 0|translat | 2020-08-21T10:15:48: type: 'system', 0|translat | 2020-08-21T10:15:48: errno: 'EPROTO', 0|translat | 2020-08-21T10:15:48: code: 'EPROTO' }

davidhitchins commented 3 years ago

I changed gateway to a local address to 192.168.x.xxx:8080 in translator.js (This line: const yourGatewayUrl = 'https://192.168.x.xxx:8080';) rather than https://*******.mozilla-iot.org because if the local wifi network goes down then the websocket will not reconnect (probably due to a timeout while trying to find mozilla-iot.org on the external web) and so the 3060/metrics page doesn't update anymore.

Rebooted. Still no luck but now, at least, an interruption to the local wifi network (turned modem off then on again) doesn't permanently stop the translator from working. Once the network is up again everything is sweet.

Here are the PM2 app logs after rebooting (still not gathering metrics from gateway)

0|translat | 2020-08-22T12:30:25: { FetchError: request to http://192.168.x.xxx:8080/things failed, reason: connect ECONNREFUSED 192.168.x.xxx:8080 0|translat | 2020-08-22T12:30:25: at ClientRequest. (/home/pi/mozilla-iot/gateway-prometheus-translator/node_modules/node-fetch/lib/index.js:1393:11) 0|translat | 2020-08-22T12:30:25: at ClientRequest.emit (events.js:198:13) 0|translat | 2020-08-22T12:30:25: at Socket.socketErrorListener (_http_client.js:401:9) 0|translat | 2020-08-22T12:30:25: at Socket.emit (events.js:198:13) 0|translat | 2020-08-22T12:30:25: at emitErrorNT (internal/streams/destroy.js:91:8) 0|translat | 2020-08-22T12:30:25: at emitErrorAndCloseNT (internal/streams/destroy.js:59:3) 0|translat | 2020-08-22T12:30:25: at process._tickCallback (internal/process/next_tick.js:63:19) 0|translat | 2020-08-22T12:30:25: message: 0|translat | 2020-08-22T12:30:25: 'request to http://192.168.x.xxx:8080/things failed, reason: connect ECONNREFUSED 192.168.x.xxx:8080', 0|translat | 2020-08-22T12:30:25: type: 'system', 0|translat | 2020-08-22T12:30:25: errno: 'ECONNREFUSED', 0|translat | 2020-08-22T12:30:25: code: 'ECONNREFUSED' }

davidhitchins commented 3 years ago

Hello @hobinjk . I am still battling with getting the translator service to start up automatically but keep getting the ECONNREFUSED message above. As I said, when I start the service manually (not from systemd) it works fine. But systemd just doesn't work. I found a similar situation for another javascript app on GitHub! https://github.com/GoogleChrome/lighthouse/issues/3275#issue-256169788 The situation is identical to mine except it's about another app. Any ideas how to adapt their solution to this current case? Thanks in advance.

davidhitchins commented 3 years ago

Well, after two weeks of trying to get the translator service to work on startup, I found a way. I came across it quite by accident. Remember I am not a coder but I do have a BIT of a clue. It seems it was simply the fact that translator.js is launched too soon: BEFORE the Mozilla Gateway server is UP and somewhere (not sure where) a timeout occurs resulting in the http connection to the server never being established and therefore never being upgraded to a operational web socket. The translator service is running a good minute before the gateway is up I noticed. So my workaround was simply to delay the startup of this service by inserting this line ExecStartPre=/bin/sleep 60 into the unit file before the ExecStart line. With this change it all works fine. But it strikes me as a rather dirty sort of hack. Shouldn't there be some more elegant way of handling this delay in the Javascript code? Any suggestions please @hobinjk ? [Unit] Description=Mozilla IoT Gateway to Prometheus Translator After=network-online.target

[Service] Type=simple StandardOutput=journal StandardError=journal User=root WorkingDirectory=/home/pi/mozilla-iot/gateway-prometheus-translator

Make the translator service wait 60 seconds so Mozilla Gateway is sure to be up

ExecStartPre=/bin/sleep 60

Edit this line, if needed, to set the correct path to node

ExecStart=node /home/pi/mozilla-iot/gateway-prometheus-translator/translator.js Restart=always RestartSec=10s

[Install] WantedBy=multi-user.target

hobinjk commented 3 years ago

I thought any proper fix would be difficult but your latest message finally gave me the flash of inspiration required to see a change that should work. I've pushed a simple commit which exits the server when the initial fetch fails. This should trigger pm2/systemd to restart the server every 10 seconds until it connects successfully.

Let me know if this doesn't work for you since I currently don't have a development environment set up to test this.

davidhitchins commented 3 years ago

BINGO! That works fine, @hobinjk , and fast. Just three little lines... 🥇

davidhitchins commented 3 years ago

OH DEAR. Don't get too excited. While it worked fine on my RPI4B I then set up another system on an RPI3 and there is still a problem. The translator now HALF WORKS on the RP3! On reboot the translator has an unexpected token error. But if you stop and start the service manually this error goes away and all is fine. See attach log file. Thanks. RPI3 problem with gateway prometheus translator.txt HOWEVER if I use the "dirty hack" above (putting ExecStartPre=/bin/sleep 60 in the UNIT file). all is well again. It really seems to be that the translator is still getting launched too soon for some processes there.