knes1 / elktail

Command line utility to query, search and tail EL (elasticsearch, logstash) logs
MIT License

elktail losing messages #11

Closed ghost closed 7 years ago

ghost commented 8 years ago

I have a case where elktail is losing messages. My application sends a bulk of messages to Elasticsearch, and I can verify that they arrive (I can see them in Kibana).

At the same time I have elktail running as: elktail application.name:\"@hpe/quixy\"

The good output would be:

2016-11-03T08:15:54.841Z info @hpe/quixy ---------------- Startup ---------------- %metadata
2016-11-03T08:15:54.848Z info @hpe/quixy Registered static route /quixy/css for /home/stheine/js/quixy/css %metadata
2016-11-03T08:15:54.864Z info @hpe/quixy Registered route get:/quixy/queryActiveOwner %metadata
2016-11-03T08:15:54.865Z info @hpe/quixy Registered route get:/quixy/test %metadata
2016-11-03T08:15:54.865Z info @hpe/quixy Registered route get:/quixy/quixy %metadata
2016-11-03T08:15:54.866Z info @hpe/quixy Registered route get:/quixy/view %metadata
2016-11-03T08:15:54.954Z info @hpe/quixy Server type: dev url: http://suotrddocker001g.hpeswlab.net:7195 ext: http://localhost/quixy %metadata

Sometimes some lines are lost, like here (note that Registered route get:/quixy/quixy is missing):

2016-11-03T08:16:35.706Z info @hpe/quixy ---------------- Startup ---------------- %metadata
2016-11-03T08:16:35.714Z info @hpe/quixy Registered static route /quixy/css for /home/stheine/js/quixy/css %metadata
2016-11-03T08:16:35.729Z info @hpe/quixy Registered route get:/quixy/queryActiveOwner %metadata
2016-11-03T08:16:35.730Z info @hpe/quixy Registered route get:/quixy/test %metadata
2016-11-03T08:16:35.732Z info @hpe/quixy Registered route get:/quixy/view %metadata
2016-11-03T08:16:35.823Z info @hpe/quixy Server type: dev url: http://suotrddocker001g.hpeswlab.net:7195 ext: http://localhost/quixy %metadata

When I add more filters, like in:

elktail application.name:\"@hpe/audit-server\" OR application.name:\"@hpe/source\" OR application.name:\"@hpe/hellfire-support\" OR application.name:\"@hpe/quixy\"

I see more cases of lost messages, so it seems like the filter handling is related to this issue.

knes1 commented 8 years ago

After looking into the code, it seems this can sometimes happen if there are multiple messages at exactly the same timestamp while elktail is running in follow mode. I've started working on a fix.

knes1 commented 8 years ago

Unfortunately, this is more difficult than it appears. Elasticsearch apparently does not guarantee the order of log messages that arrive within the same millisecond, and it does not support timestamps at a resolution higher than one millisecond. https://github.com/elastic/elasticsearch/issues/10005

I will attempt to resolve this when tailing the log using the algorithm described below (a sketch follows the list):

1) Tailing the log works by issuing query after query to Elasticsearch; each query continues from the timestamp of the last log entry returned by the previous query.
2) After a query completes and its entries are printed, check whether multiple log entries at the end of the results share the same timestamp; if so, store them.
3) Issue the next query, continuing from (and including) the timestamp of the previous last log entry.
4) Remove the entries stored in step 2) from the results and continue the algorithm from step 2).
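A minimal Go sketch of that follow loop, under the assumption that the query helper returns entries oldest first and that each entry has a unique document ID. The names queryFrom and printEntry are hypothetical; this is not elktail's actual code:

```go
package tailsketch

import "time"

// LogEntry is a simplified stand-in for a log document returned by Elasticsearch.
type LogEntry struct {
	ID        string    // unique document ID
	Timestamp time.Time // millisecond resolution at best
	Message   string
}

// queryFrom is a hypothetical helper assumed to return all entries with
// Timestamp >= since, ordered oldest first.
func queryFrom(since time.Time) []LogEntry { return nil }

// printEntry is a hypothetical helper that formats and prints one entry.
func printEntry(e LogEntry) {}

// follow implements the four steps described above.
func follow(start time.Time) {
	since := start
	boundary := map[string]bool{} // step 2: entries already printed at the boundary timestamp

	for {
		entries := queryFrom(since) // steps 1 and 3: continue from (and include) the last timestamp
		for _, e := range entries {
			if boundary[e.ID] { // step 4: skip entries printed in the previous round
				continue
			}
			printEntry(e)
		}
		if len(entries) > 0 {
			last := entries[len(entries)-1].Timestamp
			// step 2: remember every trailing entry that shares the last timestamp,
			// because the next query (which includes that timestamp) returns them again
			boundary = map[string]bool{}
			for _, e := range entries {
				if e.Timestamp.Equal(last) {
					boundary[e.ID] = true
				}
			}
			since = last
		}
		time.Sleep(time.Second) // polling interval between queries
	}
}
```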

knes1 commented 8 years ago

@stefanheine could you test whether the latest version committed to the repo fixes the missing log entries for you?

ghost commented 8 years ago

I installed elktail using: go get github.com/knes1/elktail and then started the newly built version.

I can still reproduce the issue in that version.

That binary is still reporting its version as 0.1.5. Do I need to do something else to get 0.1.6?

knes1 commented 7 years ago

I suspect the missing entries happen because some logs are "arriving late" to Elasticsearch (e.g. elktail has already printed entries and moved past a timestamp by the time a log entry with that timestamp gets processed by Elasticsearch). If that's the issue, it could be alleviated by introducing a bit of delay to give Elasticsearch time to process the newly arrived logs.
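As a rough illustration of that delay idea (an assumption about how it could look, not something elktail does today), the follow query could cap its upper bound at "now minus a small grace period" so documents that Elasticsearch is still indexing get picked up on a later pass. The @timestamp field name and the 2-second delay below are illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

func main() {
	const indexDelay = 2 * time.Second // assumed grace period for late-indexed documents

	since := time.Now().Add(-10 * time.Second) // timestamp of the last entry already printed
	until := time.Now().Add(-indexDelay)       // never query past this point yet

	// Standard Elasticsearch range filter; only documents the cluster has had
	// at least indexDelay to index fall inside it.
	rangeFilter := map[string]interface{}{
		"range": map[string]interface{}{
			"@timestamp": map[string]string{
				"gte": since.UTC().Format(time.RFC3339Nano),
				"lte": until.UTC().Format(time.RFC3339Nano),
			},
		},
	}
	body, _ := json.MarshalIndent(rangeFilter, "", "  ")
	fmt.Println(string(body))
}
```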

@stefanheine could you describe how you collect the logs in your system? Are they collected by logstash from a single file? Are there multiple hosts involved? Do hosts have their clocks synchronized? If you limit the query to only a single log file/log source, does the issue still happen?

ghost commented 7 years ago

Here is how to reproduce it:

app.js:

'use strict';

// https://www.npmjs.com/package/flaschenpost
const flaschenpost = require('flaschenpost');

const logger = flaschenpost.getLogger();

logger.info('0');
logger.info('1');
logger.info('2');
logger.info('3');
logger.info('4');
logger.info('5');
logger.info('6');
logger.info('7');
logger.info('8');
logger.info('9');
logger.info('10');
logger.info('11');
logger.info('12');
logger.info('13');
logger.info('14');
logger.info('15');
logger.info('16');
logger.info('17');
logger.info('18');
logger.info('19');

package.json:

{
  "name": "elktail_test",
  "version": "0.0.0",
  "description": "",
  "main": "app.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "flaschenpost": "0.14.2"
  }
}

install: npm install

In one window, run: elktail application.name:elktail_test

In the other window, run:

ELASTIC_URL=<your Elasticsearch URL>
node app.js | node_modules/.bin/flaschenpost-normalize | node_modules/.bin/flaschenpost-to-elastic

Usually, you'll see output like:

2016-11-14T06:50:29.821Z elktail_test info 1 %metadata
2016-11-14T06:50:29.824Z elktail_test info 4 %metadata
2016-11-14T06:50:29.824Z elktail_test info 2 %metadata
2016-11-14T06:50:29.824Z elktail_test info 3 %metadata
2016-11-14T06:50:29.824Z elktail_test info 5 %metadata
[...]

Try this a few times, and you should see situations where only a subset of the messages is displayed by elktail.

knes1 commented 7 years ago

Thank you for the test app. I have successfully reproduced the problem. Unfortunately, the solution is tricky. Due to the distributed nature of Elasticsearch, inbound log messages can be indexed (stored) by different shards or even nodes, so some newer log messages can actually be stored sooner than some older ones. As a result, older messages that were indexed "slowly" can end up missing from the output when tailing the logs.

I have resolved this by using a "query time window". Elktail used to work by issuing requests to Elasticsearch to fetch log messages newer than the latest log message it had received previously. Now it also queries, within a specified time window, for older messages it has not seen yet. Unfortunately, there are two downsides to this approach (see the sketch after the list):

1) There's no exact way to determine a good value for the time window; it depends on the inbound log rate and Elasticsearch's processing power.
2) Messages that were missed by the previous version of the algorithm will now appear, but they will appear out of order.
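For illustration only, a minimal Go sketch of such a query-time-window loop. fetchWindow and printDoc are hypothetical helpers, and the deduplication by document ID is an assumption about how already-printed entries could be suppressed, not necessarily the code that ships in elktail:

```go
package tailsketch

import "time"

// Doc is a simplified stand-in for an indexed log document.
type Doc struct {
	ID        string
	Timestamp time.Time
	Message   string
}

// fetchWindow is a hypothetical helper assumed to return all documents with
// from <= Timestamp <= to, ordered oldest first.
func fetchWindow(from, to time.Time) []Doc { return nil }

// printDoc is a hypothetical helper that formats and prints one document.
func printDoc(d Doc) {}

// followWithWindow re-queries a trailing time window on every pass so that
// documents indexed "late" still get printed (out of order), and uses document
// IDs to avoid printing anything twice.
func followWithWindow(window time.Duration) {
	printed := map[string]bool{} // IDs already written to the output
	newest := time.Now().Add(-window)

	for {
		docs := fetchWindow(newest.Add(-window), time.Now())
		for _, d := range docs {
			if printed[d.ID] {
				continue // already shown on an earlier pass
			}
			printed[d.ID] = true
			printDoc(d)
			if d.Timestamp.After(newest) {
				newest = d.Timestamp
			}
		}
		// In a real implementation, IDs older than the window could be pruned
		// from the printed map to keep memory bounded.
		time.Sleep(time.Second)
	}
}
```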

I will test this algorithm a bit more and then issue an update. If anyone has ideas on how to resolve the issues mentioned above using a different approach, please chime in.

knes1 commented 7 years ago

@stefanheine I released a new version (0.1.7), could you do another test to see if it resolves your issue?

ghost commented 7 years ago

I tested the new version 0.1.7 and it works fine for me; it's not dropping any messages any more.

Thanks for the fix!