After looking into the code, it seems it's possible for this to sometimes happen if there are multiple messages at exactly the same timestamp and elktail is running in follow mode. I started working on a fix.
Unfortunately, this seems to be more difficult than it appears. Elasticsearch apparently does not guarantee the proper ordering of log messages that arrive within the same millisecond, and it does not support timestamps at a resolution finer than a millisecond. https://github.com/elastic/elasticsearch/issues/10005
I will attempt to resolve this when tailing the log using the algorithm described below (a rough sketch of the bookkeeping follows the list):
1) Tailing the log works by issuing query after query to Elasticsearch; each query continues from the timestamp of the last log entry of the previous query.
2) After a query completes and its entries are printed, check whether the results end with multiple log entries sharing the same timestamp; if so, remember them.
3) Issue the next query, continuing from (and including) the timestamp of the previous last log entry.
4) Remove the entries remembered in step 2) from the new results and continue the algorithm from step 2).
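To make the bookkeeping in steps 2) to 4) concrete, here is a minimal Go sketch under stated assumptions: it is not the actual elktail code, and LogEntry, tailStep and the sample document IDs are made up for the example. It remembers the entries that share the boundary timestamp and filters them out of the next, overlapping query:
package main

import "fmt"

// LogEntry is a simplified stand-in for a log document returned by Elasticsearch;
// elktail's real entries carry more fields.
type LogEntry struct {
	ID        string // unique document ID
	Timestamp int64  // millisecond timestamp
	Message   string
}

// tailStep applies steps 2) to 4): given the entries returned by the latest query
// (which started at, and included, the previous boundary timestamp), drop the
// entries already printed last time, then remember everything that shares the new
// last timestamp so it can be filtered out of the next query.
func tailStep(batch []LogEntry, seenAtBoundary map[string]bool) (fresh []LogEntry, nextBoundary map[string]bool, lastTimestamp int64) {
	for _, e := range batch {
		if seenAtBoundary[e.ID] {
			continue // already printed in the previous iteration
		}
		fresh = append(fresh, e)
	}
	nextBoundary = map[string]bool{}
	if len(batch) == 0 {
		return fresh, nextBoundary, 0
	}
	lastTimestamp = batch[len(batch)-1].Timestamp
	for _, e := range batch {
		if e.Timestamp == lastTimestamp {
			nextBoundary[e.ID] = true // entries sharing the boundary timestamp
		}
	}
	return fresh, nextBoundary, lastTimestamp
}

func main() {
	// First query: two of the three entries share the last timestamp.
	batch1 := []LogEntry{{"a", 100, "first"}, {"b", 105, "second"}, {"c", 105, "third"}}
	fresh, boundary, from := tailStep(batch1, nil)
	fmt.Println("print:", fresh, "next query starts at:", from)

	// Second query starts at (and includes) timestamp 105, so "b" and "c" come back again
	// but are not printed a second time.
	batch2 := []LogEntry{{"b", 105, "second"}, {"c", 105, "third"}, {"d", 106, "fourth"}}
	fresh, _, from = tailStep(batch2, boundary)
	fmt.Println("print:", fresh, "next query starts at:", from)
}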
@stefanheine could you test if the latest version that's committed to the repo fixes the issue of missing log entries for you?
I installed elktail using: go get github.com/knes1/elktail and then started the newly built version.
I can still reproduce the issue in that version.
that binary is still reporting its version as 0.1.5. do I need to do something else to get to 0.1.6?
I suspect the missing entries happen because some logs are "arriving late" to elastic (i.e. elktail has already printed and moved past a timestamp by the time a log entry with that timestamp gets indexed by elastic). If that's the issue, it could be alleviated by introducing a bit of delay to give elastic time to process the newly arrived logs.
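Purely as an illustration of that idea (this is not elktail's current code; buildRange is a hypothetical helper, and the @timestamp field name is assumed, as in a typical Logstash setup), the upper bound of the timestamp range could be capped at now minus a small grace period, so slowly indexed documents still fall inside a later query:
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// buildRange builds an Elasticsearch range filter on @timestamp that starts after
// the last entry we printed and stops "grace" before now, giving slow-to-index
// documents time to land before we query past them.
func buildRange(lastSeen time.Time, grace time.Duration) map[string]interface{} {
	upper := time.Now().Add(-grace)
	return map[string]interface{}{
		"range": map[string]interface{}{
			"@timestamp": map[string]interface{}{
				"gt":  lastSeen.Format(time.RFC3339Nano),
				"lte": upper.Format(time.RFC3339Nano),
			},
		},
	}
}

func main() {
	lastSeen := time.Now().Add(-5 * time.Second)
	body, _ := json.MarshalIndent(buildRange(lastSeen, 2*time.Second), "", "  ")
	fmt.Println(string(body)) // the range clause that would go into the query body
}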
@stefanheine could you describe how you collect the logs in your system? Are they collected by logstash from a single file? Are there multiple hosts involved? Do hosts have their clocks synchronized? If you limit the query to only a single log file/log source, does the issue still happen?
Here is how to reproduce it:
app.js:
'use strict';
// https://www.npmjs.com/package/flaschenpost
const flaschenpost = require('flaschenpost');
const logger = flaschenpost.getLogger();
logger.info('0');
logger.info('1');
logger.info('2');
logger.info('3');
logger.info('4');
logger.info('5');
logger.info('6');
logger.info('7');
logger.info('8');
logger.info('9');
logger.info('10');
logger.info('11');
logger.info('12');
logger.info('13');
logger.info('14');
logger.info('15');
logger.info('16');
logger.info('17');
logger.info('18');
logger.info('19');
package.json:
{
  "name": "elktail_test",
  "version": "0.0.0",
  "description": "",
  "main": "app.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "flaschenpost": "0.14.2"
  }
}
install:
npm install
in one window, run:
elktail application.name:elktail_test
in the other window, run:
ELASTIC_URL=<your path to elastic>
node app.js | node_modules/.bin/flaschenpost-normalize | node_modules/.bin/flaschenpost-to-elastic
usually, you'll see output like:
2016-11-14T06:50:29.821Z elktail_test info 1 %metadata
2016-11-14T06:50:29.824Z elktail_test info 4 %metadata
2016-11-14T06:50:29.824Z elktail_test info 2 %metadata
2016-11-14T06:50:29.824Z elktail_test info 3 %metadata
2016-11-14T06:50:29.824Z elktail_test info 5 %metadata
[...]
try this a few times, and you should see situations where only a subset of the messages are displayed by elktail.
Thank you for the test app. I have successfully reproduced the problem. Unfortunately, it seems the solution is tricky. Due to the distributed nature of Elasticsearch, inbound log messages can be indexed (stored) by different shards or even nodes, so some newer log messages can actually be stored sooner than older ones. As a result, older messages that were indexed slowly could end up missing from the output when tailing the logs.
I have resolved this by using a "query time window". Elktail used to work by issuing requests to Elasticsearch to fetch log messages newer than the latest log message it had received previously. Now it also queries, within a specified time window, for older messages it has not seen yet (a rough sketch follows the list). Unfortunately, this approach has two downsides:
1) There is no exact way to determine what a good value for the time window should be; it depends on the inbound log rate and on Elasticsearch's processing power.
2) Messages that were missed by the previous version of the algorithm will now appear, but they will appear out of order.
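Here is a minimal, self-contained Go sketch of the time-window idea (Doc, windowTail and the sample documents are illustrative, not the code in the repo). Each pass re-queries everything newer than the latest seen timestamp minus the window and prints only document IDs it has not printed before, which is why late-indexed entries do show up, but out of order:
package main

import "fmt"

// Doc is a simplified log document; the real entries would come from Elasticsearch hits.
type Doc struct {
	ID        string
	Timestamp int64 // milliseconds
	Message   string
}

// windowTail re-queries a window that extends back in time and deduplicates by
// document ID so already-printed entries are not shown twice.
type windowTail struct {
	printed map[string]bool
	latest  int64
	window  int64 // how far back to re-query, in ms; picking a good value is the hard part
}

// queryFrom returns the lower bound of the next query.
func (w *windowTail) queryFrom() int64 {
	from := w.latest - w.window
	if from < 0 {
		from = 0
	}
	return from
}

// process prints the documents of one query pass that have not been printed before.
func (w *windowTail) process(batch []Doc) {
	for _, d := range batch {
		if w.printed[d.ID] {
			continue // already shown on an earlier pass
		}
		w.printed[d.ID] = true
		if d.Timestamp > w.latest {
			w.latest = d.Timestamp
		}
		fmt.Println(d.Timestamp, d.Message)
	}
}

func main() {
	w := &windowTail{printed: map[string]bool{}, window: 2000}

	// First pass: the document stamped 1000 was indexed slowly and is not returned yet.
	w.process([]Doc{{"a", 900, "early"}, {"c", 1100, "newer"}})

	// Second pass re-queries from (latest - window); the late document "b" now shows up,
	// after "c", i.e. out of order, which is the second downside mentioned above.
	fmt.Println("next query from:", w.queryFrom())
	w.process([]Doc{{"a", 900, "early"}, {"b", 1000, "late arrival"}, {"c", 1100, "newer"}})
}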
I will test this algorithm a bit more and then issue an update. If anyone has ideas on how to resolve the above-mentioned issues with a different approach, please chime in.
@stefanheine I released a new version (0.1.7), could you do another test to see if it resolves your issue?
I tested the new version 0.1.7 and it works ok for me, not dropping any messages any more.
thanks for the fix!
I have a case where elktail is losing messages. My application is sending a bulk of messages into elastic, and I can verify that they arrive (in kibana).
at the same time I have elktail running as:
elktail application.name:\"@hpe/quixy\"
the good output would be:
sometimes some lines are lost (note that
Registered route get:/quixy/quixy
is missing from the output). When I add more filters, like in:
elktail application.name:\"@hpe/audit-server\" OR application.name:\"@hpe/source\" OR application.name:\"@hpe/hellfire-support\" OR application.name:\"@hpe/quixy\"
I see more cases of losing messages, so it seems like the filter handling is related to this issue.