Open: freehck opened this issue 1 year ago
Workaround
Since splitting into small time ranges eliminates the problem, I wrote a bash workaround using the Monte Carlo method.
############### BEGIN OF MONTE CARLO METHOD #################
# time functions
export TZ=UTC
function date2seconds() {
  date -d"$1" +%s
}
function seconds2iso8601date() {
  date -Iseconds -d@"$1"
}
# checksum calculation
function md5sum_hashonly() {
  md5sum "$1" | cut -d' ' -f1
}
# logcli function
function logcli-wrapper() {
  logcli query --quiet --forward --limit=0 --org-id=ORG_ID --addr=LOKI_ADDR -o raw "$@"
}
function logcli-monte-carlo-rec() {
  #### PREAMBLE ####
  # input parameters
  local query="$1" from_ts="$2" to_ts="$3" logfile="$4" ranges="${5-24}" checksum="${6-}"
  # defaults
  local convergence_coefficient=6
  # calculations
  local range_length_seconds=$(( to_ts - from_ts ))
  local range_interval=$(( range_length_seconds / ranges ))
  # temp vars
  local range_from_ts range_to_ts range_from_date range_to_date
  local temp_log_file new_checksum
  #### CODE ####
  # make temp file to store log
  temp_log_file=$(mktemp)
  trap "rm -f $temp_log_file" ERR INT EXIT
  # calculate ranges for chunks
  for range_from_ts in $(seq "$from_ts" "$range_interval" "$to_ts"); do
    if [ "$range_from_ts" -ge "$to_ts" ]; then break; fi
    range_to_ts=$(( range_from_ts + range_interval ))
    if [ "$range_to_ts" -gt "$to_ts" ]; then range_to_ts=$to_ts; fi
    range_from_date=$(seconds2iso8601date "$range_from_ts")
    range_to_date=$(seconds2iso8601date "$range_to_ts")
    # download logs chunk by chunk
    logcli-wrapper "$query" --from="$range_from_date" --to="$range_to_date" >>"$temp_log_file"
  done
  # compare checksums
  new_checksum=$(md5sum_hashonly "$temp_log_file")
  if [ "$new_checksum" = "$checksum" ]; then
    # on success, move the log to its final destination
    mv "$temp_log_file" "$logfile"
  else
    # on failure, remove the temp file and start a new iteration with more ranges
    rm -f "$temp_log_file"
    logcli-monte-carlo-rec "$query" "$from_ts" "$to_ts" "$logfile" "$(( ranges * convergence_coefficient ))" "$new_checksum"
  fi
}
function logcli-monte-carlo() {
  local query="$1" from_date="$2" to_date="$3" logfile="$4" ranges="${5-24}"
  logcli-monte-carlo-rec "$query" "$(date2seconds "$from_date")" "$(date2seconds "$to_date")" "$logfile" "$ranges"
}
############### END OF MONTE CARLO METHOD #################
Usage:
export TZ=UTC
logcli-monte-carlo <QUERY> <ISODATE_FROM> <ISODATE_TO> <LOGFILE> [<HOW_MANY_RANGES>]
Example:
export TZ=UTC
logcli-monte-carlo '{pod="POD_NAME"}' "2023-09-01T00:00:00" "2023-09-02T00:00:00" "log"
This code is well tested for 1-day queries. By default it splits the range into 24 smaller ranges; each subsequent iteration multiplies the number of ranges by 6 (24 → 144 → 864 → ...), then checks whether the checksum of the collected log has changed compared to the previous iteration, and if it hasn't, saves the log file to the specified place.
NB:
1) The code doesn't contain checks for logcli's non-zero return codes. Run it strictly with set -e.
2) You must have the tzdata package installed. An explicit export of the TZ variable is also strongly recommended.
3) If you query a time range wider than 1 day, or if your query returns more than about 350 MB of data (my case), increase the number of ranges to split in advance using the 5th parameter of logcli-monte-carlo; see the sketch after this list.
4) Never set the number of ranges to a small value. The Monte Carlo method works well only with large numbers.
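For example, a hypothetical invocation for a 7-day window pre-split into 168 one-hour ranges (the label, dates and file name below are placeholders):
set -e            # required: logcli exit codes are not checked inside the functions
export TZ=UTC
logcli-monte-carlo '{pod="POD_NAME"}' "2023-09-01T00:00:00" "2023-09-08T00:00:00" "week.log" 168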
I think I see the same problem. When I query a 7-day window, I see one log line; when I split it up into three 2-day windows, I see two log lines.
I am on version:
$ /usr/bin/loki --version
loki, version 2.8.5 (branch: HEAD, revision: 03cd6c82b)
build user: root@bdb1fa196fd7
build date: 2023-09-14T17:17:19Z
go version: go1.20.7
platform: linux/amd64
Experiencing the same problem.
Big important update: this doesn't happen with the following Loki configuration:
loki:
  limits_config:
    split_queries_by_interval: 0
This would probably be a sane default for Loki until this bug is fixed.
Describe the bug
If a query covers a wide time range, some log lines can be missed. Manually splitting the query into smaller time ranges eliminates the problem. I think it's a problem in the Loki querier.
To Reproduce
I managed to reproduce the problem only on a heavily loaded production infrastructure. There will be a lot of bash code below, so take a deep breath.
First of all, let's define an additional function to simplify reading the code that follows:
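A minimal sketch of such a wrapper, assuming the same logcli flags as in the workaround above (ORG_ID and LOKI_ADDR are placeholders):
export TZ=UTC
# thin wrapper so the queries below stay short
function logcli-wrapper() {
  logcli query --quiet --forward --limit=0 --org-id=ORG_ID --addr=LOKI_ADDR -o raw "$@"
}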
Now here's a bunch of queries fetching the log for one specific day (24h):
1) the direct way to query the full-day log in one pass:
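A minimal sketch of that single-pass query, assuming the wrapper above and a placeholder {pod="POD_NAME"} selector:
# one query over the whole 24h range
logcli-wrapper '{pod="POD_NAME"}' --from="2023-09-01T00:00:00Z" --to="2023-09-02T00:00:00Z" > full-day.log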
2) a manual split querying the same range by hours, i.e. 24 queries for the same day (24h):
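A minimal sketch of the hourly split, under the same assumptions:
# 24 one-hour queries appended into a single file
rm -f full-day-hours.log
start=$(date -d "2023-09-01T00:00:00" +%s)
for h in $(seq 0 23); do
  from=$(date -Iseconds -d @"$(( start + h * 3600 ))")
  to=$(date -Iseconds -d @"$(( start + (h + 1) * 3600 ))")
  logcli-wrapper '{pod="POD_NAME"}' --from="$from" --to="$to" >> full-day-hours.log
done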
3) a manual split querying the same range in smaller chunks of only 10 minutes, i.e. 24*6 queries for the same day (24h):
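And a minimal sketch of the 10-minute split, under the same assumptions:
# 144 ten-minute queries appended into a single file
rm -f full-day-10mins.log
start=$(date -d "2023-09-01T00:00:00" +%s)
for i in $(seq 0 143); do
  from=$(date -Iseconds -d @"$(( start + i * 600 ))")
  to=$(date -Iseconds -d @"$(( start + (i + 1) * 600 ))")
  logcli-wrapper '{pod="POD_NAME"}' --from="$from" --to="$to" >> full-day-10mins.log
done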
Here are the results of these three commands:
As you can see, the log full-day.log returned by the first query has 679 lines (or 218.49 KB) less than full-day-hours.log or full-day-10mins.log returned by the split (second and third) queries. That's about 0.055% of the full amount of the log, but for B2B purposes even such a small percentage of missed logs is a problem. A visual analysis of the differences between the collected logs shows that the missing lines are not at the beginning or at the end, but everywhere. For example, the lines I was originally looking for were actually in the middle of the day.
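One way to reproduce such a comparison, assuming the three log files produced above (line/byte counts first, then the lines missing from the one-pass result):
# sizes of the three collected logs
wc -lc full-day.log full-day-hours.log full-day-10mins.log
# the two split results should be identical to each other
cmp full-day-hours.log full-day-10mins.log
# lines present in the hourly result but missing from the one-pass result
comm -13 <(sort full-day.log) <(sort full-day-hours.log) | less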
Expected behavior
I expect a direct query over the full time range to return all the log lines, but in fact I see a difference between full-day.log and full-day-hours.log.
Environment: We use an HA Loki installation in Kubernetes with a Minio backend.
80-loki.tf
helm-loki-values.yaml