simagix / hatchet

MongoDB JSON Log Analyzer
Apache License 2.0
165 stars 22 forks source link

Intermittent 'log format not supported' issues #19

Closed TheQueenIsDead closed 1 year ago

TheQueenIsDead commented 1 year ago

Hi, thank you for your work on Keyhole and Hatchet, both tools have proven incredibly valuable for me when identify areas of improvement within Mongo.

I have setup a Github actions pipeline to download server logs periodically using the atlas cli:

atlas logs download <host> mongodb.gz

After downloading the log, I attempt to generate a Hatchet report, and then run the web server to collect data from the api:

# Generate the hatchet, run report build to completion.
hatchet /tmp/mongodb.log.gz
# Once done run the API server.
hatchet -web &

However, every so often the hatchet fails to process a log file at random with the following error:

# 'broken' is the name of the file that I pulled from Github as an artifact
2023/03/28 17:04:00 Downloads/hatchet v0.3.4-20230327
2023/03/28 17:04:00 processing broken-00-01-mongodb.log.gz
2023/03/28 17:04:00 hatchet name is broken_00_01_mongodb_169960
2023/03/28 17:04:00 log format not supported

I understand that this is likely an issue with the file generation, or contents of the file itself, meaning that there is a chance that the issue lies in the Atlas CLI. However, I was wondering if there was any way to get more specific information about this error, as running in verbose mode did not yield any extra information.

Any help appreciated if you happen to have ran into this issue before.

Thanks

Versions

$ hatchet -version
Downloads/hatchet v0.3.4-20230327

$ go version
go version go1.20.2 linux/amd64

$ atlas process list
ID                                                 REPLICA SET NAME       SHARD NAME   VERSION
atlas-xxx-i-00-00.yyy.mongodb.net:27017       atlas-xxx-shard-0                4.4.19
atlas-xxx-shard-00-00.yyy.mongodb.net:27017   atlas-xxx-shard-0                4.4.19
atlas-xxx-shard-00-01.yyy.mongodb.net:27017   atlas-xxx-shard-0                4.4.19
atlas-xxx-shard-00-02.yyy.mongodb.net:27017   atlas-xxx-shard-0                4.4.19
TheQueenIsDead commented 1 year ago

I cloned the project and added a very rudimentary debug line just before the "log format not supported" line in order to debug the data being processed:

$ git diff
diff --git a/logv2.go b/logv2.go
index 2b25f06..4b04eb1 100644
--- a/logv2.go
+++ b/logv2.go
@@ -192,6 +192,7 @@ func (ptr *Logv2) Analyze(filename string) error {
                }
                str := string(buf)
                if !regexp.MustCompile("^{.*}$").MatchString(str) {
+                       fmt.Printf("Log format not supported: %s\n", str)
                        return errors.New("log format not supported")
                }
                break

It looks like this error is potentially caused by an excessively long aggregate query log line, here is the redacted output

{
  "t":{"$date":"2023-03-27T03:00:46.806+00:00"},
  "s":"I",  
  "c":"COMMAND",  
  "id":51803,   
  "ctx":"conn121577",
  "msg":"Slow query",
  "attr":{
    "type":"command",
    "ns":"xxx.yyy",
    "appName":"xxx",
    "command":{
      "aggregate":"yyy",
      "pipeline":[
        {"$match":
          {"aaa_id":"8beeda29-**[...]**-c6bce1cf4c1b",
           "bbb_id":{"$in":["c9651e43-[...]-5ea13e02ba50","4a

I have formatted the above with indentation that was not present, but importantly it shows that the log line was cut off at "4a. The value of bbb_id was a very long array of UUID's

To express how long this line is, copying the new debug line and piping into wc shows that it is ~ 4097 characters long:

echo '<output>' | wc
      1       5    4097
TheQueenIsDead commented 1 year ago

More digging! I whipped up a python script to try find the log line in the main file:

This can be seen here, though it is not important:

Python log parsing script ```python import json with open("broken-00-01-mongodb.log", 'r') as fh: for line in fh.readlines(): try: j = json.loads(line) date = j['t']['$date'] # Narrow down the results if date == '2023-03-27T03:00:46.806+00:00': l = line.strip() start = l.startswith('{') end = l.endswith('}') # Dump info about the line print(l) print(len(l)) print(f"{start=}, {end=}") except Exception as e: if isinstance(e, KeyError): # print("KeyError:", e.args) continue else: # Raise an exception on JSON parse raise(e) ```

What is interesting is that the length of the line that causes issues in hatchet is actually 11350 characters. It parses as valid JSON, and begins and ends with the correct { and } characters.

Delving back into the code I see that the bufio Readline function is used to iterate the file: https://github.com/simagix/hatchet/blob/main/logv2.go#L187

Reading the documentation for that function, it seems that it is possible for the function to return only the beginning of the line, and an isPrefix boolean is returned to denote this. This is currently unassigned in the code, but when I make the following change locally, I can confirm that this flag is being set in the case of my log:

diff --git a/logv2.go b/logv2.go
index 2b25f06..75e7e77 100644
--- a/logv2.go
+++ b/logv2.go
@@ -130,6 +130,7 @@ type HatchetInfo struct {
 func (ptr *Logv2) Analyze(filename string) error {
        var err error
        var buf []byte
+       var prefix bool
        var file *os.File
        var reader *bufio.Reader
        ptr.filename = filename
@@ -184,9 +185,12 @@ func (ptr *Logv2) Analyze(filename string) error {
        }

        for { // check if it is in the logv2 format
-               if buf, _, err = reader.ReadLine(); err != nil { // 0x0A separator = newline
+               if buf, prefix, err = reader.ReadLine(); err != nil { // 0x0A separator = newline
                        return errors.New("no valid log format found")
                }
+               if prefix {
+                       return errors.New("log line too long to parse")
+               }
                if len(buf) == 0 {
                        continue
                }

Running in development mode, the output is such:

2023/03/28 20:21:11 simagix/hatchet devel-20230328
2023/03/28 20:21:11 processing test/broken-00-01-mongodb.log   
2023/03/28 20:21:11 hatchet name is broken_00_01_mongodb_03980e
2023/03/28 20:21:11 log line too long to parse

I'm not super au fait with IO buffers, so I am uncertain how simple this would be to fix