google / mtail

extract internal monitoring data from application logs for collection in a timeseries database
Apache License 2.0
3.86k stars 378 forks source link

Runtime errors should be more visible to users #221

Open jakirpatel opened 5 years ago

jakirpatel commented 5 years ago

Details : Problem : Metrics are not getting exposed.

Mtail version: mtail version v3.0.0-rc16 git revision 77eeba903db0ffd90e29c1ee778fdefb0112bfab go version go1.11

Filesystem: NFS

Logs from the mtail:

I0325 01:19:17.267495   14907 file.go:171] /home/coherence_app_log/jmxreports/2019032510-memory-status.txt: file type does not support deadline
I0325 01:19:17.267802   14907 tail.go:246] Tailing /home/coherence_app_log/jmxreports/2019032510-memory-status.txt
I0325 01:19:17.267993   14907 file.go:112] open succeeded /home/coherence_app_log/jmxreports/2019032510-memory-status.txt
I0325 01:19:17.268001   14907 tail.go:236] Adding a file watch on "/home/coherence_app_log/jmxreports/2019032510-memory-status.txt"
I0325 01:19:17.268005   14907 log_watcher.go:277] Resolved path for lookup "/home/coherence_app_log/jmxreports/2019032510-memory-status.txt"
I0325 01:19:17.590447   14907 log_watcher.go:174] No modtime change for /home/coherence_app_log/jmxreports/2019032510-memory-status.txt, no send
I0325 01:19:17.820442   14907 file.go:171] /home/coherence_app_log/jmxreports/2019032510-memory-status.txt: file type does not support deadline

Log type:

1962    ?? 3 25 10:16:32 JST 2019   597277688   246 ParNew  2243    1.0 58409   18.0    597137142   19.0    4093640704  4294967296  4093640704  1635710024
1962    ?? 3 25 10:16:32 JST 2019   597273217   247 ParNew  2237    2.0 58401   37.0    597176319   17.0    4093640704  4294967296  4093640704  1252907928
1962    ?? 3 25 10:16:32 JST 2019   597270204   248 ParNew  2242    1.0 58623   15.0    597112242   15.0    4093640704  4294967296  4093640704  1715023800
1962    ?? 3 25 10:16:32 JST 2019   597267193   249 ParNew  2232    2.0 58527   37.0    597159391   18.0    4093640704  4294967296  4093640704  1267338656
1962    ?? 3 25 10:16:32 JST 2019   597262215   250 ParNew  2228    1.0 58006   19.0    597092171   19.0    4093640704  4294967296  4093640704  1730664056
1962    ?? 3 25 10:16:32 JST 2019   597258907   251 ParNew  2233    1.0 57737   23.0    597098542   23.0    4093640704  4294967296  4093640704  1639189488
1962    ?? 3 25 10:16:32 JST 2019   597255899   252 ParNew  2242    1.0 58491   17.0    597095164   16.0    4093640704  4294967296  4093640704  1697487040

Mtail program:

counter total_heap_committed by node_id, gc_name
counter total_heap_used by node_id, gc_name
counter total_heap_max by node_id, gc_name
counter total_heap_init by node_id, gc_name

def coherenceMemoryStatus {
    /[\d]+\s[\?]+\s[\d]+\s[\d]+\s[\d]+:[\d]+:[\d]+\s[A-Za-z]+\s[\d]+\s[\d]+\s(?P<node_id>[\d]+)\s(?P<gc_name>[A-Za-z]+)\s[\d]+\s[\d+\.\d+]+\s[\d]+\s[\d+\.\d+]+\s+[\d]+\s[\d+\.\d+]+\s(?P<heap_committed>[\d]+)\s(?P<heap_init>[\d]+)\s(?P<heap_max>[\d]+)\s(?P<heap_used>[\d]+)/ {
        next
    }
}

@coherenceMemoryStatus {
    total_heap_committed[$node_id][$gc_name] += $heap_committed
    total_heap_used[$node_id][$gc_name] += $heap_used
    total_heap_max[$node_id][$gc_name] += $heap_max
    total_heap_init[$node_id][$gc_name] += $heap_init
}

Some debug steps I used:

I ran the simple line count program with the same file using --one_shot parameter. It given me the exact result.

Also locally I was able to parse the file and was able to get the result. But with NFS filesystem seems I was not able to get the similar result.

jaqx0r commented 5 years ago

rc16 seems pretty old. Can you run rc29 (just released today) and see if you can reproduce the problem?

Also, please give the full commandline you used (one of the first log messages in the mtail INFO log.)

jakirpatel commented 5 years ago

@jaqx0r Still not getting the metrics. Version: Build: mtail version v3.0.0-rc29 git revision b5aa95d10fb3af7b843e3814bc65aedd983425d9 go version go1.11.6 go arch amd64 go os linux

The command line:

I0325 11:23:58.965017   32679 main.go:95] Commandline: ["/home/weblogic/test/mtail" "-poll_interval" "1s" "-port" "3102" "-progs" "/home/weblogic/test/old.mtail" "-logs" "/home/weblogic/test/memory-status.txt" "-logtostderr"]

There are no errors while reading. Even I can see X number of lines read. But with the metrics I cant see any counter incremented.

Some more interesting result I got.

Commandline: ["/home/weblogic/test/mtail" "-poll_interval" "1s" "-port" "3102" "-progs" "/home/weblogic/test/old.mtail" "-logs" "/home/weblogic/test/memory-status.txt" "-logtostderr" "-v" "2"]

While logs get written I got following log from mtail:

I0325 11:43:09.947934   21234 log_watcher.go:145] Update fi
I0325 11:43:09.947940   21234 log_watcher.go:130] stat
I0325 11:43:09.949704   21234 log_watcher.go:170] sending update for /home/weblogic/test/memory-status.txt
I0325 11:43:09.949731   21234 log_watcher.go:174] No modtime change for /home/weblogic/test/mtail, no send
I0325 11:43:09.949739   21234 log_watcher.go:174] No modtime change for /home/weblogic/test/old.mtail, no send
I0325 11:43:09.949744   21234 log_watcher.go:145] Update fi
I0325 11:43:09.949749   21234 log_watcher.go:130] stat
I0325 11:43:09.949766   21234 log_watcher.go:145] Update fi
I0325 11:43:09.949778   21234 tail.go:284] Event type watcher.Event{Op:2, Pathname:"/home/weblogic/test/memory-status.txt"}
I0325 11:43:09.949799   21234 tail.go:188] handleLogUpdate /home/weblogic/test/memory-status.txt
I0325 11:43:09.949819   21234 file.go:139] Path /home/weblogic/test/memory-status.txt already being watched, and inode not changed.
I0325 11:43:09.949825   21234 file.go:143] doing the normal read
I0325 11:43:09.949843   21234 file.go:171] /home/weblogic/test/memory-status.txt: file type does not support deadline
I0325 11:43:09.951858   21234 file.go:174] Read count 4096 err <nil>
I0325 11:43:09.951906   21234 file.go:221] Line sent
I0325 11:43:09.951932   21234 file.go:221] Line sent
I0325 11:43:09.952095   21234 file.go:221] Line sent
I0325 11:43:09.952110   21234 file.go:221] Line sent
I0325 11:43:09.952122   21234 file.go:221] Line sent
I0325 11:43:09.952209   21234 file.go:221] Line sent
I0325 11:43:09.952235   21234 file.go:221] Line sent
I0325 11:43:09.952252   21234 file.go:221] Line sent
I0325 11:43:09.952338   21234 file.go:221] Line sent
I0325 11:43:09.952363   21234 file.go:221] Line sent
I0325 11:43:09.952379   21234 file.go:221] Line sent

I forgot to mention that /home/weblogic/test/memory-status.txt is a symlink to other file.

After looking into the problem more close seems it maybe the issue of NFS file system. ?? I think enabling the soft mount option will help on this.

For NFS mount I am using following options:

(rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=100.66.81.33,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=x.x.x.x)

Some of references I got : https://www.reddit.com/r/golang/comments/8npwin/how_to_implement_timeout_in_file_reads/

jakirpatel commented 5 years ago

@jaqx0r

I have applied the soft option while mounting the NFS. But seems the problem of reading the files still persist. Is there any recommendation that how the NFS filesystem should be mounted to read the files through mtail ?

Your inputs will be helpful.

jaqx0r commented 5 years ago

I don't think I can at this time, it's not clear to me yet what the problem is. My guess so far is that the files are not being read after polling. I don't believe at this point that changing NFS mount options will fix it.

On Mon., 25 Mar. 2019, 19:01 Jakir Patel, notifications@github.com wrote:

@jaqx0r https://github.com/jaqx0r

I have applied the soft option while mounting the NFS. But seems the problem of reading the files still persist. Is there any recommendation that how the NFS filesystem should be mounted to read the files through mtail ?

Your inputs will be helpful.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/google/mtail/issues/221#issuecomment-476091632, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5b-zLv5c6N06BRgh4QLpc92TjSs8Znks5vaIJlgaJpZM4cF6z- .

jakirpatel commented 5 years ago

@jaqx0r Might be regenerating the problem will be more good.

jakirpatel commented 5 years ago

@jaqx0r

Seems it was the regex issue. For somehow the regex was not matching. Please resolve it. Thank you for your support.

There was japanese character in the log timestamp. However with the regex on my local machine I was able to see it as ?

I think I could have tested the regex before direct with the logfile.

jaqx0r commented 5 years ago

Um.. what regex issue?

On Tue, 26 Mar 2019 at 20:13, Jakir Patel notifications@github.com wrote:

@jaqx0r https://github.com/jaqx0r

Seems it was the regex issue. For somehow the regex was not matching. Please resolve it. Thank you for your support.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/google/mtail/issues/221#issuecomment-476536382, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5b-7yYV-pUN-cFGWpzvukrDPCL8ezYks5vaeS8gaJpZM4cF6z- .

jakirpatel commented 5 years ago

@jaqx0r

Basically my regex was not matching with log format. because of that it was not passing from regex block. That was reason my all counters never incremented. However when I have added simple line count counter it got incremented (with no regex).

What I think mtail should provide the info of NO MATCHING lines with regex as well. Some counter maybe so at first place we ll get to know how many lines get matched with regex.

But I observed one thing in the logs I can still see this,


I0325 11:43:09.949843   21234 file.go:171] /home/weblogic/test/memory-status.txt: file type does not support deadline
I0325 11:43:09.951858   21234 file.go:174] Read count 4096 err <nil>
jaqx0r commented 5 years ago

https://github.com/google/mtail/blob/master/docs/Testing.md describes one way to test your programs on snippets of logs. It does require making changes to the code to set it up, however but lets you set expectations on what the counters should be after processing a piece of log with your program.

In the mtail language, the otherwise keyword can be used to catch logs that "would otherwise not match" anything in the same block: https://github.com/google/mtail/blob/master/docs/Language.md#otherwise-clauses (the else keyword is similar) You could use otherwise to increment your own counter when no lines match.

In the olden days, mtail did have a special exported counter and a ring buffer per program for debugging log lines that didn't match. The ring buffer turned out to be expensive, and we'd made the system spammy to encourage people to try to handle every single log line for coverage, but that caused programs to become more complex and computationally expensive, so I didn't include it in the open sourcing of mtail. But it is easy to add a counter back for counting the number of times each program didn't take any action at all on a line, so I will do that.

Your log messages look like a red herring here. You've said that you're mounting over NFS, and I am certain that NFS doesn't support a read timeout. You get the following 'Read count' indicating that the read was successful and no error, so it's all working as intended as far as I can tell, although perhaps too noisy.

Going back to your program, after:

@coherenceMemoryStatus {
    total_heap_committed[$node_id][$gc_name] += $heap_committed
    total_heap_used[$node_id][$gc_name] += $heap_used
    total_heap_max[$node_id][$gc_name] += $heap_max
    total_heap_init[$node_id][$gc_name] += $heap_init
}

append these lines

otherwise {
  unmatched_lines++
}

for debugging.

I might also add a flag to enable logging of unmatched lines in test mode, but that will be very noisy in production.

jakirpatel commented 5 years ago

I got the following error with the newer version not sure what happened. It says the runtime error:

I0402 08:20:25.573032    9751 main.go:94] mtail version v3.0.0-rc29 git revision b5aa95d10fb3af7b843e3814bc65aedd983425d9 go version go1.11.6 go arch amd64 go os linux
I0402 08:20:25.573058    9751 main.go:95] Commandline: ["/home/gecope/mtail/mtail" "-poll_interval" "1s" "-port" "3902" "-progs" "progs/jmxCacheSize.mtail" "-logs" "/home/gecope/mtail-cron/bsktkvs/cache-size.txt" "-logs" "/home/gecope/mtail-cron/bsktkvsA/cache-size.txt" "-logtostderr" "-v" "2"]
I0402 08:24:59.574501    9751 log_watcher.go:145] Update fi
I0402 08:24:59.574848    9751 file.go:174] Read count 669 err <nil>
I0402 08:24:59.574865    9751 file.go:221] Line sent
I0402 08:24:59.574874    9751 file.go:221] Line sent
I0402 08:24:59.574935    9751 file.go:221] Line sent
I0402 08:24:59.574945    9751 file.go:221] Line sent
I0402 08:24:59.574950    9751 file.go:221] Line sent
I0402 08:24:59.574993    9751 vm.go:90] jmxCacheSize.mtail: Runtime error: panic in thread &vm.thread{pc:24, matched:false, matches:map[int][]string{0:[]string{"/home/gecope/mtail-cron/bsktkvsA/cache-size.txt", "bsktkvsA"}, 1:[]string{"1126\t�� 4 02 17:24:32 JST 2019\tOrderIdDistributedCache\torder-id\t1458924.0\t1458924.0\t1.3913383483886719\t1.0", "OrderIdDistributedCache", "order-id", "1458924.0", "1458924.0", "1.3913383483886719", "1.0"}}, time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, stack:[]interface {}{true, (*datum.StringDatum)(0xc0002cf740)}} at instr "{cat <nil>}": interface conversion: interface {} is *datum.StringDatum, not string
I0402 08:24:59.575078    9751 vm.go:91] VM stack:
goroutine 42 [running]:
runtime/debug.Stack(0xc00047a190, 0x47, 0xc00047f670)
    /usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/google/mtail/internal/vm.(*VM).errorf(0xc00029c280, 0x94512c, 0x23, 0xc00047f670, 0x3, 0x3)
    /go/src/github.com/google/mtail/internal/vm/vm.go:91 +0x113
github.com/google/mtail/internal/vm.(*VM).execute.func1(0xc00029c280, 0xc0000c4cd0, 0x26, 0x0, 0x0)
    /go/src/github.com/google/mtail/internal/vm/vm.go:296 +0x142
panic(0x8ad3a0, 0xc0002cf7d0)
    /usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/google/mtail/internal/vm.(*VM).execute(0xc00029c280, 0xc0000c4cd0, 0x26, 0x0, 0x0)
    /go/src/github.com/google/mtail/internal/vm/vm.go:720 +0x5ded
github.com/google/mtail/internal/vm.(*VM).processLine(0xc00029c280, 0xc00055c120)
    /go/src/github.com/google/mtail/internal/vm/vm.go:748 +0x18c
github.com/google/mtail/internal/vm.(*VM).Run(0xc00029c280, 0x6a6d7843, 0xc0001ae420, 0xc0001ae480, 0xc0001ae4e0)
    /go/src/github.com/google/mtail/internal/vm/vm.go:767 +0xf5
created by github.com/google/mtail/internal/vm.(*Loader).CompileAndRun
    /go/src/github.com/google/mtail/internal/vm/loader.go:241 +0x70d
I0402 08:24:59.575097    9751 vm.go:92] Dumping vm state
I0402 08:24:59.575101    9751 vm.go:93] Name: jmxCacheSize.mtail
I0402 08:24:59.575104    9751 vm.go:94] Input: &logline.LogLine{Filename:"/home/gecope/mtail-cron/bsktkvsA/cache-size.txt", Line:"1126\t�� 4 02 17:24:32 JST 2019\tOrderIdDistributedCache\torder-id\t1458924.0\t1458924.0\t1.3913383483886719\t1.0"}
I0402 08:24:59.575115    9751 vm.go:95] Thread:
I0402 08:24:59.575118    9751 vm.go:96]  PC 23
I0402 08:24:59.575121    9751 vm.go:97]  Matched false
I0402 08:24:59.575125    9751 vm.go:98]  Matches map[0:[/home/gecope/mtail-cron/bsktkvsA/cache-size.txt bsktkvsA] 1:[1126   �� 4 02 17:24:32 JST 2019   OrderIdDistributedCache order-id    1458924.0   1458924.0   1.3913383483886719  1.0 OrderIdDistributedCache order-id 1458924.0 1458924.0 1.3913383483886719 1.0]]
I0402 08:24:59.575135    9751 vm.go:99]  Timestamp 0001-01-01 00:00:00 +0000 UTC
I0402 08:24:59.575145    9751 vm.go:100]  Stack [true ""@1554193499574966437]
I0402 08:24:59.575277    9751 vm.go:101] Prog: jmxCacheSize.mtail
Metrics
        0 Metric: name=total_cache_size program=jmxCacheSize.mtail kind=Counter type=String hidden=false keys=[service cache_name kvs] labelvalues=[LabelValue: [OrderIdDistributedCache order-id bsktkvsA] ""@1554193499574966437] source=jmxCacheSize.mtail:1:9-24 buckets=[]
        1 Metric: name=total_memory_bytes program=jmxCacheSize.mtail kind=Counter type=String hidden=false keys=[service cache_name kvs] labelvalues=[] source=jmxCacheSize.mtail:2:9-26 buckets=[]
        2 Metric: name=total_memory_mb program=jmxCacheSize.mtail kind=Counter type=String hidden=false keys=[service cache_name kvs] labelvalues=[] source=jmxCacheSize.mtail:3:9-23 buckets=[]
Regexps
        0 /[^\t]+/mtail-cron/(?P<kvs>[a-zA-Z]+)[^\t]+/
        1 /\d+\t[^\t]+\t(?P<service>[^\t]+)\t(?P<cache_name>[^\t]+)\t(?P<cache_size>\d.+)\t(?P<memory_bytes>\d.+)\t(?P<memory_mb>\d.+)\t(?P<average_object_size>\d.+)/
Strings
 disasm  l          op  opnd
         0 getfilename     0
         1      smatch     0
         2       match     1
         3         jnm    66
         4  setmatched false
         5        push     1
         6      capref     1
         7        push     1
         8      capref     2
         9        push     0
        10      capref     1
        11       mload     0
        12       dload     3
        13        push     1
        14      capref     1
        15        push     1
        16      capref     2
        17        push     0
        18      capref     1
        19       mload     0
        20       dload     3
        21        push     1
        22      capref     3
        23         cat <nil>
        24        sset <nil>
        25        push     1
        26      capref     1
        27        push     1
        28      capref     2
        29        push     0
        30      capref     1
        31       mload     1
        32       dload     3
        33        push     1
        34      capref     1
        35        push     1
        36      capref     2
        37        push     0
        38      capref     1
        39       mload     1
        40       dload     3
        41        push     1
        42      capref     4
        43         cat <nil>
        44        sset <nil>
        45        push     1
        46      capref     1
        47        push     1
        48      capref     2
        49        push     0
        50      capref     1
        51       mload     2
        52       dload     3
        53        push     1
        54      capref     1
        55        push     1
        56      capref     2
        57        push     0
        58      capref     1
        59       mload     2
        60       dload     3
        61        push     1
        62      capref     5
        63         cat <nil>
        64        sset <nil>
        65  setmatched  true
I0402 08:24:59.575335    9751 file.go:221] Line sent
I0402 08:24:59.575340    9751 file.go:221] Line sent
I0402 08:24:59.575345    9751 file.go:171] /home/gecope/mtail-cron/bsktkvsA/cache-size.txt: file type does not support deadline
I0402 08:24:59.575352    9751 file.go:174] Read count 0 err EOF
jakirpatel commented 5 years ago

Hmm figured it out. Seems I was using counter. But it was double. Now seems ok.

jaqx0r commented 5 years ago

I'm certain there's still a bug but your last update gives me some idea of whta it was.

I think that traceback for the runtime error can be improved greatly because right now you need a good understanding of the mtail VM internals to debug it.

On Wed, 3 Apr 2019 at 15:36, Jakir Patel notifications@github.com wrote:

Hmm figured it out. Seems I was using counter. But it was double. Now seems ok.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/google/mtail/issues/221#issuecomment-479336559, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5b-2S6d9z2s5pfhmiZTN7rbUj67Rrsks5vdC-0gaJpZM4cF6z- .

jakirpatel commented 5 years ago

@jaqx0r

The runtime error should be more clear I think so. For me it took time to debug with the messages.

jnovack commented 5 years ago

Is there any recommendation that how the NFS filesystem should be mounted to read the files through mtail ? --@jakirpatel

Log shipping. By introducing NFS, you are introducing a layer of complexity which is known to be complex and problematic for streaming.