m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0

Add alpha functionality to pcap parser #1028

Closed gfr10598 closed 2 years ago

gfr10598 commented 2 years ago

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build 6830


Changes Missing Coverage:
File                Covered Lines  Changed/Added Lines  %
active/throttle.go  3              4                    75.0%
parser/pcap.go      10             13                   76.92%
active/poller.go    18             28                   64.29%
tcp/tcp.go          385            468                  82.26%
Total               436            533                  81.8%

Files with Coverage Reduction  New Missed Lines  %
parser/pcap.go                 1                 75.68%
Total                          1
Totals Coverage Status
Change from base Build 6733: 1.6%
Covered Lines: 4181
Relevant Lines: 6397

💛 - Coveralls
gfr10598 commented 2 years ago

As this has evolved to run in sandbox, it now looks as though the throughput quickly falls to very low rates, less than 5% of the index-only throughput, and this is largely due to GC overhead: less than 2% of time is spent on decompression, while more than 50% shows up as GC.

Showing nodes accounting for 4.09mins, 82.55% of 4.95mins total
Dropped 484 nodes (cum <= 0.02mins)
Showing top 50 nodes out of 124
      flat  flat%   sum%        cum   cum%
  0.59mins 12.01% 12.01%   2.75mins 55.50%  runtime.mallocgc
  0.35mins  7.09% 19.11%   0.47mins  9.59%  runtime.heapBitsSetType
  0.34mins  6.93% 26.04%   0.34mins  6.93%  runtime.procyield
  0.21mins  4.32% 30.36%   0.21mins  4.32%  runtime.memclrNoHeapPointers
  0.17mins  3.43% 33.79%   0.17mins  3.43%  runtime.nextFreeFast
  0.14mins  2.87% 36.66%   0.39mins  7.97%  fmt.(*pp).doPrintf
  0.11mins  2.18% 38.84%   0.11mins  2.18%  runtime.futex
  0.10mins  2.11% 40.95%   0.12mins  2.36%  runtime.walltime (inline)
  0.09mins  1.89% 42.84%   0.11mins  2.14%  runtime.heapBitsForAddr (inline)
  0.09mins  1.80% 44.64%   1.04mins 21.07%  github.com/google/gopacket/layers.(*TCP).DecodeFromBytes
  0.09mins  1.74% 46.39%   0.32mins  6.42%  github.com/google/gopacket.(*packet).addFinalDecodeError
  0.08mins  1.70% 48.09%   0.09mins  1.79%  runtime.(*itabTableType).find
  0.08mins  1.66% 49.75%   0.19mins  3.92%  runtime.scanobject
  0.08mins  1.64% 51.38%   0.12mins  2.48%  fmt.(*fmt).fmtInteger
  0.07mins  1.43% 52.81%   2.42mins 48.96%  github.com/google/gopacket.(*eagerPacket).NextDecoder
  0.07mins  1.40% 54.21%   2.56mins 51.76%  runtime.newobject
  0.07mins  1.35% 55.56%   4.73mins 95.50%  github.com/m-lab/etl/tcp.(*Parser).Parse
  0.06mins  1.27% 56.83%   2.97mins 59.90%  github.com/google/gopacket/layers.decodeEthernet

gfr10598 commented 2 years ago

FYI, pprof obtained with:

gcloud config set project mlab-sandbox
gcloud container clusters get-credentials data-processing --region us-east1 --project mlab-sandbox
kubectl get pods | grep parser
kubectl port-forward etl-parser-54d875fcf6-8frjr 9090:9090
go tool pprof localhost:9090

gfr10598 commented 2 years ago

Another odd behavior, unexplained so far, is that the number of goroutines and open FDs rapidly drops, and is generally much lower than in prod.

https://grafana.mlab-sandbox.measurementlab.net/dashboard/snapshot/1F8rrqgsUhQJ6QragQQNJfmhkNJOpSd9

gfr10598 commented 2 years ago

Speculating that there may be a bug in the Throttle that limits the number of tasks in flight. It is odd that the pcap prototype would trigger this, but if that is the case, it is important to debug it.

There is also strange behavior in the ActiveTasks metric. For most of the parser's history, this gauge quickly decays to near zero, whereas we would expect it to stay very close to 100.

As noted above, the drop in FDs is perhaps clearer. It is tracked entirely by external production code, so it is likely reliable. Without the prototype pcap code, the number of FDs maintains a floor around 325, with frequent periods around 2000. With the pcap prototype, it drops rapidly at first, then declines gradually to 10-20 after 6 hours or so.

gfr10598 commented 2 years ago

At fe76fed there is significant functionality, but the parsing is getting quite slow. The throughput per instance is only about 200 Mb/sec, based on the version running in sandbox on 11/20. This is less than 20% of the index-only parser.

A large amount of time is being spent in gopacket's TCP-layer and IP-layer parsing. NextTest is using only about 17% of the CPU.
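Much of the mallocgc time in the profile sits under per-packet allocation in the decode path. gopacket's own mitigations (the Lazy/NoCopy DecodeOptions, or a reusable DecodingLayerParser) avoid most of it; as a stdlib-only sketch of the same idea, a sync.Pool can recycle scratch buffers across packets (bufPool and decodeOne are invented names, not etl code):

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool recycles packet-sized scratch buffers across decodes so
// that each packet does not hit runtime.mallocgc for a fresh slice.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 2048) },
}

// decodeOne is a stand-in for per-packet work; real code would hand
// buf to the decoder rather than just copying the payload into it.
func decodeOne(pkt []byte) int {
	buf := bufPool.Get().([]byte)[:0]
	buf = append(buf, pkt...)
	n := len(buf)
	bufPool.Put(buf)
	return n
}

func main() {
	fmt.Println(decodeOne([]byte{0xde, 0xad, 0xbe, 0xef})) // 4
}
```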

[Image: Nov 19 CPU profile]

gfr10598 commented 2 years ago

The GC was taking about 12% of the CPU, and allocations were running at 25-30 MB/sec.

gfr10598 commented 2 years ago
File: etl_worker
Build ID: 503578a29e6077b5ce3b23c5c52e37e4318876f3
Type: cpu
Time: Nov 19, 2021 at 5:00pm (EST)
Duration: 30.18s, Total samples = 5.05mins (1004.61%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top60 -cum
Showing nodes accounting for 140.20s, 46.24% of 303.20s total
Dropped 615 nodes (cum <= 1.52s)
Showing top 60 nodes out of 176
      flat  flat%   sum%        cum   cum%
         0     0%     0%    289.07s 95.34%  github.com/m-lab/etl/active.(*GardenerAPI).RunAll.func1
         0     0%     0%    289.07s 95.34%  github.com/m-lab/etl/active.(*throttledRunnable).Run
         0     0%     0%    289.07s 95.34%  github.com/m-lab/etl/worker.ProcessGKETask
         0     0%     0%    289.07s 95.34%  golang.org/x/sync/errgroup.(*Group).Go.func1
         0     0%     0%    289.07s 95.34%  main.(*runnable).Run
         0     0%     0%    289.06s 95.34%  github.com/m-lab/etl/task.(*Task).ProcessAllTests
         0     0%     0%    289.06s 95.34%  github.com/m-lab/etl/worker.DoGKETask
     0.02s 0.0066% 0.0066%    236.80s 78.10%  github.com/m-lab/etl/parser.(*PCAPParser).ParseAndInsert
     9.28s  3.06%  3.07%    236.55s 78.02%  github.com/m-lab/etl/tcp.(*Parser).Parse
    22.08s  7.28% 10.35%    119.18s 39.31%  runtime.mallocgc
     1.87s  0.62% 10.97%     94.85s 31.28%  runtime.newobject
     0.55s  0.18% 11.15%     86.90s 28.66%  github.com/google/gopacket.(*lazyPacket).ErrorLayer
     3.16s  1.04% 12.19%     86.46s 28.52%  github.com/google/gopacket.(*lazyPacket).decodeNextLayer
     0.92s   0.3% 12.49%     84.73s 27.95%  github.com/m-lab/etl/tcp.(*Parser).tcpLayer
     4.16s  1.37% 13.87%     83.81s 27.64%  github.com/m-lab/etl/tcp.(*state).Update
     0.75s  0.25% 14.11%     78.67s 25.95%  github.com/google/gopacket.DecodeFunc.Decode
     0.21s 0.069% 14.18%     70.23s 23.16%  runtime.systemstack
     1.49s  0.49% 14.67%     57.30s 18.90%  github.com/google/gopacket.LayerType.Decode
         0     0% 14.67%     52.20s 17.22%  github.com/m-lab/etl/storage.(*GCSSource).NextTest
         0     0% 14.67%     50.10s 16.52%  bytes.(*Buffer).ReadFrom
         0     0% 14.67%     50.05s 16.51%  github.com/m-lab/etl/storage.(*GCSSource).nextData
         0     0% 14.67%     49.95s 16.47%  io/ioutil.ReadAll (inline)
         0     0% 14.67%     49.95s 16.47%  io/ioutil.readAll
     0.41s  0.14% 14.81%     46.71s 15.41%  runtime.(*mcache).nextFree
     0.05s 0.016% 14.83%     46.68s 15.40%  compress/gzip.(*Reader).Read
     0.04s 0.013% 14.84%     46.27s 15.26%  compress/flate.(*decompressor).Read
     0.17s 0.056% 14.89%     45.94s 15.15%  runtime.(*mcache).refill
     7.76s  2.56% 17.45%     44.64s 14.72%  compress/flate.(*decompressor).huffmanBlock
     2.16s  0.71% 18.17%     44.35s 14.63%  runtime.(*mcentral).cacheSpan
     1.72s  0.57% 18.73%     42.60s 14.05%  github.com/m-lab/etl/tcp.(*state).Option
    17.29s  5.70% 24.44%     30.48s 10.05%  compress/flate.(*decompressor).huffSym
     1.04s  0.34% 24.78%     30.21s  9.96%  github.com/google/gopacket/layers.decodeTCP
     0.28s 0.092% 24.87%     28.88s  9.53%  runtime.deductSweepCredit
     2.09s  0.69% 25.56%     28.64s  9.45%  runtime.sweepone
     1.21s   0.4% 25.96%     26.70s  8.81%  runtime.(*mspan).sweep
     1.22s   0.4% 26.36%     26.03s  8.59%  github.com/m-lab/etl/tcp.(*Tracker).Ack
     0.39s  0.13% 26.49%     23.25s  7.67%  github.com/google/gopacket/layers.EthernetType.Decode
     0.03s 0.0099% 26.50%     22.81s  7.52%  runtime.(*mheap).freeSpan
     0.17s 0.056% 26.56%     22.78s  7.51%  runtime.(*mheap).freeSpan.func1
     0.45s  0.15% 26.71%     22.45s  7.40%  github.com/google/gopacket.NewPacket
     8.79s  2.90% 29.60%        21s  6.93%  bufio.(*Reader).ReadByte
     7.13s  2.35% 31.96%     20.33s  6.71%  runtime.gcWriteBarrier
     1.57s  0.52% 32.47%     20.07s  6.62%  github.com/google/gopacket/layers.decodeEthernet
     9.02s  2.97% 35.45%     19.87s  6.55%  runtime.scanobject
         0     0% 35.45%     19.73s  6.51%  runtime.lock (inline)
     1.58s  0.52% 35.97%     19.73s  6.51%  runtime.lock2
         0     0% 35.97%     19.73s  6.51%  runtime.lockWithRank (inline)
     1.09s  0.36% 36.33%     17.83s  5.88%  github.com/google/gopacket/layers.decodeIPv4
         0     0% 36.33%     17.18s  5.67%  compress/flate.(*decompressor).nextBlock
     0.89s  0.29% 36.62%     16.40s  5.41%  github.com/m-lab/etl/tcp.(*Tracker).Sack
     0.08s 0.026% 36.65%     16.16s  5.33%  runtime.wbBufFlush
     0.02s 0.0066% 36.66%     16.08s  5.30%  runtime.wbBufFlush.func1
     2.95s  0.97% 37.63%     16.08s  5.30%  runtime.wbBufFlush1
     0.05s 0.016% 37.65%     15.76s  5.20%  bufio.(*Reader).fill
         0     0% 37.65%     15.45s  5.10%  archive/tar.(*Reader).Read
     0.03s 0.0099% 37.66%     15.45s  5.10%  archive/tar.(*regFileReader).Read
     0.85s  0.28% 37.94%     15.23s  5.02%  github.com/m-lab/go/logx.(*logEvery).ok
     0.44s  0.15% 38.08%     14.70s  4.85%  runtime.makeslice
    10.85s  3.58% 41.66%     14.09s  4.65%  runtime.heapBitsSetType
    13.89s  4.58% 46.24%     13.89s  4.58%  runtime.procyield
stephen-soltesz commented 2 years ago

Closing this old PR and referencing the notes from this change in https://github.com/m-lab/etl/issues/1085.