googlefonts / fontbakery-dashboard

A library-scale web dashboard for Font Bakery, no longer developed
Apache License 2.0
28 stars 10 forks source link

[python] dispatcher: diffenator worker gets terminated unfinished with big jobs (locally on minikube) #100

Closed graphicore closed 5 years ago

graphicore commented 5 years ago

The part of the description of the pod that get's killed that hints to what is happening (full description at the bottom of this entry):

Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 27 Mar 2019 01:53:04 +0100
      Finished:     Wed, 27 Mar 2019 02:32:11 +0100

The worker is running diffenator, comparing all of the Volkornbrot family.

Exit Code: 137 hints that the container is terminated because it's running out of memory (OOM) described similarly here.

So, two paths of action seem worthwhile:

  1. If possible, increase minikube memory.
  2. Decrease worker memory consumption by
    • a) review/change diffenator
    • b) split worker into smaller sub-workers.

2b involves some work that ideally shouldn't be necessary for this kind of problem–rather for speeding up overall execution time i.e. parallel execution where/when more workers are available. 2a is preferred now if possible.


full pod description:

$ kf describe pod fontbakery-worker-5b5f68fc48-5kvhd
Name:               fontbakery-worker-5b5f68fc48-5kvhd
Namespace:          fontbakery
Priority:           0
PriorityClassName:  <none>
Node:               minikube/10.0.2.15
Start Time:         Wed, 27 Mar 2019 01:53:02 +0100
Labels:             pod-template-hash=5b5f68fc48
                    run=fontbakery-worker
Annotations:        <none>
Status:             Running
IP:                 172.17.0.17
Controlled By:      ReplicaSet/fontbakery-worker-5b5f68fc48
Containers:
  fontbakery-worker:
    Container ID:  docker://cc0d6b2bb6cc7d4a34fea2af94c6314280c449f1b5e682a7c55c91f8264ea071
    Image:         fontbakery/base-python:1
    Image ID:      docker://sha256:e6a3767c70f3137551378c9c821fdd8992686ce41e0378d7c0a044e876c70926
    Port:          <none>
    Host Port:     <none>
    Command:
      python3
      -u
      worker-launcher.py
    State:          Running
      Started:      Wed, 27 Mar 2019 02:32:24 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 27 Mar 2019 01:53:04 +0100
      Finished:     Wed, 27 Mar 2019 02:32:11 +0100
    Ready:          True
    Restart Count:  1
    Environment:
      FONTBAKERY_WORKER_LOG_LEVEL:  DEBUG
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-drx9x (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-drx9x:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-drx9x
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age               From               Message
  ----    ------     ----              ----               -------
  Normal  Scheduled  43m               default-scheduler  Successfully assigned fontbakery/fontbakery-worker-5b5f68fc48-5kvhd to minikube
  Normal  Pulled     4m (x2 over 43m)  kubelet, minikube  Container image "fontbakery/base-python:1" already present on machine
  Normal  Created    4m (x2 over 43m)  kubelet, minikube  Created container
  Normal  Started    4m (x2 over 43m)  kubelet, minikube  Started container
graphicore commented 5 years ago

For completeness, but not very useful I think, here are the logs of a failed and terminated "fontbakery-worker" running diffenator with the Vollkorn from above.

The errors E0327 ... backup_poller.cc:124 ... are interesting but probably a side effect of the pod being under high load. The file src/core/lib/iomgr/ev_epoll1_linux.cc is from gRPC, so I'm guessing the cache and persistence gRPC clients are timing out or something similar.

$ kf logs --previous fontbakery-worker-5b5f68fc48-5kvhd
INFO:FB_WORKER:loglevel: DEBUG
INFO:FB_WORKER:RethinkDB HOST 10.102.163.63 PORT 28015
INFO:FB_WORKER:Waiting for messages in fontbakery-worker...
INFO:FB_WORKER:consuming incoming message ...
DEBUG:FB_WORKER:Got job description: worker_name: "diffenator"
job {
  type_url: "type.googleapis.com/fontbakery.dashboard.FamilyJob"
  value: "\n@c3fc7cf3423058b61550e6859b56b68773b6c97183e975880f5ea967b69cdf7e\022\211\001\nBc3fc7cf3423058b61550e6859b56b68773b6c97183e975880f5ea967b69cdf7e:0\022@c3fc7cf3423058b61550e6859b56b68773b6c97183e975880f5ea967b69cdf7e\032\0010"
}

INFO:FB_WORKER:consuming a job for: <class 'worker.diffenator.Diffenator'> with <Basic.Deliver(['consumer_tag=ctag1.17caf35750ba4468ae25ca4c2d55ea6f', 'delivery_tag=1', 'exchange=', 'redelivered=True', 'routing_key=fontbakery-worker'])> <BasicProperties(['delivery_mode=2', 'headers={}'])>
DEBUG:FB_WORKER:Files in Tempdir /tmp/tmpb8fraurz: ['/tmp/tmpb8fraurz/after/Vollkorn-SemiBoldItalic.ttf', '/tmp/tmpb8fraurz/after/Vollkorn-Medium.ttf', '/tmp/tmpb8fraurz/after/METADATA.pb', '/tmp/tmpb8fraurz/after/Vollkorn-MediumItalic.ttf', '/tmp/tmpb8fraurz/after/Vollkorn-SemiBold.ttf', '/tmp/tmpb8fraurz/after/Vollkorn-Italic.ttf', '/tmp/tmpb8fraurz/after/Vollkorn-BlackItalic.ttf', '/tmp/tmpb8fraurz/after/DESCRIPTION.en_us.html', '/tmp/tmpb8fraurz/after/OFL.txt', '/tmp/tmpb8fraurz/after/Vollkorn-BoldItalic.ttf', '/tmp/tmpb8fraurz/after/Vollkorn-ExtraBoldItalic.ttf', '/tmp/tmpb8fraurz/after/Vollkorn-Black.ttf', '/tmp/tmpb8fraurz/after/Vollkorn-Regular.ttf', '/tmp/tmpb8fraurz/after/Vollkorn-ExtraBold.ttf', '/tmp/tmpb8fraurz/after/Vollkorn-Bold.ttf', '/tmp/tmpb8fraurz/after/FONTLOG.txt', '/tmp/tmpb8fraurz/before/Vollkorn-SemiBoldItalic.ttf', '/tmp/tmpb8fraurz/before/Vollkorn-SemiBold.ttf', '/tmp/tmpb8fraurz/before/Vollkorn-Italic.ttf', '/tmp/tmpb8fraurz/before/Vollkorn-BlackItalic.ttf', '/tmp/tmpb8fraurz/before/Vollkorn-BoldItalic.ttf', '/tmp/tmpb8fraurz/before/Vollkorn-Black.ttf', '/tmp/tmpb8fraurz/before/Vollkorn-Regular.ttf', '/tmp/tmpb8fraurz/before/Vollkorn-Bold.ttf']
INFO:fontdiffenator:'diff_nametable'  0.58 ms
INFO:fontdiffenator:'diff_attribs'  1.31 ms
INFO:fontdiffenator:Rendering glyph differences. Be patient
E0327 03:13:37.253009780    4331 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553652817.251048301","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553652817.251040999","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:15:12.893402882    4339 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553652912.887556358","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553652912.887544261","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:15:27.675585564    4622 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553652927.675187391","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553652927.675183504","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:16:43.076839745    5474 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653002.039966466","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653001.179768074","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:16:52.492755280    5489 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653012.492391359","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653012.492387460","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:17:30.139855283    5512 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653050.122289256","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653050.122283964","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:17:41.742626392    5523 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653061.662958260","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653061.662942841","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:17:58.572300760    5530 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653078.568873703","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653078.568866882","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:18:13.678178052    6127 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653093.677908218","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653093.677904581","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:21:42.188160347   14672 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653302.186797382","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653302.186785188","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:21:48.269849885   14680 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653308.269311660","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653308.269306133","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:22:23.997368010   14749 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653343.996981124","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653343.996977948","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:23:15.888473375   14780 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653395.880127174","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653395.880118616","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:23:37.906640614   14796 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653417.906054284","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653417.906047807","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:24:38.699788702   20437 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653478.699540482","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653478.699505072","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
E0327 03:25:04.979810059   21640 backup_poller.cc:124]       Run client channel backup poller: {"created":"@1553653504.979501630","description":"pollset_work","file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":242,"referenced_errors":[{"created":"@1553653504.979496712","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epoll1_linux.cc","file_line":723,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
INFO:fontdiffenator:'diff_glyphs'  736884.57 ms
INFO:fontdiffenator:'diff_kerning'  7522.01 ms
INFO:fontdiffenator:'diff_metrics'  34.19 ms
INFO:fontdiffenator:'diff_marks'  52778.42 ms
INFO:fontdiffenator:'diff_marks'  2300.87 ms
graphicore commented 5 years ago

I did not make the memory hunger go away, but I solved the problem for me, using a laptop with much more memory and creating a minikube instance with 16 GB ram: minikube start --memory=16384. Now everything works as expected again.

Here's a python memory-profiler (mprof) graph I created on the new, not crashing instance. One can see nicely the eight jobs Vollkorn created:

Screenshot from 2019-04-03 16-54-53

The peaks at the end of each job are the creation of the reports. The plateaus are most probably in the diffenator diff_rendering which uses harfbuzz and renders and compares most glyphs and thus takes a while.

Peak memory is at ~ 1.2 GB!