keybase / client

Keybase Go Library, Client, Service, OS X, iOS, Android, Electron
BSD 3-Clause "New" or "Revised" License
8.82k stars 1.22k forks source link

kbfs failures #21635

Open wodz opened 4 years ago

wodz commented 4 years ago

I tried rsyncing ~85GB dir to kbfs. I got error: rsync: write failed on : Transport endpoint is not connected (107) rsync error: error in file IO (code 11) at receiver.c(393) [receiver=3.1.2]

Now /keybase has KBFS_NOT_RUNNING link and there is no way to start kbfs again. run_keybase returns: Job for kbfs.service failed because a timeout was exceeded. ~/.local/share/keybase has grown to ~60GB.

This happens second time. Deleting ~/.local/share/keybase allows to restart service but deletion of this folder is painfully slow as it contains tons of small files.

my log id: cb6789a950feafae0003eb1c

wodz commented 4 years ago

Now I see in dmesg kbfsfuse got killed by oom-kill. kbfsfuse allocated ~18GB of memory. Is there something which can be done about it?

strib commented 4 years ago

Thanks for the report. I'm surprised that kbfsfuse would have that much memory in use -- it should top out at around 1.5 GB. (It's possible it had 18 GB of virtual, not resident memory, allocated at once, but virtual memory shouldn't trigger oom-killer as far as I know, so that's weird.)

The logs you sent indicate that KBFS was getting killed and restarted about once a minute by something on your machine (I guess oom-killer?) -- again, it seems weird that it would be able to allocate much memory in that time. Also, the logs include some memory profiles that shows it at about 156 MB of memory allocated around the most recent time it was killed. So again, weird.

You do have about 32 GB still left in your journal to flush to our servers. If you want, you can run some commands to get rid of that unflushed data, and then try to flush again when you're ready. If you do want to try that, do this:

Then, once you verify that everything is ok and you haven't lost any data that wasn't backed up somewhere else, you can delete the whole $HOME/.local/share/keybase/kbfs_journal/v1/d6fc1f16fa61b7e4.bkp directory.

It would be nice to figure out why KBFS is being killed though. Maybe you can watch top and see what the reported resident memory is before it's killed?

wodz commented 4 years ago

Please find below what i have in dmesg about oom-kill activity. Maybe it will shed some light.

[2601810.697259] Tasks state (memory values in pages): [2601810.697260] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [2601810.697267] [ 558] 0 558 26475 34 98304 36 0 lvmetad [2601810.697269] [ 998] 0 998 47169 66 208896 141 0 thermald [2601810.697270] [ 1003] 0 1003 108581 196 352256 202 0 ModemManager [2601810.697271] [ 1008] 0 1008 10027 23 118784 50 0 cron [2601810.697273] [ 1009] 102 1009 65758 165 151552 308 0 rsyslogd [2601810.697274] [ 1010] 0 1010 27619 33 114688 58 0 irqbalance [2601810.697275] [ 1011] 103 1011 12925 466 147456 82 -900 dbus-daemon [2601810.697276] [ 1036] 0 1036 162926 543 471040 377 0 NetworkManager [2601810.697278] [ 1038] 0 1038 1137 6 65536 35 0 acpid [2601810.697279] [ 1039] 0 1039 129534 662 364544 245 0 udisksd [2601810.697280] [ 1041] 0 1041 17676 55 172032 148 0 systemd-logind [2601810.697281] [ 1044] 0 1044 11400 80 135168 153 0 wpa_supplicant [2601810.697282] [ 1047] 117 1047 11861 106 135168 68 0 avahi-daemon [2601810.697283] [ 1049] 0 1049 77582 147 237568 169 0 accounts-daemon [2601810.697284] [ 1051] 0 1051 44832 1 233472 1986 0 networkd-dispat [2601810.697286] [ 1065] 117 1065 11768 15 131072 69 0 avahi-daemon [2601810.697287] [ 1073] 0 1073 81012 658 274432 359 0 polkitd [2601810.697289] [ 1124] 118 1124 239021 341 356352 1103 0 colord [2601810.697290] [ 1126] 0 1126 49009 0 278528 2002 0 unattended-upgr [2601810.697291] [ 1132] 0 1132 77221 56 245760 285 0 lightdm [2601810.697292] [ 1150] 0 1150 147168 23612 741376 4349 0 Xorg [2601810.697293] [ 1155] 0 1155 6242 0 73728 35 0 agetty [2601810.697295] [ 1293] 0 1293 6497 60 86016 259 0 dhclient [2601810.697296] [ 1310] 113 1310 95355 25 360448 395 0 whoopsie [2601810.697297] [ 1325] 114 1325 14234 27 147456 87 0 kerneloops [2601810.697298] [ 1327] 114 1327 14234 27 143360 86 0 kerneloops [2601810.697299] [ 2155] 0 2155 62637 69 258048 595 0 lightdm [2601810.697301] [ 3317] 1000 3317 19364 317 208896 166 0 systemd [2601810.697302] [ 3318] 1000 3318 65374 3 274432 655 0 (sd-pam) [2601810.697303] [ 3379] 1000 3379 72538 157 176128 142 0 gnome-keyring-d [2601810.697304] [ 3433] 1000 3433 1156 2 49152 22 0 sh [2601810.697305] [ 3559] 1000 3559 12940 455 147456 121 0 dbus-daemon [2601810.697306] [ 3863] 1000 3863 2825 10 61440 69 0 ssh-agent [2601810.697308] [ 3947] 1000 3947 85147 171 421888 328 0 xfce4-session [2601810.697309] [ 4032] 1000 4032 14847 94 155648 86 0 xfconfd [2601810.697310] [ 4129] 1000 4129 101943 1563 512000 890 0 xfwm4 [2601810.697311] [ 4177] 1000 4177 96666 1949 516096 587 0 xfce4-panel [2601810.697312] [ 4179] 1000 4179 169116 14830 884736 19484 0 Thunar [2601810.697314] [ 4181] 1000 4181 166069 7863 659456 1408 0 xfdesktop [2601810.697315] [ 4210] 1000 4210 98129 201 462848 522 0 xfsettingsd [2601810.697316] [ 4213] 1000 4213 170130 3924 643072 3381 0 nm-applet [2601810.697317] [ 4214] 1000 4214 19955 195 196608 50 0 xscreensaver [2601810.697318] [ 4243] 1000 4243 92696 266 479232 1031 0 xfce4-power-man [2601810.697320] [ 4248] 1000 4248 91968 52 233472 188 0 at-spi-bus-laun [2601810.697321] [ 4252] 1000 4252 73461 214 200704 101 0 gvfsd [2601810.697322] [ 4260] 1000 4260 108005 107 217088 195 0 gvfsd-fuse [2601810.697323] [ 4290] 1000 4290 147601 498 512000 994 0 update-notifier [2601810.697324] [ 4298] 1000 4298 65903 289 397312 4005 0 applet.py [2601810.697325] [ 4306] 1000 4306 85737 170 430080 1039 0 polkit-gnome-au [2601810.697326] [ 4341] 1000 4341 170932 106 622592 5229 0 blueman-applet [2601810.697328] [ 4369] 1000 4369 12544 83 143360 99 0 dbus-daemon [2601810.697329] [ 4389] 1000 4389 104240 127 421888 1076 0 light-locker [2601810.697330] [ 4415] 1000 4415 47007 130 135168 65 0 dconf-service [2601810.697331] [ 4513] 1000 4513 55193 38 192512 150 0 at-spi2-registr [2601810.697332] [ 4530] 1000 4530 489649 762 487424 374 0 pulseaudio [2601810.697333] [ 4534] 109 4534 45875 23 131072 55 0 rtkit-daemon [2601810.697334] [ 4557] 1000 4557 121108 956 536576 763 0 xfce4-notifyd [2601810.697336] [ 4562] 0 4562 81026 156 249856 201 0 upowerd [2601810.697337] [ 4600] 1000 4600 123700 877 540672 932 0 panel-1-whisker [2601810.697338] [ 4612] 1000 4612 45088 231 385024 250 0 panel-4-systray [2601810.697339] [ 4614] 1000 4614 101994 836 503808 569 0 panel-5-notific [2601810.697340] [ 4616] 1000 4616 111375 856 483328 545 0 panel-6-indicat [2601810.697341] [ 4617] 1000 4617 144862 1903 569344 3876 0 panel-7-statusn [2601810.697343] [ 4618] 1000 4618 103040 762 528384 638 0 panel-8-power-m [2601810.697344] [ 4621] 1000 4621 176456 1298 593920 658 0 panel-9-pulseau [2601810.697345] [ 4622] 1000 4622 44755 480 389120 227 0 panel-15-cpugra [2601810.697346] [ 4623] 1000 4623 90339 806 462848 470 0 panel-14-netloa [2601810.697347] [ 4644] 1000 4644 116846 58 270336 376 0 indicator-messa [2601810.697348] [ 4656] 1000 4656 99178 551 258048 303 0 gvfs-udisks2-vo [2601810.697349] [ 4660] 1000 4660 95124 102 225280 181 0 gvfs-afc-volume [2601810.697351] [ 4665] 1000 4665 68932 86 163840 94 0 gvfs-goa-volume [2601810.697352] [ 4669] 1000 4669 69421 96 167936 95 0 gvfs-mtp-volume [2601810.697353] [ 4673] 1000 4673 72619 108 188416 128 0 gvfs-gphoto2-vo [2601810.697354] [ 4755] 1000 4755 96862 265 245760 621 0 gvfsd-trash [2601810.697355] [ 4946] 1000 4946 51633 143 167936 104 0 gvfsd-metadata [2601810.697356] [ 6897] 1000 6897 20681 1 151552 166 0 obexd [2601810.697358] [ 7817] 1000 7817 114944 250 241664 149 0 gvfsd-network [2601810.697359] [ 7826] 1000 7826 196606 342 741376 532 0 gvfsd-smb-brows [2601810.697360] [ 7838] 1000 7838 114210 132 237568 162 0 gvfsd-dnssd [2601810.697361] [ 11038] 1000 11038 90363 166 249856 267 0 gvfsd-http [2601810.697362] [ 12727] 1000 12727 173530 246 708608 6756 0 scp-dbus-servic [2601810.697364] [ 13600] 1000 13600 97026 904 385024 942 0 gigolo [2601810.697365] [ 13604] 1000 13604 176681 74 720896 1226 0 gvfsd-smb [2601810.697367] [ 2711] 0 2711 96026 112 323584 363 0 packagekitd [2601810.697368] [ 6686] 1000 6686 404708 10200 1261568 7798 0 VirtualBox [2601810.697369] [ 6704] 1000 6704 47036 305 335872 401 0 VBoxXPCOMIPCD [2601810.697370] [ 6709] 1000 6709 314836 4592 684032 16593 0 VBoxSVC [2601810.697372] [ 25472] 105 25472 9231 0 114688 53 0 uuidd [2601810.697373] [ 21739] 1000 21739 20443 59 188416 117 0 gconfd-2 [2601810.697375] [ 17691] 1000 17691 167165 3706 610304 672 0 xfce4-terminal [2601810.697376] [ 17695] 1000 17695 7941 90 106496 391 0 bash [2601810.697378] [ 28915] 1000 28915 7941 309 102400 185 0 bash [2601810.697379] [ 29679] 1000 29679 133266 2278 479232 1 0 mousepad [2601810.697380] [ 31711] 1000 31711 7907 451 106496 1 0 bash [2601810.697382] [ 689] 101 689 17778 259 172032 17 0 systemd-resolve [2601810.697383] [ 694] 62583 694 35996 152 184320 0 0 systemd-timesyn [2601810.697384] [ 700] 0 700 45330 227 364544 10 0 systemd-journal [2601810.697385] [ 1888] 0 1888 11628 385 118784 0 -1000 systemd-udevd [2601810.697387] [ 995] 0 995 307531 2973 307200 2 -900 snapd [2601810.697388] [ 13647] 1000 13647 7908 419 102400 25 0 bash [2601810.697389] [ 14221] 1000 14221 7941 493 110592 0 0 bash [2601810.697390] [ 14311] 1000 14311 918062 80229 2883584 19843 0 firefox [2601810.697391] [ 14408] 1000 14408 765337 23139 2170880 584 0 Web Content [2601810.697392] [ 14470] 1000 14470 641917 6531 913408 153 0 WebExtensions [2601810.697394] [ 16871] 0 16871 49120 1310 258048 0 0 cupsd [2601810.697395] [ 16872] 0 16872 76546 503 352256 0 0 cups-browsed [2601810.697396] [ 16891] 7 16891 21602 203 200704 0 0 dbus [2601810.697397] [ 18330] 1000 18330 587964 171026 2351104 11270 0 keybase [2601810.697398] [ 18367] 1000 18367 177629 630 188416 1 0 keybase-redirec [2601810.697400] [ 18369] 1000 18369 4699249 3322613 32350208 13832 0 kbfsfuse [2601810.697401] [ 18396] 1000 18396 310680 6769 1400832 64 0 Keybase [2601810.697402] [ 18403] 1000 18403 93872 1616 647168 1 0 Keybase [2601810.697403] [ 18405] 1000 18405 93872 1616 495616 1 0 Keybase [2601810.697404] [ 18430] 1000 18430 130658 5977 897024 0 0 Keybase [2601810.697405] [ 18440] 1000 18440 213456 27091 2473984 40 0 Keybase [2601810.697407] [ 18445] 1000 18445 240055 30504 4501504 410 0 Keybase [2601810.697408] [ 19297] 1000 19297 35409 13474 315392 900 0 rsync [2601810.697409] [ 19298] 1000 19298 29904 4669 262144 1344 0 rsync [2601810.697410] [ 19299] 1000 19299 55556 5886 475136 2284 0 rsync [2601810.697412] [ 19528] 1000 19528 776217 56456 2244608 4494 0 thunderbird [2601810.697413] [ 20587] 1000 20587 192170 455 851968 15599 0 update-manager [2601810.697414] [ 20797] 1000 20797 101240 221 569344 1467 0 RDD Process [2601810.697415] [ 20987] 1000 20987 671268 1993 1183744 15627 0 Web Content [2601810.697416] [ 21273] 1000 21273 792282 21030 2400256 13039 0 Web Content [2601810.697418] [ 21914] 1000 21914 131453 33 237568 331 0 sd_espeak-ng [2601810.697419] [ 21921] 1000 21921 105335 46 208896 141 0 sd_dummy [2601810.697420] [ 21924] 1000 21924 105341 46 208896 144 0 sd_generic [2601810.697421] [ 21927] 1000 21927 44478 0 110592 76 0 speech-dispatch [2601810.697422] [ 21954] 1000 21954 634089 275 782336 3884 0 Web Content [2601810.697423] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=kbfsfuse,pid=18369,uid=1000 [2601810.697427] Out of memory: Kill process 18369 (kbfsfuse) score 781 or sacrifice child [2601810.697560] Killed process 18369 (kbfsfuse) total-vm:18796996kB, anon-rss:13290452kB, file-rss:0kB, shmem-rss:0kB [2601811.378836] oom_reaper: reaped process 18369 (kbfsfuse), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

wodz commented 4 years ago

I moved journal, started keybase and run again rsync. According to top after half an hour kbfsfuse is using over 2.3GB of RES memory and this continuously increases.

strib commented 4 years ago

Hmm, thanks for trying and for the info. When kbfsfuse is at a lot of RES memory can you please a) copy the file /keybase/.kbfs_profiles/heap and share it with me like this:

cp /keybase/.kbfs_profiles/heap /keybase/private/strib@github,marcinbukat/

and b) do another keybase log send and report the log ID here? Thanks!

wodz commented 4 years ago

kbfsfuse at around 3GB RES. I sent you /keybase/.kbfs_profiles/heap as requested log id: 31624c49feeb6087812d261c

strib commented 4 years ago

Thanks! Unfortunately I still can't explain, or even see, the memory usage from all of this. The heap reports that it's only using about 1 GB of memory, and there's nothing unusual-looking in the logs.

Could you also share the allocs file with me please, when you get a chance, if it's still happening?

cp /keybase/.kbfs_profiles/allocs /keybase/private/strib@github,marcinbukat/

I wonder if it's allocating and de-allocating too fast for Go's garbage collection to keep up with? Seems unlikely but it's my best theory so far.

Is it still growing, beyond the 3 GB RES?

wodz commented 4 years ago

It hit 12.6GB RES and got killed. I observed memory usage in htop and it fluctuates - I mean usage drops from time to time but globally usage constantly grows. Usage grows also when rsyncing files which are the same on both ends so I guess reading from kbfs has something to do with this.