urbit / vere

An implementation of the Urbit runtime
MIT License

Pre-migration ships cannot OTA to latest #218

Open lukebuehler opened 1 year ago

lukebuehler commented 1 year ago

We have two ships in hosting that cannot upgrade to latest. They are both on 1.18.

Attempts to upgrade to the latest %base fail with crashes; this is after |pack and |meld, and on the 4GB loom.

lukebuehler commented 1 year ago

Here are two examples from timwyn-batsym where this happens:

|ota ~litzod
[...]
arvo: beginning upgrade
arvo: compiling hoon
[...]
>   arvo: +load next
lull: ~marlet-hinduc
zuse: ~ligfep-radfyn
vane: %ames: ~sitbus-dotmec
ames: larva reload
vane: %behn: ~sogmut-tonmus
vane: %clay: ~fidpyx-tadmes
~timwyn-batsym:dojo> loom: not enough memory to recenter the guard page
recover: top: meme
home: bailing out
pier: serf unexpectedly shut down
\«behn»
timwyn-batsym@tlon.network>
|ota ~litzod
[...]
kiln: activated install into %garden from [~mister-dister-dozzod-dozzod %garden]
; ~mister-dister-dozzod-dozzod not responding still trying
; ~mister-dister-dozzod-dozzod is ok
kiln: downloading update for %garden from [~mister-dister-dozzod-dozzod %garden]
kiln: finished downloading update for %garden from [~mister-dister-dozzod-dozzod %garden]
error in %hood
%arvo-response
! /:<[276 16].[276 20]>
! /:<[276 11].[276 29]>
! /sur/hood/hoon:<[130 3].[134 17]>
! /sur/hood/hoon:<[129 3].[134 17]>
; ~zod not responding still trying
; ~zod is ok
kiln: beginning install into %base from [~litzod %kids]
kiln: activated install into %base from [~litzod %kids]
kiln: downloading update for %base from [~litzod %kids]
kiln: finished downloading update for %base from [~litzod %kids]
kiln: OTA blocked on {%landscape %webterm %garden %rumors %pals %channel}
[...]
>   arvo: +load next
lull: ~marlet-hinduc
zuse: ~ligfep-radfyn
vane: %ames: ~sitbus-dotmec
ames: larva reload
vane: %behn: ~sogmut-tonmus
vane: %clay: ~fidpyx-tadmes
loom: not enough memory to recenter the guard page
recover: dig: meme
kiln: merge into %base succeeded
kiln: merging %base into %kids
loom: can't move the guard page to the same location (base address 0x2ffe14000)
recover: dig: meme
strange page: 262100, at 0x2fff53e04, off 3ffd4f81
Assertion '0' failed in pkg/noun/events.c:292
Assertion failed: 0 (pkg/noun/events.c: u3e_fault: 292)
pier: serf unexpectedly shut down
lukebuehler commented 1 year ago

Also, if I try |install ~papt %garden or %landscape, it fails too, although not with a bail: meme.

(edit from zach: deleting the long output here because it isn't relevant; it's not a bug, it fails due to kelvin, but we should have a sane error for it)

lukebuehler commented 1 year ago

@zalberico this is what we've seen in the support tickets. @philipcmonk could you look into this?

lukebuehler commented 1 year ago

Here is the |mass output of ~timwyn-batsym:

> |mass
>=
kick: lost %mass on /arvo
~timwyn-batsym:dojo> KB/307.104
    %tri: KB/285.148
    %qua: KB/945.892
    %pen: MB/56.891.836
  --MB/58.466.316
  %hex: KB/999.968
  %pit: B/264
  %lull:
    %dot: MB/1.172.820
    %typ: B/192
  --MB/1.173.012
  %zuse:
    %dot: MB/1.939.088
    %typ: MB/1.325.640
  --MB/3.264.728
  %vane:
    %ames:
      %peers-known: MB/3.267.752
      %peers-alien: KB/10.736
      %dot: MB/7.737.236
      %typ: KB/21.840
      %sac: KB/362.268
    --MB/11.399.832
    %behn:
      %timers: KB/6.700
      %dot: KB/382.420
      %typ: KB/5.712
      %sac: KB/98.616
    --KB/493.448
    %clay:
      %object-store:
        %commits: MB/123.877.276
        %pages: MB/143.444.700
      --MB/267.321.976
      %domestic:
        %base:
          %mime: B/0
          %flue: MB/27.627.776
          %dojo: KB/12.104
        --MB/27.639.880
        %bitcoin:
          %mime: B/0
          %flue: MB/5.283.232
          %dojo: MB/8.376.284
        --MB/13.659.516
        %channel:
          %mime: B/0
          %flue: MB/2.894.728
          %dojo: MB/2.710.088
        --MB/5.604.816
        %garden:
          %mime: B/0
          %flue: MB/4.948.864
          %dojo: MB/3.812.012
        --MB/8.760.876
        %home:
          %mime: KB/64.408
          %flue: B/24
          %dojo: MB/10.272.200
        --MB/10.336.632
        %kids:
          %mime: B/0
          %flue: KB/5.400
          %dojo: KB/4.756
        --KB/10.156
        %landscape:
          %mime: B/0
          %flue: MB/19.319.808
          %dojo: KB/85.520
        --MB/19.405.328
        %pals:
          %mime: B/0
          %flue: MB/18.192.056
          %dojo: KB/1.960
        --MB/18.194.016
        %rumors:
          %mime: B/0
          %flue: KB/998.860
          %dojo: KB/1.828
        --MB/1.000.688
        %webterm:
          %mime: B/0
          %flue: B/24
          %dojo: KB/1.524
        --KB/1.548
        %zone:
          %mime: B/0
          %flue: B/24
          %dojo: KB/1.116
        --KB/1.140
      --MB/104.614.596
      %foreign: MB/32.643.920
      %ford-cache: KB/139.152
      %dot: MB/7.161.288
      %typ: KB/8.544
      %sac: KB/299.496
    --MB/412.188.972
    %dill:
      %hey: B/120
      %dug: B/696
      %dot: KB/393.796
      %typ: KB/202.344
      %sac: KB/120.480
    --KB/717.436
    %eyre:
      %bindings: KB/7.648
      %auth: KB/2.116
      %connections: B/0
      %channels: KB/15.456
      %axle: KB/2.092
      %dot: MB/2.732.416
      %typ: KB/3.320
      %sac: KB/75.536
    --MB/2.838.584
    %gall:
      %foreign: KB/16.408
      %blocked:
        %channel-server: B/576
        %chat: B/592
        %contact-pull-hook: KB/4.052
        %contact-store: KB/1.032
        %docket: KB/24.248
        %engram: KB/14.368
        %file-server: KB/25.740
        %gather: KB/1.164
        %goal-store: B/564
        %gora: B/568
        %groups: KB/1.128
        %hark-graph-hook: KB/1.008
        %hark-store: KB/3.072
        %hark-system-hook: B/792
        %minesweeper: KB/12.472
        %rumors: B/432
        %settings-store: KB/1.032
        %spaces: KB/1.168
        %sphinx: B/540
        %status: B/540
        %treaty: KB/2.032
        %whom: KB/1.104
      --KB/98.224
      %active:
        %acme: KB/2.688
        %azimuth: MB/121.948.008
        %btc-wallet: KB/1.332
        %channel-client: KB/127.824
        %chat-cli: KB/1.692
        %chat-hook: KB/1.016
        %chat-store: B/952
        %chat-view: B/952
        %clock: B/948
        %contact-hook: B/900
        %contact-pull-hook: KB/9.804
        %contact-push-hook: KB/3.008
        %contact-store: MB/1.735.072
        %contact-view: B/900
        %dbug: KB/1.560
        %dm-hook: B/988
        %docket: MB/24.081.708
        %dojo: KB/30.588
        %eth-watcher: MB/15.483.036
        %file-server: MB/20.684.472
        %glob: KB/27.172
        %goad: KB/1.172
        %graph-pull-hook: KB/39.124
        %graph-push-hook: KB/2.568
        %graph-store: MB/225.520.152
        %group-pull-hook: KB/7.912
        %group-push-hook: KB/2.344
        %group-store: MB/1.205.008
        %group-view: KB/1.816
        %hark-chat-hook: B/956
        %hark-graph-hook: KB/54.124
        %hark-group-hook: KB/4.264
        %hark-invite-hook: KB/1.224
        %hark-store: KB/370.068
        %hark-system-hook: KB/1.164
        %herm: KB/1.416
        %hood: KB/30.804
        %invite-hook: B/944
        %invite-store: KB/3.648
        %invite-view: B/900
        %launch: KB/1.816
        %lens: KB/2.040
        %metadata-hook: B/900
        %metadata-pull-hook: KB/408.448
        %metadata-push-hook: KB/2.496
        %metadata-store: KB/60.328
        %notify: KB/1.440
        %observe-hook: KB/6.068
        %pals: KB/2.052
        %ping: KB/1.704
        %rumors: KB/175.264
        %s3-store: KB/1.040
        %sane: B/900
        %settings-store: KB/4.240
        %spider: KB/3.888
        %treaty: KB/47.508
        %weather: KB/144.256
      %dot: MB/3.121.136
      %typ: KB/4.056
      %sac: KB/94.044
    --MB/415.592.484
    %iris:
      %nex: B/0
      %outbound: B/160
      %by-id: B/0
      %by-duct: B/0
      %axle: B/132
      %dot: KB/155.256
      %typ: KB/116.256
      %sac: KB/22.680
    --KB/294.484
    %jael:
      %pki: MB/28.755.472
      %etn: KB/12.204
      %dot: MB/1.381.692
      %typ: KB/5.136
      %sac: KB/56.160
    --MB/30.210.664
    %khan:
      %dot: KB/127.360
      %typ: KB/91.444
      %sac: KB/17.160
    --KB/235.964
  --MB/873.971.868
--MB/937.876.156
total userspace: MB/937.876.156
  kernel: KB/23.196
total arvo stuff: KB/23.196
  warm jet state: KB/169.596
  cold jet state: MB/344.380.844
  hank cache: B/712
  battery hash cache: B/288
  call site cache: B/524
  hot jet state: KB/193.860
total jet stuff: MB/344.745.824
  bytecode programs: MB/1.920.232
  bytecode cache: KB/163.600
total nock stuff: MB/2.083.832
  trace buffer: B/36
  memoization cache: B/288
total road stuff: B/324
space profile: KB/14.448
total marked: GB/1.284.743.780
free lists: MB/2.328.700
sweep: GB/1.284.743.780

>
>
; ~litzod not responding still trying
; ~litzod is ok
~timwyn-batsym:dojo>

For the other ship, |mass crashes the ship entirely, and I cannot capture the console output because the shell buffer overflows with an endless binary dump.

zalberico commented 1 year ago

The garden/landscape install failure is expected because they are blocked on the new kelvin from base.

The problem is the base install failure.

zalberico commented 1 year ago

@philipcmonk and I are looking at this now; we both have access configured via https://github.com/tloncorp/hosting-infrastructure (I added a couple of quick issues/PRs there for setup problems we hit, which I can fix afterwards).

I'm not sure how to test with a 4GB loom. We also need to be able to run the ship while gdb is attached. Can we do that by logging in with two maintenance shells, @lukebuehler?

zalberico commented 1 year ago

Installing gdb looks like it requires pulling down the ships locally; we'll want to figure out how to do that properly with the scripts.

In the meantime, we're trying the following:

  1. Run .^(* %cw /=kids=) on ~litzod, which outputs [130 170.141.184.506.065.177.598.972.720.686.873.706.496].
  2. Run +cat /=kids/130/sys/kelvin on ~litzod, using the 130 from the previous command, to read the zuse at that revision; it outputs [%zuse 415].
  3. We then decrement the revision to find the zuse we want on ~litzod: +cat /=kids/129/sys/kelvin.
  4. Eventually we reach +cat /=kids/126/sys/kelvin, which is the first to output [%zuse 418].
  5. Then, on the ship we want to upgrade, we run |merge %base ~litzod %kids, =cas ud+127, =gem %only-that to upgrade to just 417.
  6. This does just one %base update.
  7. That's likely enough to then go straight to 415, but we can also do 416 specifically in the same way.

We suspect the memory failures occur because 417 and 416 are both memory-heavy updates (the migration to Groups 2, and doccords), and doing both at once is not smart about memory usage. @philipcmonk also suspects one of the memory errors could itself be the issue (the error is wrong or is thrown incorrectly).

We'll update here with the result of the test of upgrading %base to 417 first.

zalberico commented 1 year ago

That worked! @lukebuehler this unblocks us for fixing all of these ships in the way described.

We still want to write up and figure out the following:

zalberico commented 1 year ago

It is important for us to understand this because, if it is what I suspect, then everybody currently still on krljr (for %base) will be stuck there. That is probably a lot of people who haven't booted their ship in a while.

jaesharp commented 1 year ago

Using -send-task-take-gift [%warp (sein:title our now our) %kids ~ %sing %w da+now /], it is possible to get the current clay revision of the %kids desk. Then, with -send-task-take-gift [%warp (sein:title our now our) %kids ~ %sing %x ud+REVISION /sys/kelvin], substitute REVISION (taken from the innermost q of the writ response to the %w warp) with the remote's revision minus one, and keep decrementing until you reach the target revision. Finally, run |merge %base (sein:title our now our) %kids, =cas ud+TARGET, =gem %only-that, substituting TARGET for the revision you discovered (for ~litzod it's 127 per the above; for ~zod it's 125). Note that (sein:title our now our) queries the ship's sponsor rather than a fixed party; this could easily be modified to query the existing desk's sync source for the OTA update target.

Modifying |ota to merge kelvin by kelvin downward when multiple kelvin versions are pending from the sync source (controlled via a configurable strategy, similar to merge) might be an approach worth considering to keep loom usage to a minimum during |ota.

The -send-task-take-gift thread is defined at https://github.com/urbit/docs/blob/4ec23b39378221a18841516afeac3ee9f65b0d59/arvo/clay/examples.md#introduction

wbsquared commented 1 year ago

> Installing gdb looks like it requires pulling down the ships locally, we'll want to figure out how to do that properly with the scripts.
>
> In the meantime, we're trying the following:
>
>   • Specifically only do the upgrade to 417 from 418, we can do that using a similar command to what we did to upgrade the ships that were from before content distribution. Specifically we did this:
>   1. .^(* %cw /=kids=) which outputs [130 170.141.184.506.065.177.598.972.720.686.873.706.496]
>   2. +cat /=kids/130/sys/kelvin <- uses the 130 from the above command to get the zuse which outputs [%zuse 415]
>   3. We then decrement the version to find the zuse we want: +cat /=kids/129/sys/kelvin
>   4. Eventually we get +cat /=kids/126/sys/kelvin which is the first to output [%zuse 418]
>   5. Then we run |merge %base ~litzod %kids, =cas ud+127, =gem %only-that to upgrade us to just 417
>   6. This will then just do one %base update.
>   7. That's likely enough to just go to 415, but we can also do 416 specifically in the same way.
>
> We suspect the memory failures are because 417 and 416 are both high memory updates (migration to groups 2, and doccords) and doing both is not being smart about memory usage. @philipcmonk also suspects one of the memory errors could itself be the issue (the error is wrong or throwing incorrectly).
>
> We'll update here with the result of the test of upgrading %base to 417 first.

When I do this and decrement, it stays at %418 for a while (starting at 17 and running down to 11), and then at 10 it outputs [%zuse 419]; I never see anything other than 418 and 419 in these outputs. Is this expected? Running the update using values 11 and 10 still aborts with the same error. Am I missing something?

wbsquared commented 1 year ago

FYI, my output to the original command was:

[17 170.141.184.505.844.992.463.969.352.820.537.163.776]

And when I entered step 2, I got an output of [%zuse 418], not 415 as you indicated.

gorjes2300 commented 1 year ago

Hello, I am currently having a similar issue. I rebooted my pier after some time and tried to follow the steps above to update my zuse version from 418 to 417, but I am unable to use |ota ~litzod to update it any further. Below is the error message:

arvo: |ota ~litzod
>=
kiln: activated install into %base from ~litzod/%kids
kiln: downloading update for %base from ~litzod/%kids
kiln: finished downloading update for %base from ~litzod/%kids
clay: wait-for-kelvin, [need=[%zuse %417] have={[lal=%zuse num=413]}]
kiln: merge into %base succeeded
kiln: kids merge into [~ u=%kids]
clay: wait-for-kelvin, [need=[%zuse %417] have={[lal=%zuse num=413]}]
kiln: kids merge to %kids succeeded
clay: wait-for-kelvin, [need=[%zuse %417] have={[lal=%zuse num=413]}]

Would really appreciate any help here - thanks again.