urbit / arvo

https://github.com/urbit/urbit
110 stars 57 forks source link

The plight of zuse.hoon:2701 - jam'ing very slow #709

Closed mikolajpp closed 6 years ago

mikolajpp commented 6 years ago

Having been deploying plato for some time now, I have accustomed myself with effects of extensive merges and syncs. It seems (no evidence except statistical stack traces which always hit jam at zuse.hoon:2701) that it is very slow because of packet encryption?

Whenever I let my clay history get slightly larger (say ~50 commits to clay of few hoon files), the merges become unbearably long, in the order of half an hours, or hours even.

The only help for now have been wrecking moons left and right and putting the code from scratch on a new one.

Am I hitting some strange bug, or is it considered a normal operation presently? I would think since jam is jetted, it should have negligible impact.

ohAitch commented 6 years ago

Is this urbit/urbit#936 perhaps? I think the bottleneck there was equality testing, which ++jam uses.

cgyarvin commented 6 years ago

Pretty sure that analysis is correct. Are you using the latest vere here? We fixed some problems, but I'm sure Paul would love to hear that his heuristic equality tester is acting up again :-/

On Tue, Apr 10, 2018 at 8:39 AM, Anton Dyudin notifications@github.com wrote:

Is this urbit/urbit#936 https://github.com/urbit/urbit/issues/936 perhaps? I think the bottleneck there was equality testing, which ++jam uses.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-380146958, or mute the thread https://github.com/notifications/unsubscribe-auth/AALyAd86xgfZz7Nk9RMUrchfrCa68g7yks5tnNI2gaJpZM4TOczR .

frodwith commented 6 years ago

Jam being jetted doesn't mean that it is fast. If it has a lot of work to do, it will take a while. There's a possibility that there is a significant performance regression from the previous commits (mentioned in the linked issue). The way to test is to try with the previous commits and see if it has the same problems. With reproducible failing cases, improvement is possible. Your help in discovering such cases is much appreciated, @mikolajpp.

cgyarvin commented 6 years ago

True — but I don’t think anything Mikolaj is doing here should be jam-bound.

Seems like a case for deeper tool usage or improvement, not haphazard folk bisection. Isn’t the profiler already instrumented to detect time spent in the comparator? It just feels like we are not approaching this with adequate seriousness and rigor, which is why the wound keeps reopening.

Sent from my iPhone

On Apr 10, 2018, at 11:46 AM, Paul Driver notifications@github.com wrote:

Jam being jetted doesn't mean that it is fast. If it has a lot of work to do, it will take a while. There's a possibility that there is a significant performance regression from the previous commits (mentioned in the linked issue). The way to test is to try with the previous commits and see if it has the same problems. With reproducible failing cases, improvement is possible. Your help in discovering such cases is much appreciated, @mikolajpp.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

frodwith commented 6 years ago

The changes to jam and equality are related but separate, and oft confused in the literature, but I will continue speaking of them as one issue.

Isn’t the profiler already instrumented to detect time spent in the comparator?

Yes. I used that profiling to guide my previous efforts in solving #936, by looking at the failing cases under profiling both before and after the changes. I don't know if @mikolajpp's cases are jam-bound, equality bound, or otherwise without cases to profile.

mikolajpp commented 6 years ago

I was using vere from around the time we got first working meson build. Now I have updated it, but don't really a way to test it.

Fresh merges are always fast, no matter the history. The problem only appears after initial merge and with integration of a more extensive history.

Once fakenet local networking works again I will try to reproduce it somehow, if it is still reproducible.

vvisigoth commented 6 years ago

Was wrestling with a problem on the testnet and ran into this in the wild. Reverting to vere 9978ad8b7204ec984cfa8ce2552af4dca7b7061a seemed to fix it.

joemfb commented 6 years ago

I'm hitting this bug trying to boot ~master-morzod into the new era. @keatondunsford, do you know if anyone else has been having trouble. I'm running vere from latest master.

frodwith commented 6 years ago

I'm curious whether this allocation bug I ran into recently (which made a lot of u3h_puts in a loop very slow) could be having an impact here, since that is something that jam does. this pull request should fix the allocation bug, maybe it will have some effect on whatever this is?: https://github.com/urbit/urbit/pull/987

joemfb commented 6 years ago

I rebuilt with the fix in #987 and tried booting a new comet -- same problem. I'm on OS X High Sierra, with a pretty standard dev setup. I'm surprised no one else is hitting this.

frodwith commented 6 years ago

I think other people are hitting it. I think Anthony's got it on his list.

On Sun, May 27, 2018 at 4:35 PM, Joe Bryan notifications@github.com wrote:

I rebuilt with the fix in #987 and tried booting a new comet -- same problem. I'm on OS X High Sierra, with a pretty standard dev setup. I'm surprised no one else is hitting this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388068, or mute the thread https://github.com/notifications/unsubscribe-auth/AADRp_nGvZnIk6cMOlAbLNlTkN4ZxvY6ks5t2zhEgaJpZM4TOczR .

frodwith commented 6 years ago

Thanks for trying, by the way. It was just a hunch that fixing the allocation bug might help.

On Sun, May 27, 2018 at 4:36 PM, Paul Driver frodwith@gmail.com wrote:

I think other people are hitting it. I think Anthony's got it on his list.

On Sun, May 27, 2018 at 4:35 PM, Joe Bryan notifications@github.com wrote:

I rebuilt with the fix in #987 and tried booting a new comet -- same problem. I'm on OS X High Sierra, with a pretty standard dev setup. I'm surprised no one else is hitting this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388068, or mute the thread https://github.com/notifications/unsubscribe-auth/AADRp_nGvZnIk6cMOlAbLNlTkN4ZxvY6ks5t2zhEgaJpZM4TOczR .

cgyarvin commented 6 years ago

Anthony, did you ever bisect this? Keaton, can you reproduce and bisect?

Sent from my iPhone

On May 27, 2018, at 4:37 PM, Paul Driver notifications@github.com wrote:

Thanks for trying, by the way. It was just a hunch that fixing the allocation bug might help.

On Sun, May 27, 2018 at 4:36 PM, Paul Driver frodwith@gmail.com wrote:

I think other people are hitting it. I think Anthony's got it on his list.

On Sun, May 27, 2018 at 4:35 PM, Joe Bryan notifications@github.com wrote:

I rebuilt with the fix in #987 and tried booting a new comet -- same problem. I'm on OS X High Sierra, with a pretty standard dev setup. I'm surprised no one else is hitting this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388068, or mute the thread https://github.com/notifications/unsubscribe-auth/AADRp_nGvZnIk6cMOlAbLNlTkN4ZxvY6ks5t2zhEgaJpZM4TOczR .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

vvisigoth commented 6 years ago

This was already on my list for tomorrow (the bisection). Indisposed right now. Will do tomorrow morning if no one gets a chance sooner.

On Sun, May 27, 2018 at 4:47 PM cgyarvin notifications@github.com wrote:

Anthony, did you ever bisect this? Keaton, can you reproduce and bisect?

Sent from my iPhone

On May 27, 2018, at 4:37 PM, Paul Driver notifications@github.com wrote:

Thanks for trying, by the way. It was just a hunch that fixing the allocation bug might help.

On Sun, May 27, 2018 at 4:36 PM, Paul Driver frodwith@gmail.com wrote:

I think other people are hitting it. I think Anthony's got it on his list.

On Sun, May 27, 2018 at 4:35 PM, Joe Bryan notifications@github.com wrote:

I rebuilt with the fix in #987 and tried booting a new comet -- same problem. I'm on OS X High Sierra, with a pretty standard dev setup. I'm surprised no one else is hitting this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388068, or mute the thread < https://github.com/notifications/unsubscribe-auth/AADRp_nGvZnIk6cMOlAbLNlTkN4ZxvY6ks5t2zhEgaJpZM4TOczR

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388631, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4_IQj8B3Ul1LFy9oncIKjUrbjjWkks5t2zr8gaJpZM4TOczR .

-- — ~poldec-tonteg http://urbit.org

vvisigoth commented 6 years ago

Getting started on this. Reproducing on the MVE testnet, just FYI.

A

— ~poldec-tonteg http://urbit.org

On Sun, May 27, 2018 at 4:51 PM, Anthony Arroyo anthony@tlon.io wrote:

This was already on my list for tomorrow (the bisection). Indisposed right now. Will do tomorrow morning if no one gets a chance sooner.

On Sun, May 27, 2018 at 4:47 PM cgyarvin notifications@github.com wrote:

Anthony, did you ever bisect this? Keaton, can you reproduce and bisect?

Sent from my iPhone

On May 27, 2018, at 4:37 PM, Paul Driver notifications@github.com wrote:

Thanks for trying, by the way. It was just a hunch that fixing the allocation bug might help.

On Sun, May 27, 2018 at 4:36 PM, Paul Driver frodwith@gmail.com wrote:

I think other people are hitting it. I think Anthony's got it on his list.

On Sun, May 27, 2018 at 4:35 PM, Joe Bryan notifications@github.com wrote:

I rebuilt with the fix in #987 and tried booting a new comet -- same problem. I'm on OS X High Sierra, with a pretty standard dev setup. I'm surprised no one else is hitting this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388068, or mute the thread https://github.com/notifications/unsubscribe-auth/AADRp_ nGvZnIk6cMOlAbLNlTkN4ZxvY6ks5t2zhEgaJpZM4TOczR .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388631, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4_IQj8B3Ul1LFy9oncIKjUrbjjWkks5t2zr8gaJpZM4TOczR .

-- — ~poldec-tonteg http://urbit.org

vvisigoth commented 6 years ago

J'accuse: https://github.com/urbit/urbit/commit/ 26996c2b1647c0017bea2ab8789768daf0208709 Was able to reproduce this reliably on High Sierra 10.13.3 The commit right before this is smooth like butter.

I spoke with Keat and he said that he hasn't seen this at all on Linux, which gives one pause.

— ~poldec-tonteg http://urbit.org

On Mon, May 28, 2018 at 11:00 AM, Anthony Arroyo anthony@tlon.io wrote:

Getting started on this. Reproducing on the MVE testnet, just FYI.

A

— ~poldec-tonteg http://urbit.org

On Sun, May 27, 2018 at 4:51 PM, Anthony Arroyo anthony@tlon.io wrote:

This was already on my list for tomorrow (the bisection). Indisposed right now. Will do tomorrow morning if no one gets a chance sooner.

On Sun, May 27, 2018 at 4:47 PM cgyarvin notifications@github.com wrote:

Anthony, did you ever bisect this? Keaton, can you reproduce and bisect?

Sent from my iPhone

On May 27, 2018, at 4:37 PM, Paul Driver notifications@github.com wrote:

Thanks for trying, by the way. It was just a hunch that fixing the allocation bug might help.

On Sun, May 27, 2018 at 4:36 PM, Paul Driver frodwith@gmail.com wrote:

I think other people are hitting it. I think Anthony's got it on his list.

On Sun, May 27, 2018 at 4:35 PM, Joe Bryan <notifications@github.com

wrote:

I rebuilt with the fix in #987 and tried booting a new comet -- same problem. I'm on OS X High Sierra, with a pretty standard dev setup. I'm surprised no one else is hitting this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388068, or mute the thread https://github.com/notifications/unsubscribe-auth/AADRp_nGv ZnIk6cMOlAbLNlTkN4ZxvY6ks5t2zhEgaJpZM4TOczR .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388631, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4_IQj8B3Ul1LFy9oncIKjUrbjjWkks5t2zr8gaJpZM4TOczR .

-- — ~poldec-tonteg http://urbit.org

frodwith commented 6 years ago

Not including -O3 will have some effect on performance...

On Mon, May 28, 2018 at 12:45 PM, Anthony Arroyo notifications@github.com wrote:

J'accuse: https://github.com/urbit/urbit/commit/ 26996c2b1647c0017bea2ab8789768daf0208709 Was able to reproduce this reliably on High Sierra 10.13.3 The commit right before this is smooth like butter.

I spoke with Keat and he said that he hasn't seen this at all on Linux, which gives one pause.

— ~poldec-tonteg http://urbit.org

On Mon, May 28, 2018 at 11:00 AM, Anthony Arroyo anthony@tlon.io wrote:

Getting started on this. Reproducing on the MVE testnet, just FYI.

A

— ~poldec-tonteg http://urbit.org

On Sun, May 27, 2018 at 4:51 PM, Anthony Arroyo anthony@tlon.io wrote:

This was already on my list for tomorrow (the bisection). Indisposed right now. Will do tomorrow morning if no one gets a chance sooner.

On Sun, May 27, 2018 at 4:47 PM cgyarvin notifications@github.com wrote:

Anthony, did you ever bisect this? Keaton, can you reproduce and bisect?

Sent from my iPhone

On May 27, 2018, at 4:37 PM, Paul Driver notifications@github.com wrote:

Thanks for trying, by the way. It was just a hunch that fixing the allocation bug might help.

On Sun, May 27, 2018 at 4:36 PM, Paul Driver frodwith@gmail.com wrote:

I think other people are hitting it. I think Anthony's got it on his list.

On Sun, May 27, 2018 at 4:35 PM, Joe Bryan < notifications@github.com

wrote:

I rebuilt with the fix in #987 and tried booting a new comet -- same problem. I'm on OS X High Sierra, with a pretty standard dev setup. I'm surprised no one else is hitting this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <https://github.com/urbit/arvo/issues/709#issuecomment-392388068 , or mute the thread https://github.com/notifications/unsubscribe-auth/AADRp_nGv ZnIk6cMOlAbLNlTkN4ZxvY6ks5t2zhEgaJpZM4TOczR .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388631, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4_ IQj8B3Ul1LFy9oncIKjUrbjjWkks5t2zr8gaJpZM4TOczR .

-- — ~poldec-tonteg http://urbit.org

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392593199, or mute the thread https://github.com/notifications/unsubscribe-auth/AADRp-D4P7liGVNpJVU5IzMhvT2eBxXEks5t3FPPgaJpZM4TOczR .

frodwith commented 6 years ago

i think release will use -O2 by default, based on the following snippet from http://mesonbuild.com/Running-Meson.html

plain no extra build flags are used, even for compiler warnings, useful for distro packagers and other cases where you need to specify all arguments by yourself debug debug info is generated but the result is not optimized, this is the default debugoptimized debug info is generated and the code is optimized (on most compilers this means -g -O2) release full optimization, no debug info

On Mon, May 28, 2018 at 12:49 PM, Paul Driver frodwith@gmail.com wrote:

Not including -O3 will have some effect on performance...

On Mon, May 28, 2018 at 12:45 PM, Anthony Arroyo <notifications@github.com

wrote:

J'accuse: https://github.com/urbit/urbit/commit/ 26996c2b1647c0017bea2ab8789768daf0208709 Was able to reproduce this reliably on High Sierra 10.13.3 The commit right before this is smooth like butter.

I spoke with Keat and he said that he hasn't seen this at all on Linux, which gives one pause.

— ~poldec-tonteg http://urbit.org

On Mon, May 28, 2018 at 11:00 AM, Anthony Arroyo anthony@tlon.io wrote:

Getting started on this. Reproducing on the MVE testnet, just FYI.

A

— ~poldec-tonteg http://urbit.org

On Sun, May 27, 2018 at 4:51 PM, Anthony Arroyo anthony@tlon.io wrote:

This was already on my list for tomorrow (the bisection). Indisposed right now. Will do tomorrow morning if no one gets a chance sooner.

On Sun, May 27, 2018 at 4:47 PM cgyarvin notifications@github.com wrote:

Anthony, did you ever bisect this? Keaton, can you reproduce and bisect?

Sent from my iPhone

On May 27, 2018, at 4:37 PM, Paul Driver notifications@github.com wrote:

Thanks for trying, by the way. It was just a hunch that fixing the allocation bug might help.

On Sun, May 27, 2018 at 4:36 PM, Paul Driver frodwith@gmail.com wrote:

I think other people are hitting it. I think Anthony's got it on his list.

On Sun, May 27, 2018 at 4:35 PM, Joe Bryan < notifications@github.com

wrote:

I rebuilt with the fix in #987 and tried booting a new comet -- same problem. I'm on OS X High Sierra, with a pretty standard dev setup. I'm surprised no one else is hitting this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <https://github.com/urbit/arvo/issues/709#issuecomment-392388068 , or mute the thread https://github.com/notifications/unsubscribe-auth/AADRp_nGv ZnIk6cMOlAbLNlTkN4ZxvY6ks5t2zhEgaJpZM4TOczR .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388631, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4_IQj 8B3Ul1LFy9oncIKjUrbjjWkks5t2zr8gaJpZM4TOczR .

-- — ~poldec-tonteg http://urbit.org

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392593199, or mute the thread https://github.com/notifications/unsubscribe-auth/AADRp-D4P7liGVNpJVU5IzMhvT2eBxXEks5t3FPPgaJpZM4TOczR .

cgyarvin commented 6 years ago

This can’t be the cause (it’s Paul’s fix — didn’t you see it before he sent it?)

Sent from my iPhone

On May 28, 2018, at 12:55 PM, Paul Driver notifications@github.com wrote:

i think release will use -O2 by default, based on the following snippet from http://mesonbuild.com/Running-Meson.html

plain no extra build flags are used, even for compiler warnings, useful for distro packagers and other cases where you need to specify all arguments by yourself debug debug info is generated but the result is not optimized, this is the default debugoptimized debug info is generated and the code is optimized (on most compilers this means -g -O2) release full optimization, no debug info

On Mon, May 28, 2018 at 12:49 PM, Paul Driver frodwith@gmail.com wrote:

Not including -O3 will have some effect on performance...

On Mon, May 28, 2018 at 12:45 PM, Anthony Arroyo <notifications@github.com

wrote:

J'accuse: https://github.com/urbit/urbit/commit/ 26996c2b1647c0017bea2ab8789768daf0208709 Was able to reproduce this reliably on High Sierra 10.13.3 The commit right before this is smooth like butter.

I spoke with Keat and he said that he hasn't seen this at all on Linux, which gives one pause.

— ~poldec-tonteg http://urbit.org

On Mon, May 28, 2018 at 11:00 AM, Anthony Arroyo anthony@tlon.io wrote:

Getting started on this. Reproducing on the MVE testnet, just FYI.

A

— ~poldec-tonteg http://urbit.org

On Sun, May 27, 2018 at 4:51 PM, Anthony Arroyo anthony@tlon.io wrote:

This was already on my list for tomorrow (the bisection). Indisposed right now. Will do tomorrow morning if no one gets a chance sooner.

On Sun, May 27, 2018 at 4:47 PM cgyarvin notifications@github.com wrote:

Anthony, did you ever bisect this? Keaton, can you reproduce and bisect?

Sent from my iPhone

On May 27, 2018, at 4:37 PM, Paul Driver notifications@github.com wrote:

Thanks for trying, by the way. It was just a hunch that fixing the allocation bug might help.

On Sun, May 27, 2018 at 4:36 PM, Paul Driver frodwith@gmail.com wrote:

I think other people are hitting it. I think Anthony's got it on his list.

On Sun, May 27, 2018 at 4:35 PM, Joe Bryan < notifications@github.com

wrote:

I rebuilt with the fix in #987 and tried booting a new comet -- same problem. I'm on OS X High Sierra, with a pretty standard dev setup. I'm surprised no one else is hitting this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <https://github.com/urbit/arvo/issues/709#issuecomment-392388068 , or mute the thread https://github.com/notifications/unsubscribe-auth/AADRp_nGv ZnIk6cMOlAbLNlTkN4ZxvY6ks5t2zhEgaJpZM4TOczR .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392388631, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4_IQj 8B3Ul1LFy9oncIKjUrbjjWkks5t2zr8gaJpZM4TOczR .

-- — ~poldec-tonteg http://urbit.org

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392593199, or mute the thread https://github.com/notifications/unsubscribe-auth/AADRp-D4P7liGVNpJVU5IzMhvT2eBxXEks5t3FPPgaJpZM4TOczR .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

frodwith commented 6 years ago

anthony's link just goes to master, here is the correct link: 26996c2b1647c0017bea2ab8789768daf0208709

vvisigoth commented 6 years ago

Whoop sorry.

On Mon, May 28, 2018 at 1:01 PM Paul Driver notifications@github.com wrote:

anthony's link just goes to master, here is the correct link: 26996c2b1647c0017bea2ab8789768daf0208709 https://github.com/urbit/urbit/commit/26996c2b1647c0017bea2ab8789768daf0208709

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392595285, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4zyoeH0B5_EY639Ezh7Dm-KcR2axks5t3FezgaJpZM4TOczR .

-- — ~poldec-tonteg http://urbit.org

cgyarvin commented 6 years ago

The opt level now is set in the bowels of meson as far as I can tell.

Interesting that O2 is that different...

Sent from my iPhone

On May 28, 2018, at 1:03 PM, Anthony Arroyo notifications@github.com wrote:

Whoop sorry.

On Mon, May 28, 2018 at 1:01 PM Paul Driver notifications@github.com wrote:

anthony's link just goes to master, here is the correct link: 26996c2b1647c0017bea2ab8789768daf0208709 https://github.com/urbit/urbit/commit/26996c2b1647c0017bea2ab8789768daf0208709

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392595285, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4zyoeH0B5_EY639Ezh7Dm-KcR2axks5t3FezgaJpZM4TOczR .

-- — ~poldec-tonteg http://urbit.org — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

frodwith commented 6 years ago

I recommend reverting this commit. I mess with these flags arrays all the time when I'm hacking vere to debug things anyway.

On Mon, May 28, 2018 at 1:15 PM, cgyarvin notifications@github.com wrote:

The opt level now is set in the bowels of meson as far as I can tell.

Interesting that O2 is that different...

Sent from my iPhone

On May 28, 2018, at 1:03 PM, Anthony Arroyo notifications@github.com wrote:

Whoop sorry.

On Mon, May 28, 2018 at 1:01 PM Paul Driver notifications@github.com wrote:

anthony's link just goes to master, here is the correct link: 26996c2b1647c0017bea2ab8789768daf0208709 https://github.com/urbit/urbit/commit/26996c2b1647c0017bea2ab8789768 daf0208709

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392595285, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4zyoeH0B5_ EY639Ezh7Dm-KcR2axks5t3FezgaJpZM4TOczR .

-- — ~poldec-tonteg http://urbit.org — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392597144, or mute the thread https://github.com/notifications/unsubscribe-auth/AADRpxLULLSza3PzHrxqxxsrg1nDPZAIks5t3FrxgaJpZM4TOczR .

vvisigoth commented 6 years ago

Can confirm that latest with this commit reverted builds like whoa and doesn't seem to dig over.

Any objections to reverting?

— ~poldec-tonteg http://urbit.org

On Mon, May 28, 2018 at 1:18 PM, Paul Driver notifications@github.com wrote:

I recommend reverting this commit. I mess with these flags arrays all the time when I'm hacking vere to debug things anyway.

On Mon, May 28, 2018 at 1:15 PM, cgyarvin notifications@github.com wrote:

The opt level now is set in the bowels of meson as far as I can tell.

Interesting that O2 is that different...

Sent from my iPhone

On May 28, 2018, at 1:03 PM, Anthony Arroyo notifications@github.com wrote:

Whoop sorry.

On Mon, May 28, 2018 at 1:01 PM Paul Driver notifications@github.com wrote:

anthony's link just goes to master, here is the correct link: 26996c2b1647c0017bea2ab8789768daf0208709 https://github.com/urbit/urbit/commit/ 26996c2b1647c0017bea2ab8789768 daf0208709

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392595285, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4zyoeH0B5_ EY639Ezh7Dm-KcR2axks5t3FezgaJpZM4TOczR .

-- — ~poldec-tonteg http://urbit.org — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392597144, or mute the thread https://github.com/notifications/unsubscribe-auth/ AADRpxLULLSza3PzHrxqxxsrg1nDPZAIks5t3FrxgaJpZM4TOczR

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392597438, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4wOb4OxJIhhjDZmCtIvqaDVPPt0eks5t3FuCgaJpZM4TOczR .

cgyarvin commented 6 years ago

Make sure the readme gets reverted as well. We’ll figure it out, this ain’t closed.

Sent from my iPhone

On May 28, 2018, at 3:57 PM, Anthony Arroyo notifications@github.com wrote:

Can confirm that latest with this commit reverted builds like whoa and doesn't seem to dig over.

Any objections to reverting?

— ~poldec-tonteg http://urbit.org

On Mon, May 28, 2018 at 1:18 PM, Paul Driver notifications@github.com wrote:

I recommend reverting this commit. I mess with these flags arrays all the time when I'm hacking vere to debug things anyway.

On Mon, May 28, 2018 at 1:15 PM, cgyarvin notifications@github.com wrote:

The opt level now is set in the bowels of meson as far as I can tell.

Interesting that O2 is that different...

Sent from my iPhone

On May 28, 2018, at 1:03 PM, Anthony Arroyo notifications@github.com wrote:

Whoop sorry.

On Mon, May 28, 2018 at 1:01 PM Paul Driver notifications@github.com wrote:

anthony's link just goes to master, here is the correct link: 26996c2b1647c0017bea2ab8789768daf0208709 https://github.com/urbit/urbit/commit/ 26996c2b1647c0017bea2ab8789768 daf0208709

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392595285, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4zyoeH0B5_ EY639Ezh7Dm-KcR2axks5t3FezgaJpZM4TOczR .

-- — ~poldec-tonteg http://urbit.org — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392597144, or mute the thread https://github.com/notifications/unsubscribe-auth/ AADRpxLULLSza3PzHrxqxxsrg1nDPZAIks5t3FrxgaJpZM4TOczR

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-392597438, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiQ4wOb4OxJIhhjDZmCtIvqaDVPPt0eks5t3FuCgaJpZM4TOczR .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

joemfb commented 6 years ago

With this revert in place on latest master, I can now boot a comet without any stack overflows. I'll try later with my planet -- I expect that to work as well. Thanks!

joemfb commented 6 years ago

So, I'm still having what appears to be a variation of this problem with my ~planet.

Originally, for both planet and comet, I was getting this failure, looping indefinitely:

|«ames»ng hood, this may take a few minutes>
recover: dig: over
over
[%swim-call-vane %a %soft ~]
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.673 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.674 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.675 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.676 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.677 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.678 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.679 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.679 13].[2.679 51]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.679 17].[2.679 51]>

After updating vere to build with -O3, I was able to boot a new comet successfully. (I didn't try continuing on with the same comet.)

I tried my planet again with -w/-t, and was able to successfully %dill-init as a comet, but can't complete the upgrade to the planet. It loops indefinitely with this error:

[%dill-init ~master-morzod %hood]
activated sync from %kids on ~binzod to %base
activated sync from %base on ~master-morzod to %home
sync succeeded from %base on ~master-morzod to %home
activated sync from %base on ~master-morzod to %kids
sync succeeded from %base on ~master-morzod to %kids
>> [%mo-not-running %talk %peer]
>> [%mo-not-running %dojo %peer]
>> [%mo-not-running %hall %peer]
>> [%mo-not-running %hall %peer]
'initial merge succeeded'
[%vega-start-hoon /~master-morzod/home/~2018.5.29..07.03.35..6160/sys/hoon]
<awaiting hood, this may take a few minutes>
recover: dig: over
over
/~master-morzod/home/~2018.5.29..07.03.35..6160/sys/hoon

I'm not sure how to debug that -- I don't think I've seen "recover: dig: over" without a stacktrace before.

Also, since I'm not crystal clear on the multi-homing identity upgrade semantics, has my planet completed the ticket handshake with its star? How would I check that from the star side?

cgyarvin commented 6 years ago

Your planet definitely exists. What’s on that stack? Can you get a C trace?

Sent from my iPhone

On May 30, 2018, at 2:06 PM, Joe Bryan notifications@github.com wrote:

So, I'm still having what appears to be a variation of this problem with my ~planet.

Originally, for both planet and comet, I was getting this failure, looping indefinitely:

|«ames»ng hood, this may take a few minutes> recover: dig: over over [%swim-call-vane %a %soft ~] /~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.673 9].[2.683 52]> /~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.674 9].[2.683 52]> /~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.675 9].[2.683 52]> /~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.676 9].[2.683 52]> /~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.677 9].[2.683 52]> /~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.678 9].[2.683 52]> /~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.679 9].[2.683 52]> /~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.679 13].[2.679 51]> /~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.679 17].[2.679 51]> After updating vere to build with -O3, I was able to boot a new comet successfully. (I didn't try continuing on with the same comet.)

I tried my planet again with -w/-t, and was able to successfully %dill-init as a comet, but can't complete the upgrade to the planet. It loops indefinitely with this error:

[%dill-init ~master-morzod %hood] activated sync from %kids on ~binzod to %base activated sync from %base on ~master-morzod to %home sync succeeded from %base on ~master-morzod to %home activated sync from %base on ~master-morzod to %kids sync succeeded from %base on ~master-morzod to %kids

[%mo-not-running %talk %peer] [%mo-not-running %dojo %peer] [%mo-not-running %hall %peer] [%mo-not-running %hall %peer] 'initial merge succeeded' [%vega-start-hoon /~master-morzod/home/~2018.5.29..07.03.35..6160/sys/hoon] <awaiting hood, this may take a few minutes> recover: dig: over over /~master-morzod/home/~2018.5.29..07.03.35..6160/sys/hoon I'm not sure how to debug that -- I don't think I've seen "recover: dig: over" without a stacktrace before.

Also, since I'm not crystal clear on the multi-homing identity upgrade semantics, has my planet completed the ticket handshake with its star? How would I check that from the star side?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

joemfb commented 6 years ago

Here's some stacktraces procured by running the planet in lldb and hitting ^c:

Neither build had the same interim printfs (ie, %dill-init and so on) -- I'm not sure if that's because the lldb terminal environment, or some more significant difference. The debug build overflowed repeatedly and quickly, and the stack was different each time I interrupted it. The release build consistently had the u3v_poke and lots of _nock_on on the stack if I interrupted it, but would crash itself in u3wa_sub if I did not.

I'm not sure how to interpret any of that. Happy to try other things, or pack up the pier if you'd like to see it.

cgyarvin commented 6 years ago

Hitting ^C at some random moment tells us absolutely nothing. These stack traces are obviously not deep.

The question is what the stack looks like when it crashes. You'll obviously need to disable the stack overflow handler to get that trace.

On Wed, May 30, 2018 at 3:09 PM, Joe Bryan notifications@github.com wrote:

Here's some stacktraces procured by running the planet in lldb and hitting ^c:

Neither build had the same interim printfs (ie, %dill-init and so on) -- I'm not sure if that's because the lldb terminal environment, or some more significant difference. The debug build overflowed repeatedly and quickly, and the stack was different each time I interrupted it. The release build consistently had the u3v_poke and lots of _nock_on on the stack if I interrupted it, but would crash itself in u3wa_sub if I did not.

I'm not sure how to interpret any of that. Happy to try other things, or pack up the pier if you'd like to see it.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/urbit/arvo/issues/709#issuecomment-393336075, or mute the thread https://github.com/notifications/unsubscribe-auth/AALyAXqvzFt6OUevQ1tdXu1nOqkN7Rjgks5t3xi2gaJpZM4TOczR .

joemfb commented 6 years ago

In manage.c, #define NO_OVERFLOW.

Core dump from release build:

* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x000000010d5f5bd4 urbit`u3r_mug + 4
    frame #1: 0x000000010d5f5ca3 urbit`u3r_mug + 211
    frame #2: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #3: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #4: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #5: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #6: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #7: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #8: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #9: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #10: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #11: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #12: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #13: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #14: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #15: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #16: 0x000000010d5f5cad urbit`u3r_mug + 221
    frame #17: 0x000000010d5f5cad urbit`u3r_mug + 221

Release build run in lldb:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff72538b6e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff72703080 libsystem_pthread.dylib`pthread_kill + 333
    frame #2: 0x00007fff724941ae libsystem_c.dylib`abort + 127
    frame #3: 0x000000010002130a urbit`u3m_bail + 58
    frame #4: 0x000000010002178a urbit`u3m_error + 26
    frame #5: 0x000000010002aa3b urbit`u3wa_sub + 91
    frame #6: 0x000000010001fbd3 urbit`_cj_kick_z + 179
    frame #7: 0x000000010001f760 urbit`u3j_kick + 528
    frame #8: 0x0000000100023ad9 urbit`_n_nock_on + 249
    frame #9: 0x0000000100023ef9 urbit`_n_nock_on + 1305
    frame #10: 0x0000000100023ee4 urbit`_n_nock_on + 1284
    frame #11: 0x0000000100023ef9 urbit`_n_nock_on + 1305
    frame #12: 0x0000000100023acc urbit`_n_nock_on + 236
    frame #13: 0x0000000100023b65 urbit`_n_nock_on + 389
    frame #14: 0x0000000100023ee4 urbit`_n_nock_on + 1284
    frame #15: 0x0000000100023ef9 urbit`_n_nock^D

Core dump from debug build (much quicker crash):

* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x000000010c285a38 urbit`_song_x(a=3221668534, b=3221668534, uni=(urbit`_sung_one at retrieve.c:547)) at retrieve.c:763
    frame #1: 0x000000010c2862f0 urbit`u3r_sung(a=3221668534, b=3221668534) at retrieve.c:884
    frame #2: 0x000000010c2769dc urbit`_ch_node_gut(han_u=0x00000002030abafc, lef_w=15, rem_w=25359, key=3221668534) at hashtable.c:646
    frame #3: 0x000000010c276aad urbit`_ch_node_gut(han_u=0x000000020341d240, lef_w=20, rem_w=90895, key=3221668534) at hashtable.c:659
    frame #4: 0x000000010c27689a urbit`u3h_gut(har_p=14131959, key=3221668534) at hashtable.c:694
    frame #5: 0x000000010c278f3b urbit`u3j_find(bat=3221668534) at jets.c:206
    frame #6: 0x000000010c2795d9 urbit`u3j_kick(cor=3221702479, axe=8) at jets.c:613
    frame #7: 0x000000010c283c4e urbit`_n_nock_on(bus=3756596240, fol=3221664515) at nock.c:341
    frame #8: 0x000000010c283bcf urbit`_n_nock_on(bus=3756596240, fol=3221863267) at nock.c:320
    frame #9: 0x000

Debug build running in lldb:

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSEGV
  * frame #0: 0x000000010003ea92 urbit`_song_x(a=143, b=143, uni=(urbit`_sang_one at retrieve.c:456)) at retrieve.c:766
    frame #1: 0x000000010003e9c0 urbit`u3r_sang(a=143, b=143) at retrieve.c:849
    frame #2: 0x000000010003f28c urbit`u3r_sing(a=143, b=143) at retrieve.c:863
    frame #3: 0x00000001000339fe urbit`_cj_fine(cup=3234234471, mop=3234234465, cor=3221702449) at jets.c:564
    frame #4: 0x0000000100034213 urbit`_cj_fine(cup=3234234471, mop=3234234465, cor=3221702449) at jets.c:591
    frame #5: 0x0000000100034213 urbit`_cj_fine(cup=3228621188, mop=3230740566, cor=3221702479) at jets.c:591
    frame #6: 0x0000000100032df5 urbit`u3j_kick(cor=3756595634, axe=2) at jets.c:643
    frame #7: 0x000000010003cc4e urbit`_n_nock_on(bus=3756596276, fol=3221644123) at nock.c:341
    frame #8: 0x000000010003c6af urbit`_n_nock_on(bus=3756595592, fol=3221782497) at nock.c:206
    frame #9: 0x000000010003c6d2 urbit`_n_nock_on(bu) 

Those are complete backtraces in each case -- I assuming the stack is being clobbered somehow.

joemfb commented 6 years ago

Note: none of those were manually interrupted. Each time, I just waited for the crash.

cgyarvin commented 6 years ago

Wait wut? I thought you said ^c? Look at the stack, it’s only 1500 frames deep... the plot thickens...

Sent from my iPhone

On May 30, 2018, at 3:41 PM, Joe Bryan notifications@github.com wrote:

Note: none of those were manually interrupted. Each time, I just waited for the crash.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

joemfb commented 6 years ago

Again my clarifications serve only to confuse... The pastebin traces were created through manual interrupts. The four short traces in my recent comment are from disabling the overflow handler and running at different optimization levels, directly, and in lldb. Each time, I waited for vere to crash, and then printed the backtrace, directly or from the core file.

cgyarvin commented 6 years ago

There’s clearly some real corruption here. Hopefully it’s the same thing as our other corruption bug.

There are two approaches: get a better read on which change introduced this, probably with more careful and aggressive bisecting; or debug the memory pool from first principles.

I’d try #1 first and it’s probably a better time investment than any HTTP work :-(. I do know how to debug memory corruption from first principles (instrument with sanity checks until you find where the error is introduced) but it ain’t easy or fun.

Sent from my iPhone

On May 30, 2018, at 3:41 PM, Joe Bryan notifications@github.com wrote:

Note: none of those were manually interrupted. Each time, I just waited for the crash.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

cgyarvin commented 6 years ago

Not clear that NO_OVERFLOW is working correctly in that case. An overflow trace should be, obviously, giant. Make sure you don’t have corefile truncation issues.

Sent from my iPhone

On May 30, 2018, at 3:50 PM, Joe Bryan notifications@github.com wrote:

Again my clarifications serve only to confuse... The pastebin traces were created through manual interrupts. The four short traces in my recent comment are from disabling the overflow handler and running at different optimization levels, directly, and in lldb. Each time, I waited for vere to crash, and then printed the backtrace, directly or from the core file.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

ohAitch commented 6 years ago

the

    frame #14: 0x0000000100023ee4 urbit`_n_nock_on + 1284
    frame #15: 0x0000000100023ef9 urbit`_n_nock^D

and

    frame #7: 0x000000010003cc4e urbit`_n_nock_on(bus=3756596276, fol=3221644123) at nock.c:341
    frame #8: 0x000000010003c6af urbit`_n_nock_on(bus=3756595592, fol=3221782497) at nock.c:206
    frame #9: 0x000000010003c6d2 urbit`_n_nock_on(bu) 

seem v suspicious

joemfb commented 6 years ago

I agree, but I'm not sure what to suspect ... ;) It's worth noting that lldb hangs for a minute when asked to print those traces.

I'll come back to this later this evening.

joemfb commented 6 years ago

So, I'm pleased to announce that this is fixed. I'm somewhat less pleased to announce that it was fixed without code changes...

The upside is that I am now well-schooled in the difference between hard and soft limits set via ulimit, and very consciously aware of the fact that vere requests the maximum stack size (ie, the hard limit) in manage.c.

Long story short, I had "doubled" my stack size when I first started hitting the original bug (unoptimized build), and never reset that to the defaults. And when I doubled the stack size, I set both the soft and hard limits to be double the default soft limit. As you can imagine, this was less than ideal:

$ ulimit -s
8192
$ ulimit -Hs
65532
$ ulimit -s 16384
$ ulimit -s
16384
$ ulimit -Hs
16384

Thanks to @frodwith for helping me debug!

cgyarvin commented 6 years ago

Thank you! Yes, ulimit is a usability disaster...

Sent from my iPhone

On May 31, 2018, at 3:40 PM, Joe Bryan notifications@github.com wrote:

So, I'm pleased to announce that this is fixed. I'm somewhat less pleased to announce that it was fixed without code changes...

The upside is that I am now well-schooled in the difference between hard and soft limits set via ulimit, and very consciously aware of the fact that vere requests the maximum stack size (ie, the hard limit) in manage.c.

Long story short, I had "doubled" my stack size when I first started hitting the original bug (unoptimized build), and never reset that to the defaults. And when I doubled the stack size, I set both the soft and hard limits to be double the default soft limit. As you can imagine, this was less than ideal:

$ ulimit -s 8192 $ ulimit -Hs 65532 $ ulimit -s 16384 $ ulimit -s 16384 $ ulimit -Hs 16384 Thanks to @frodwith for helping me debug!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

joemfb commented 6 years ago

Having irreparably derailed this issue, I'd like to summarize for posterity.

This bug should be fixed by @frodwith's recent rewrite of +jam, which uses an explicit stack and shouldn't overflow.

Fixed. Caused by an accidentally de-optimized build.

Same as above.

False report. Caused by my ulimit foolishness.

I vote to close this issue, and track any subsequent reproductions of the first problem (or any new overflows) in a new ticket.

mikolajpp commented 6 years ago

@joemfb Fine by me.

I will have opportunity to test some larger syncs soon,