Closed: mikolajpp closed this issue 6 years ago
Is this urbit/urbit#936 perhaps? I think the bottleneck there was equality testing, which ++jam uses.
Pretty sure that analysis is correct. Are you using the latest vere here? We fixed some problems, but I'm sure Paul would love to hear that his heuristic equality tester is acting up again :-/
Jam being jetted doesn't mean that it is fast. If it has a lot of work to do, it will take a while. There's a possibility that there is a significant performance regression from the previous commits (mentioned in the linked issue). The way to test is to try with the previous commits and see if it has the same problems. With reproducible failing cases, improvement is possible. Your help in discovering such cases is much appreciated, @mikolajpp.
True — but I don’t think anything Mikolaj is doing here should be jam-bound.
Seems like a case for deeper tool usage or improvement, not haphazard folk bisection. Isn’t the profiler already instrumented to detect time spent in the comparator? It just feels like we are not approaching this with adequate seriousness and rigor, which is why the wound keeps reopening.
The changes to jam and equality are related but separate (and oft confused in the literature), but I will continue speaking of them as one issue.
> Isn’t the profiler already instrumented to detect time spent in the comparator?
Yes. I used that profiling to guide my previous efforts in solving #936, by looking at the failing cases under profiling both before and after the changes. I don't know if @mikolajpp's cases are jam-bound, equality bound, or otherwise without cases to profile.
I was using vere from around the time we got the first working meson build. Now I have updated it, but I don't really have a way to test it.
Fresh merges are always fast, no matter the history. The problem only appears after initial merge and with integration of a more extensive history.
Once fakenet local networking works again I will try to reproduce it somehow, if it is still reproducible.
Was wrestling with a problem on the testnet and ran into this in the wild. Reverting to vere 9978ad8b7204ec984cfa8ce2552af4dca7b7061a seemed to fix it.
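With a known-good commit in hand, that narrowing step generalizes to `git bisect run`. A self-contained sketch on a throwaway toy repo (the real bisection would mark the commit above as good and master as bad, with a script that exits nonzero when the slowdown reproduces; the toy repo, `state.txt`, and the commit messages are all illustrative):

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email demo@example.com
git config user.name demo

# Five commits; commit 4 introduces the "bug".
for i in 1 2 3 4 5; do
  if [ "$i" -ge 4 ]; then echo "bug $i" > state.txt; else echo "ok $i" > state.txt; fi
  git add state.txt
  git commit -q -m "commit $i"
done

# bad = HEAD, good = HEAD~4; the test command exits nonzero when the bug
# is present, so bisect converges on commit 4 without manual stepping.
result=$(git bisect start HEAD HEAD~4 && git bisect run sh -c '! grep -q bug state.txt')
git bisect reset > /dev/null
echo "$result" | grep "is the first bad commit"
```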
I'm hitting this bug trying to boot ~master-morzod into the new era. @keatondunsford, do you know if anyone else has been having trouble? I'm running vere from latest master.
I'm curious whether this allocation bug I ran into recently (which made a lot of u3h_puts in a loop very slow) could be having an impact here, since that is something that jam does. This pull request should fix the allocation bug; maybe it will have some effect on whatever this is: https://github.com/urbit/urbit/pull/987
I rebuilt with the fix in #987 and tried booting a new comet -- same problem. I'm on OS X High Sierra, with a pretty standard dev setup. I'm surprised no one else is hitting this.
I think other people are hitting it. I think Anthony's got it on his list.
Thanks for trying, by the way. It was just a hunch that fixing the allocation bug might help.
Anthony, did you ever bisect this? Keaton, can you reproduce and bisect?
This was already on my list for tomorrow (the bisection). Indisposed right now. Will do tomorrow morning if no one gets a chance sooner.
Getting started on this. Reproducing on the MVE testnet, just FYI.
J'accuse: https://github.com/urbit/urbit/commit/26996c2b1647c0017bea2ab8789768daf0208709. Was able to reproduce this reliably on High Sierra 10.13.3. The commit right before this is smooth like butter.
I spoke with Keat and he said that he hasn't seen this at all on Linux, which gives one pause.
Not including -O3 will have some effect on performance...
I think release will use -O2 by default, based on the following snippet from http://mesonbuild.com/Running-Meson.html:

- `plain`: no extra build flags are used, even for compiler warnings; useful for distro packagers and other cases where you need to specify all arguments yourself
- `debug`: debug info is generated but the result is not optimized (this is the default)
- `debugoptimized`: debug info is generated and the code is optimized (on most compilers this means `-g -O2`)
- `release`: full optimization, no debug info
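For completeness, the optimization level can be pinned at configure time rather than relying on the buildtype default. A sketch, assuming a stock Meson + Ninja setup of that era; the `build` directory name is illustrative and the exact option syntax may vary by Meson version:

```sh
meson build --buildtype=release      # "full optimization, no debug info"
meson configure build                # inspect the resulting options
meson configure build -Dc_args=-O3   # or force -O3 explicitly
ninja -C build
```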
This can’t be the cause (it’s Paul’s fix — didn’t you see it before he sent it?)
Anthony's link just goes to master; here is the correct link: https://github.com/urbit/urbit/commit/26996c2b1647c0017bea2ab8789768daf0208709
Whoop sorry.
The opt level now is set in the bowels of meson as far as I can tell.
Interesting that O2 is that different...
I recommend reverting this commit. I mess with these flags arrays all the time when I'm hacking vere to debug things anyway.
Can confirm that latest with this commit reverted builds like whoa and doesn't seem to dig over.
Any objections to reverting?
Make sure the readme gets reverted as well. We’ll figure it out, this ain’t closed.
With this revert in place on latest master, I can now boot a comet without any stack overflows. I'll try later with my planet -- I expect that to work as well. Thanks!
So, I'm still having what appears to be a variation of this problem with my ~planet.
Originally, for both planet and comet, I was getting this failure, looping indefinitely:
```
|«ames»ng hood, this may take a few minutes>
recover: dig: over
over
[%swim-call-vane %a %soft ~]
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.673 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.674 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.675 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.676 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.677 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.678 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.679 9].[2.683 52]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.679 13].[2.679 51]>
/~zod/home/~2018.5.25..02.00.34..f57e/sys/zuse:<[2.679 17].[2.679 51]>
```
After updating vere to build with -O3, I was able to boot a new comet successfully. (I didn't try continuing on with the same comet.)
I tried my planet again with -w/-t, and was able to successfully %dill-init as a comet, but can't complete the upgrade to the planet. It loops indefinitely with this error:
```
[%dill-init ~master-morzod %hood]
activated sync from %kids on ~binzod to %base
activated sync from %base on ~master-morzod to %home
sync succeeded from %base on ~master-morzod to %home
activated sync from %base on ~master-morzod to %kids
sync succeeded from %base on ~master-morzod to %kids
>> [%mo-not-running %talk %peer]
>> [%mo-not-running %dojo %peer]
>> [%mo-not-running %hall %peer]
>> [%mo-not-running %hall %peer]
'initial merge succeeded'
[%vega-start-hoon /~master-morzod/home/~2018.5.29..07.03.35..6160/sys/hoon]
<awaiting hood, this may take a few minutes>
recover: dig: over
over
/~master-morzod/home/~2018.5.29..07.03.35..6160/sys/hoon
```
I'm not sure how to debug that -- I don't think I've seen "recover: dig: over" without a stacktrace before.
Also, since I'm not crystal clear on the multi-homing identity upgrade semantics, has my planet completed the ticket handshake with its star? How would I check that from the star side?
Your planet definitely exists. What’s on that stack? Can you get a C trace?
Here's some stacktraces procured by running the planet in lldb and hitting ^c:
- debug build: https://pastebin.com/tPSfJNSt
- release build: https://pastebin.com/45F6m7hE
Neither build had the same interim printfs (ie, %dill-init and so on) -- I'm not sure if that's because of the lldb terminal environment, or some more significant difference. The debug build overflowed repeatedly and quickly, and the stack was different each time I interrupted it. The release build consistently had u3v_poke and lots of _nock_on on the stack if I interrupted it, but would crash itself in u3wa_sub if I did not.
I'm not sure how to interpret any of that. Happy to try other things, or pack up the pier if you'd like to see it.
Hitting ^C at some random moment tells us absolutely nothing. These stack traces are obviously not deep.
The question is what the stack looks like when it crashes. You'll obviously need to disable the stack overflow handler to get that trace.
On Wed, May 30, 2018 at 3:09 PM, Joe Bryan notifications@github.com wrote:
Here's some stacktraces procured by running the planet in lldb and hitting ^c:
- debug build: https://pastebin.com/tPSfJNSt
- release build: https://pastebin.com/45F6m7hE
In manage.c, #define NO_OVERFLOW.
Core dump from release build:
* thread #1, stop reason = signal SIGSTOP
* frame #0: 0x000000010d5f5bd4 urbit`u3r_mug + 4
frame #1: 0x000000010d5f5ca3 urbit`u3r_mug + 211
frame #2: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #3: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #4: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #5: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #6: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #7: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #8: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #9: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #10: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #11: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #12: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #13: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #14: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #15: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #16: 0x000000010d5f5cad urbit`u3r_mug + 221
frame #17: 0x000000010d5f5cad urbit`u3r_mug + 221
Release build run in lldb:
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
* frame #0: 0x00007fff72538b6e libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff72703080 libsystem_pthread.dylib`pthread_kill + 333
frame #2: 0x00007fff724941ae libsystem_c.dylib`abort + 127
frame #3: 0x000000010002130a urbit`u3m_bail + 58
frame #4: 0x000000010002178a urbit`u3m_error + 26
frame #5: 0x000000010002aa3b urbit`u3wa_sub + 91
frame #6: 0x000000010001fbd3 urbit`_cj_kick_z + 179
frame #7: 0x000000010001f760 urbit`u3j_kick + 528
frame #8: 0x0000000100023ad9 urbit`_n_nock_on + 249
frame #9: 0x0000000100023ef9 urbit`_n_nock_on + 1305
frame #10: 0x0000000100023ee4 urbit`_n_nock_on + 1284
frame #11: 0x0000000100023ef9 urbit`_n_nock_on + 1305
frame #12: 0x0000000100023acc urbit`_n_nock_on + 236
frame #13: 0x0000000100023b65 urbit`_n_nock_on + 389
frame #14: 0x0000000100023ee4 urbit`_n_nock_on + 1284
frame #15: 0x0000000100023ef9 urbit`_n_nock^D
Core dump from debug build (much quicker crash):
* thread #1, stop reason = signal SIGSTOP
* frame #0: 0x000000010c285a38 urbit`_song_x(a=3221668534, b=3221668534, uni=(urbit`_sung_one at retrieve.c:547)) at retrieve.c:763
frame #1: 0x000000010c2862f0 urbit`u3r_sung(a=3221668534, b=3221668534) at retrieve.c:884
frame #2: 0x000000010c2769dc urbit`_ch_node_gut(han_u=0x00000002030abafc, lef_w=15, rem_w=25359, key=3221668534) at hashtable.c:646
frame #3: 0x000000010c276aad urbit`_ch_node_gut(han_u=0x000000020341d240, lef_w=20, rem_w=90895, key=3221668534) at hashtable.c:659
frame #4: 0x000000010c27689a urbit`u3h_gut(har_p=14131959, key=3221668534) at hashtable.c:694
frame #5: 0x000000010c278f3b urbit`u3j_find(bat=3221668534) at jets.c:206
frame #6: 0x000000010c2795d9 urbit`u3j_kick(cor=3221702479, axe=8) at jets.c:613
frame #7: 0x000000010c283c4e urbit`_n_nock_on(bus=3756596240, fol=3221664515) at nock.c:341
frame #8: 0x000000010c283bcf urbit`_n_nock_on(bus=3756596240, fol=3221863267) at nock.c:320
frame #9: 0x000
Debug build running in lldb:
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSEGV
* frame #0: 0x000000010003ea92 urbit`_song_x(a=143, b=143, uni=(urbit`_sang_one at retrieve.c:456)) at retrieve.c:766
frame #1: 0x000000010003e9c0 urbit`u3r_sang(a=143, b=143) at retrieve.c:849
frame #2: 0x000000010003f28c urbit`u3r_sing(a=143, b=143) at retrieve.c:863
frame #3: 0x00000001000339fe urbit`_cj_fine(cup=3234234471, mop=3234234465, cor=3221702449) at jets.c:564
frame #4: 0x0000000100034213 urbit`_cj_fine(cup=3234234471, mop=3234234465, cor=3221702449) at jets.c:591
frame #5: 0x0000000100034213 urbit`_cj_fine(cup=3228621188, mop=3230740566, cor=3221702479) at jets.c:591
frame #6: 0x0000000100032df5 urbit`u3j_kick(cor=3756595634, axe=2) at jets.c:643
frame #7: 0x000000010003cc4e urbit`_n_nock_on(bus=3756596276, fol=3221644123) at nock.c:341
frame #8: 0x000000010003c6af urbit`_n_nock_on(bus=3756595592, fol=3221782497) at nock.c:206
frame #9: 0x000000010003c6d2 urbit`_n_nock_on(bu)
Those are complete backtraces in each case -- I'm assuming the stack is being clobbered somehow.
Note: none of those were manually interrupted. Each time, I just waited for the crash.
Wait wut? I thought you said ^c? Look at the stack, it’s only 1500 frames deep... the plot thickens...
Again my clarifications serve only to confuse... The pastebin traces were created through manual interrupts. The four short traces in my recent comment are from disabling the overflow handler and running at different optimization levels, directly, and in lldb. Each time, I waited for vere to crash, and then printed the backtrace, directly or from the core file.
There’s clearly some real corruption here. Hopefully it’s the same thing as our other corruption bug.
There are two approaches: get a better read on which change introduced this, probably with more careful and aggressive bisecting; or debug the memory pool from first principles.
I’d try #1 first and it’s probably a better time investment than any HTTP work :-(. I do know how to debug memory corruption from first principles (instrument with sanity checks until you find where the error is introduced) but it ain’t easy or fun.
Not clear that NO_OVERFLOW is working correctly in that case. An overflow trace should be, obviously, giant. Make sure you don’t have corefile truncation issues.
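One quick way to rule out corefile size caps (a sketch; where cores land and whether they're enabled at all varies by OS -- macOS, for instance, only writes them to /cores when configured to):

```shell
# Remove any size cap on core dumps for this shell session,
# then report the resulting soft limit.
ulimit -c unlimited
ulimit -c
```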
the
frame #14: 0x0000000100023ee4 urbit`_n_nock_on + 1284
frame #15: 0x0000000100023ef9 urbit`_n_nock^D
and
frame #7: 0x000000010003cc4e urbit`_n_nock_on(bus=3756596276, fol=3221644123) at nock.c:341
frame #8: 0x000000010003c6af urbit`_n_nock_on(bus=3756595592, fol=3221782497) at nock.c:206
frame #9: 0x000000010003c6d2 urbit`_n_nock_on(bu)
seem v suspicious
I agree, but I'm not sure what to suspect ... ;) It's worth noting that lldb hangs for a minute when asked to print those traces.
I'll come back to this later this evening.
So, I'm pleased to announce that this is fixed. I'm somewhat less pleased to announce that it was fixed without code changes...
The upside is that I am now well-schooled in the difference between hard and soft limits set via ulimit, and very consciously aware of the fact that vere requests the maximum stack size (ie, the hard limit) in manage.c.
Long story short, I had "doubled" my stack size when I first started hitting the original bug (unoptimized build), and never reset that to the defaults. And when I doubled the stack size, I set both the soft and hard limits to be double the default soft limit. As you can imagine, this was less than ideal:
$ ulimit -s
8192
$ ulimit -Hs
65532
$ ulimit -s 16384
$ ulimit -s
16384
$ ulimit -Hs
16384
Thanks to @frodwith for helping me debug!
Thank you! Yes, ulimit is a usability disaster...
Having irreparably derailed this issue, I'd like to summarize for posterity.
This bug should be fixed by @frodwith's recent rewrite of +jam, which uses an explicit stack and shouldn't overflow.
Fixed. Caused by an accidentally de-optimized build.
Same as above.
False report. Caused by my ulimit foolishness.
I vote to close this issue, and track any subsequent reproductions of the first problem (or any new overflows) in a new ticket.
@joemfb Fine by me.
I will have an opportunity to test some larger syncs soon.
Having been deploying plato for some time now, I have acquainted myself with the effects of extensive merges and syncs. It seems (no evidence except statistical stack traces, which always hit jam at zuse.hoon:2701) that it is very slow because of packet encryption? Whenever I let my clay history get slightly larger (say ~50 commits to clay of a few hoon files), the merges become unbearably long, on the order of half an hour, or even hours.
The only help so far has been wrecking moons left and right and putting the code from scratch on a new one.
Am I hitting some strange bug, or is this considered normal operation at present? I would think that since jam is jetted, it should have negligible impact.