Closed by bpablo 3 weeks ago
Hi Bert,
Long time no see! Which PHOEBE version are you using? Is this a specific Apple architecture, perhaps one of the Mx models? With what errors does it fail? Can you increase verbosity with

```python
libphoebe.setup_verbosity(4)
```

and run it again? Can I get an account on such a machine, so that I can see where it fails?
On my Ubuntu 22.04, Intel i7 machine using current master branch it works without problems.
Cheers, Martin
Couldn't reproduce on my mac (2018 Intel running OS X 13.6.6 Ventura).
Neither can I on 12.6.5 Monterey with an M1 Max, in a conda env compiled with clang 14.0.6.
Hey Martin,
I have an M1 Pro chip, so arm architecture. I have a student using the same chip who can reproduce. I am running 14.4.1 and Xcode 15, which are new, but I updated to see if I could fix the problem; I'm unsure exactly what I was running before, but there was no change either way. I tried pip installing locally and with conda, as well as source installing locally and with conda, both of which give the same result. Here is the full code with the verbosity flag:

```python
import numpy as np
import libphoebe

libphoebe.setup_verbosity(4)
values = np.array([3.13850013e-18, 9.30808211e-33, 1.00000000e+00])
libphoebe.roche_misaligned_marching_mesh(0.6, 1.0, 0.8140441018004452, values, 6.056472707118865, 0.026244734687742324)
```

and the output:

```
roche_misaligned_marching_mesh::START
roche_misaligned_marching_mesh::spin:3.1385001300000000e-18 9.3080821100000003e-33 1.0000000000000000e+00
Omega:6.0564727071188651e+00 q=5.9999999999999998e-01 F=1.0000000000000000e+00 d=8.1404410180044517e-01 delta=2.6244734687742324e-02 full=1 max_triangles=10000000
roche_misaligned_marching_mesh::r=5.8793388891838794e-19 1.7436790462404645e-33 1.8732957290601990e-01 g=6.7438950320921065e-02 2.6704016719047415e-31 2.8689064410334705e+01
Segmentation fault: 11
```
Since this (or something like it) is now causing segfaults on the CI as well, it might be possible to create a PR/branch that strips the CI matrix and tests down to just the combination that reproduces the segfault, and diagnose it there. It is also possible to then enter that runner interactively (although I don't have any experience doing that). @horvatm, do you think this would help track down the problem?
See #856 for a CI environment that reproduces the workflow (thanks @bpablo). You should be able to force that to rerun and enter it interactively to debug (or push print statements in the code to that branch to have them exposed in the CI logs).
I managed to ssh into the runner and get a backtrace:
```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x600020000018)
    frame #0: 0x00000001147e2888 libphoebe.cpython-311-darwin.so`Tmarching<double, Tmisaligned_roche<double>>::check_bad_pairs(std::__1::vector<Tmarching<double, Tmisaligned_roche<double>>::Tvertex, std::__1::allocator<Tmarching<double, Tmisaligned_roche<double>>::Tvertex>>&, double const&) [inlined] double utils::dot3D<double, double>(x=<unavailable>, y=0x0000600020000018) at utils.h:229:17 [opt]
   226
   227  // x^T.y
   228  template <class T, class F> T inline dot3D(T x[3], F y[3]) {
-> 229    return x[0]*y[0] + x[1]*y[1] + x[2]*y[2];
   230  }
   231
   232  // z = x cross y
```
I'm posting this here in case @horvatm has an idea what's happening; otherwise I'll try to figure it out. The action runner now gives you an ssh command to log in straight to the runner; look for it in the action output. The runner will shut down if you don't connect within 5 minutes.
This helps. I think I know what happens, but for the fix I would need access to a machine for a few hours/days; I cannot just add some hack and hope for the best. I think the hot fix is to add the line

```cpp
if (P.size() <= 3) return Tbad_pair(0, 0);  // no bad pairs can be determined
```

just after the definition of the function `check_bad_pairs` in `triang_marching.h`:

```cpp
Tbad_pair check_bad_pairs(Tfront_polygon &P, const T &delta2)
```

But honestly, I don't know if this will have the desired effect :D
@horvatm, you can log on to the runner and try anything you'd like as if it's a local machine:
Web shell: https://tmate.io/t/fXhyf35QhnK5fQ3tbgDJadTFU
SSH: ssh fXhyf35QhnK5fQ3tbgDJadTFU@sfo2.tmate.io
You will have 5 minutes to log in; otherwise the runner will shut down. The same applies to idle time.
@horvatm just run the test as you would on a local machine. If it doesn't segfault, please open a separate PR with the fix against bugfix-2.4.14.
I think debugging in this environment is almost doable. I am just encountering some issues doing a fresh build using

```shell
rm -rf build
pip install .
```

Namely, I get the following error:
```
.....
g++ -bundle -undefined dynamic_lookup -arch arm64 -arch x86_64 -g build/temp.macosx-10.9-universal2-cpython-311/phoebe/lib/libphoebe.o -o build/lib.macosx-10.9-universal2-cpython-311/libphoebe.cpython-311-darwin.so
g++: warning: this compiler does not support x86 ('-arch' option ignored)
0   0x102b8b648  __assert_rtn + 72
1   0x102abffac  ld::AtomPlacement::findAtom(unsigned char, unsigned long long, ld::AtomPlacement::AtomLoc const*&, long long&) const + 1204
2   0x102ad5924  ld::InputFiles::SliceParser::parseObjectFile(mach_o::Header const*) const + 15164
3   0x102ae2e30  ld::InputFiles::parseAllFiles(void (ld::AtomFile const*) block_pointer)::$_7::operator()(unsigned long, ld::FileInfo const&) const + 420
4   0x18c7b6428  _dispatch_client_callout2 + 20
5   0x18c7ca850  _dispatch_apply_invoke3 + 336
6   0x18c7b63e8  _dispatch_client_callout + 20
7   0x18c7b7c68  _dispatch_once_callout + 32
8   0x18c7caeec  _dispatch_apply_invoke_and_wait + 372
9   0x18c7c9e9c  _dispatch_apply_with_attr_f + 1212
10  0x18c7ca08c  dispatch_apply + 96
11  0x102b5d3b8  ld::AtomFileConsolidator::parseFiles(bool) + 292
12  0x102afe170  main + 9048
ld: Assertion failed: (resultIndex < sectData.atoms.size()), function findAtom, file Relocations.cpp, line 1336.
collect2: error: ld returned 1 exit status
error: command '/usr/local/bin/g++' failed with exit code 1
```
There seem to be some linkage problems on Macs; see https://apple.stackexchange.com/questions/466348/g-fails-after-13-4-13-6-1-update
Perhaps take a look at how the runner sets up the job to compile the code before it initializes the SSH access, and mimic that?
Good news: we figured out how to set up the debugging environment on the runners. @horvatm now has an idea where to look, so he will attempt to trace this segfault as time permits. Everyone else, hang on tight, we're on it.
I think I will need to make a series of modifications in order to debug this problem. Do I make them in the branch pablo-patch-1, or should I fork that branch?
Essentially, I need a branch running only on macOS with a single test, as I can test on Linux at home. For now, the idea is to add a lot of print statements in triangulize_full_clever() and compare them step by step with the ones from a Linux machine.
@horvatm, I would just rename the branch to something more telling, such as marching_bugfix, and use that.
We'll eventually need just the fix itself (not the changes to the GitHub workflows) in a PR to the bugfix branch, but you can place them here for now to debug, and we can easily move them over later.
Of course, only cleaned and workable code will be merged into releases. I just need a place to test ideas in.
I've just failed to reproduce this with my new Mac (M3, gcc-13, phoebe-2.4.13 installed with pip). I did initially have some problems installing, which I think were related to different instances of gcc being used to compile and link, but forcing both to the same version (gcc-13) worked fine...
Closed by #884 (to be included in 2.4.14 bugfix release)
The following code fails on two different Mac machines but works on Windows: