phoebe-project / phoebe2

PHOEBE - Eclipsing Binary Star Modeling Software
http://phoebe-project.org
GNU General Public License v3.0
76 stars 28 forks source link

libphoebe function roche_misaligned_marching_mesh segmentation fault on MacOS #850

Closed bpablo closed 3 weeks ago

bpablo commented 2 months ago

The following code fails on Mac with two different machines but works on windows:

import numpy as np
import libphoebe
values = np.array([3.13850013e-18, 9.30808211e-33, 1.00000000e+00])

libphoebe.roche_misaligned_marching_mesh(0.6, 1.0, 0.8140441018004452, values, 6.056472707118865, 0.026244734687742324)
horvatm commented 2 months ago

Hi Bert,

long time no see. Which phoebe version are you using? Is this a specific Apple architecture, Mx models perhaps? With what errors does it fail? Can you increase verbosity

libphoebe.setup_verbosity(4)

and run it again? Can I get an account on such a machine, so that I can see where it fails?

On my Ubuntu 22.04, Intel i7 machine using current master branch it works without problems.

Cheers, Martin

Raindogjones commented 2 months ago

Couldn't reproduce on my mac (2018 Intel running OS X 13.6.6 Ventura).

kecnry commented 2 months ago

Neither can I on 12.6.5 Monterey with M1 Max in a conda env compiled with clang 14.0.6

bpablo commented 2 months ago

Hey Martin,

I have an M1 Pro chip so arm architecture. I have a student using the same chip that can reproduce. I am running 14.4.1 and Xcode 15, which are new but I updated to see if I could fix the problem. Unsure exactly what I was running before, but there was no change. I tried pip installing locally and with conda as well as source installing locally and with conda, both of which give the same result. Here is the full code with the verbosity flag: import numpy as np import libphoebe libphoebe.setup_verbosity(4) values = np.array([3.13850013e-18, 9.30808211e-33, 1.00000000e+00]) libphoebe.roche_misaligned_marching_mesh(0.6, 1.0, 0.8140441018004452, values, 6.056472707118865, 0.026244734687742324) roche_misaligned_marching_mesh::START roche_misaligned_marching_mesh::spin:3.1385001300000000e-18 9.3080821100000003e-33 1.0000000000000000e+00 Omega:6.0564727071188651e+00 q=5.9999999999999998e-01 F=1.0000000000000000e+00 d=8.1404410180044517e-01 delta =2.6244734687742324e-02 full=1 max_triangles=10000000 roche_misaligned_marching_mesh::r=5.8793388891838794e-19 1.7436790462404645e-33 1.8732957290601990e-01 g=6.7438950320921065e-02 2.6704016719047415e-31 2.8689064410334705e+01 Segmentation fault: 11

kecnry commented 2 months ago

Since this (or something like it) is now causing segfaults on the CI as well, it might be possible to create a PR/branch that strips down the CI matrix and tests to just the combination that reproduces the segfaults and diagnose there. It also is possible to then enter that runner interactively (although I don't have any experience doing that). @horvatm - do you think this would help track down the problem?

kecnry commented 2 months ago

See #856 for a CI environment that reproduces the workflow (thanks @bpablo). You should be able to force that to rerun and enter interactively to debug (or push any print statements into the code to that branch to have them exposed in the CI logs).

aprsa commented 2 months ago

I managed to ssh into the runner and get a backtrace:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x600020000018)
    frame #0: 0x00000001147e2888 libphoebe.cpython-311-darwin.so`Tmarching<double, Tmisaligned_roche<double>>::check_bad_pairs(std::__1::vector<Tmarching<double, Tmisaligned_roche<double>>::Tvertex, std::__1::allocator<Tmarching<double, Tmisaligned_roche<double>>::Tvertex>>&, double const&) [inlined] double utils::dot3D<double, double>(x=<unavailable>, y=0x0000600020000018) at utils.h:229:17 [opt]
   226 
   227    // x^T.y
   228    template <class T, class F> T inline dot3D(T x[3], F y[3]) {
-> 229      return x[0]*y[0] + x[1]*y[1] + x[2]*y[2];
   230    }
   231 
   232    // z = x cross y

I'm posting this here if @horvatm has an idea what's happening, otherwise I'll try to figure it out. The action runner now gives you a ssh command to log in straight to the runner, look for it in the action output. It will shut down the runner if you don't connect within 5 minutes.

horvatm commented 2 months ago

This helps. I think I know what happens, but for the fix I would need access to a machine for few hours/days. I can not just add some hack and hope for the best. I think the hot fix is to add a line

if (P.size() <= 3) return Tbad_pair(0, 0);    # no bad pairs can be determined

just after definition of function check_bad_pairs in "triang_marching.h" :

Tbad_pair check_bad_pairs(Tfront_polygon &P, const T &delta2)

But honestly, I don't know if this will have the desired effect :D

aprsa commented 2 months ago

@horvatm, you can log on to the runner and try anything you'd like as if it's a local machine:

aprsa commented 2 months ago

@horvatm just run the test as you would on a local machine. If it doesn't segfault, please open a separate PR with the fix against bugfix-2.4.14.

horvatm commented 2 months ago

I think debugging in this is almost doable. I am just encountering some issues to do a fresh using

rm -rf build
pip install .

Namely, I get the following error.

.....      
g++ -bundle -undefined dynamic_lookup -arch arm64 -arch x86_64 -g build/temp.macosx-10.9-universal2-cpython-311/phoebe/lib/libphoebe.o -o build/lib.macosx-10.9-universal2-cpython-311/libphoebe.cpython-311-darwin.so
      g++: warning: this compiler does not support x86 ('-arch' option ignored)
      0  0x102b8b648  __assert_rtn + 72
      1  0x102abffac  ld::AtomPlacement::findAtom(unsigned char, unsigned long long, ld::AtomPlacement::AtomLoc const*&, long long&) const + 1204
      2  0x102ad5924  ld::InputFiles::SliceParser::parseObjectFile(mach_o::Header const*) const + 15164
      3  0x102ae2e30  ld::InputFiles::parseAllFiles(void (ld::AtomFile const*) block_pointer)::$_7::operator()(unsigned long, ld::FileInfo const&) const + 420
      4  0x18c7b6428  _dispatch_client_callout2 + 20
      5  0x18c7ca850  _dispatch_apply_invoke3 + 336
      6  0x18c7b63e8  _dispatch_client_callout + 20
      7  0x18c7b7c68  _dispatch_once_callout + 32
      8  0x18c7caeec  _dispatch_apply_invoke_and_wait + 372
      9  0x18c7c9e9c  _dispatch_apply_with_attr_f + 1212
      10  0x18c7ca08c  dispatch_apply + 96
      11  0x102b5d3b8  ld::AtomFileConsolidator::parseFiles(bool) + 292
      12  0x102afe170  main + 9048
      ld: Assertion failed: (resultIndex < sectData.atoms.size()), function findAtom, file Relocations.cpp, line 1336.
      collect2: error: ld returned 1 exit status
      error: command '/usr/local/bin/g++' failed with exit code 1

There seems to be some linkage problems on macs see https://apple.stackexchange.com/questions/466348/g-fails-after-13-4-13-6-1-update

aprsa commented 2 months ago

Perhaps take a look at how the runner sets up the job to compile the code before it inits the ssh access and mimic that?

aprsa commented 2 months ago

Good news: we figured out how to set up the debugging environment on the runners. @horvatm now has an idea where to look, so he will attempt to trace this segfault as time permits. Everyone else, hang on tight, we're on it.

horvatm commented 2 months ago

I think I will need to do a series of modifications in order to debug this problem. Do I do them in branch pablo-patch-1, or should I fork this branch?

Essentially, I need a branch running only on mac and a single test, as I can test on linux at home. For now, the idea is to add a lot of print statements in triangulize_full_clever() and compare them step by step with the ones on a linux machine.

aprsa commented 2 months ago

@horvatm, I would just rename the branch to something more telling, such as marching_bugfix, and use that.

kecnry commented 2 months ago

we'll eventually need just the fix itself (not the changes to the GitHub workflows) in a PR to the bugfix branch, but you can place them here for now to debug and we can easily move them over later.

horvatm commented 2 months ago

Of course, only cleaned and workable code will be merged into releases. I just need a place to test ideas in.

Raindogjones commented 1 month ago

I've just failed to reproduce this with my new mac (M3, gcc-13, phoebe-2.4.13 installed with pip). I did initially have some problems installing which I think are related to different instances of gcc being using to compile and link, but forcing both to the same version (gcc-13) worked fine...

kecnry commented 3 weeks ago

Closed by #884 (to be included in 2.4.14 bugfix release)