guillochon / MOSFiT

Modular Open Source Fitter for Transients
http://mosfit.readthedocs.io/
MIT License

mpirun doesn't work for np > 1 #191

Open LydiaMak opened 3 years ago

LydiaMak commented 3 years ago

Hi,

When running mpirun -np 2 (or higher) mosfit {json_file} -m {model}, MOSFiT crashes. I mentioned this to Matt in private communication, but since the problem persists I am opening an issue in case someone else has come across it. The error I get is:

Traceback (most recent call last):
  File "/Users/lydiamakrygianni/opt/miniconda3/lib/python3.8/site-packages/schwimmbad/mpi.py", line 72, in __init__
    self.wait()
  File "/Users/lydiamakrygianni/opt/miniconda3/lib/python3.8/site-packages/schwimmbad/mpi.py", line 122, in wait
    func, arg = task
ValueError: not enough values to unpack (expected 2, got 1)
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 1

This is repeated once for every process with rank > 0.
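For anyone reading the traceback: the failing statement is a plain two-element unpacking. The sketch below reproduces the same ValueError outside of MPI. It is only an illustration; `unpack_task` is a hypothetical helper, and the one-element tuple is an assumed stand-in for whatever the worker actually received, which the traceback does not show.

```python
def unpack_task(task):
    """Mimic the `func, arg = task` unpacking shown in the traceback."""
    func, arg = task  # requires exactly two elements
    return func(arg)


# A well-formed task: a (function, argument) pair.
print(unpack_task((abs, -3)))  # prints 3

# A malformed task: a one-element tuple (assumed payload, for illustration only).
try:
    unpack_task((abs,))
except ValueError as err:
    print(err)  # prints: not enough values to unpack (expected 2, got 1)
```

So the worker process appears to have received a message that was not a (function, argument) pair.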

I have run it on various machines and have also tried the example, which fails in the same way. I use Python 3.8.

If anyone has any idea (or can confirm that the example works for them), it would be very useful for some very long runs!

Cheers,

Lydia

wakatara commented 3 years ago

I also have the precise same problem on OSX, and several flavours of Linux.

bmockler commented 3 years ago

Hi all, it looks like there might be a bug in schwimmbad which is causing incompatibility with newer versions of mpi4py: https://githubmemory.com/repo/adrn/schwimmbad/issues

I get the same issue when I am using schwimmbad v0.3.1 and mpi4py v3.0.3. When I downgrade to schwimmbad v0.3.0, it works for me.
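A quick way to see which schwimmbad release an environment has is sketched below. It uses only the standard library (`importlib.metadata`, available from Python 3.8). `KNOWN_GOOD`, `parse_version`, and `schwimmbad_status` are hypothetical names, and the "known good" version is just the one reported in this thread, not an official compatibility statement.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Release reported as working in this thread; anecdotal, not an official pin.
KNOWN_GOOD = (0, 3, 0)


def parse_version(text):
    """Return (major, minor, patch) from a string like '0.3.1', or None."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def schwimmbad_status(installed=None):
    """Describe the installed schwimmbad version (or the one passed in)."""
    if installed is None:
        try:
            installed = version("schwimmbad")
        except PackageNotFoundError:
            return "schwimmbad is not installed"
    if parse_version(installed) == KNOWN_GOOD:
        return f"schwimmbad {installed}: reported to work with mpirun in this thread"
    return f"schwimmbad {installed}: not the 0.3.0 release reported to work here"


print(schwimmbad_status())
```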

~Brenna

ZhihaoChen5 commented 2 years ago

Agreed! If v0.3.1 doesn't work, please try v0.3.0. At least it works for me.

dnfarias commented 2 years ago

Hi all,

I'm facing the same issue on OSX Monterey with Python 3.9 (Anaconda env). Can you confirm which Python version works, please? (I'll try a fresh environment.) Thanks!

ZhihaoChen5 commented 2 years ago

Python 3.6.8, MOSFiT v1.1.7, and schwimmbad 0.3.0 work for me.

dnfarias commented 2 years ago

Dear @hatter5,

I can confirm that it's working for me :-) Many thanks!

pkgw commented 3 weeks ago

See also this issue on the schwimmbad side: https://github.com/adrn/schwimmbad/issues/32#issuecomment-2306095394

Adrian's explanation is basically that MOSFiT shouldn't be calling the pool.wait() function directly; it sounds like schwimmbad users shouldn't be accessing its MPI pool directly.

mnicholl commented 2 weeks ago

Great to finally understand this! Though I don’t know how to strip out the pool.wait() without breaking other things. Is it essential to fix this before the conda release?

pkgw commented 2 weeks ago

No, I completed the Conda update last week or so. It looks like this issue has been around for several years at this point, so it's clearly not a showstopper for a lot of people.

Based on Adrian's advice it sounds like the relevant code should be redesigned to either use only supported schwimmbad interfaces, or to use some other mechanism for inter-process communication. I don't have a sense of how dirty of a hack the current code is — maybe it's just a matter of using some of the MPI libraries more directly, or maybe this is exposing a bigger architectural issue that needs addressing. Someone's going to need to sit down and understand the current code, and research solutions.
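As a rough illustration of what "use only supported interfaces" could look like, here is a minimal sketch in which the fitting code touches the pool through nothing but a `map` method, which schwimmbad's pools provide. This is not MOSFiT's actual code: `SerialStandInPool`, `log_likelihood`, and `evaluate` are hypothetical names, and the stand-in pool exists only so the example runs without MPI installed.

```python
class SerialStandInPool:
    """Hypothetical stand-in exposing only `map`, like the built-in function."""

    def map(self, func, iterable):
        return list(map(func, iterable))


def log_likelihood(x):
    """Toy likelihood used only for this sketch."""
    return -0.5 * x * x


def evaluate(pool, positions):
    """Evaluate all positions through pool.map, with no other pool calls."""
    return list(pool.map(log_likelihood, positions))


print(evaluate(SerialStandInPool(), [1.0, 2.0]))  # prints [-0.5, -2.0]
```

If the code only ever calls `map`, then any pool object offering that method could in principle be swapped in without changing `evaluate`. Whether MOSFiT's current communication pattern can be expressed this way is exactly the open question.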

dnfarias commented 2 weeks ago

Hi all,

I just want to say that @pkgw is totally right. The latest version of schwimmbad is incompatible with the current version of MOSFiT. However, I can run MOSFiT on a cluster with mpirun under this configuration: Python 3.11.0, MOSFiT 1.1.9, schwimmbad 0.3.0.

I'm pretty sure it wouldn't be so difficult to make MOSFiT compatible with schwimmbad, but two years ago, I just left it like that.