pnnl / rofi


ofi_monitor_cleanup: Assertion error #1

Closed kwaters4 closed 2 years ago

kwaters4 commented 2 years ago

When running a Lamellar program in parallel, an error occurs as the program attempts to finish. I believe the problem lies somewhere between ROFI and libfabric.

The example Rust program runs to completion, and then the following error occurs:

test_app: prov/util/src/util_mem_monitor.c:84: ofi_monitor_cleanup: Assertion `dlist_empty(&monitor->list)' failed.

libfabric version: 1.15.0

I can provide more details, just let me know.

rdfriese commented 2 years ago

I committed a fix to ROFI (on master) and to Lamellar (on dev), as there were some issues in both. Hopefully they should clean up properly now!

kwaters4 commented 2 years ago

Thanks, I will rebuild and test it this week.

kwaters4 commented 2 years ago

Rebuilt ROFI from master and used the old version as well.

I had some issues running make test this time around and am still trying to identify the cause. It looks like it may be on my end.

However, with the following line in my Cargo.toml, the error went away for both versions of ROFI:

lamellar = { git = "https://github.com/pnnl/lamellar-runtime", branch = "dev", features = ["enable-rofi"]}

Thanks!