Closed CharlelieLrt closed 6 months ago
Could you please share the console output you're seeing (or a screenshot)?
Here it is (it's truncated):
Reading log file "legate_0.prof"...
Reading log file "legate_1.prof"...
thread '<unnamed>' panicked at /Users/charlelie/Documents/RESEARCH/CODES/LEGATE/legate.core/_skbuild/macosx-12.0-arm64-3.10/cmake-build/_deps/legion-src/tools/legion_prof_rs/src/serialize.rs:1064:66:
called `Result::unwrap()` on an `Err` value: Error(Error { input: [90, 101, 114, 111, 84, 105, 109, 101, 32, 123, 105, 100, 58, 52, 54, 44, 32, 122, 101, 114, 111, 95, 116, 105, 109, 101, 58, 108, 111, 110, 103, 32, 108, 111, 110, 103, 58, 56, 125, 10, 77, 101, 109, 68,
[...]
2, 62, 215, 2, 0, 0, 0, 45, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 29, 85, 27, 0, 0, 0, 0, 0, 0, 207, 198, 62, 215, 2, 0, 0, 0, 55, 200, 62, 215, 2, 0, 0, 0], code: Tag })
Do you mind sharing your profiler dump file? This sounds like an issue in the profiler script. @eddy16112 @bryevdv or @elliottslaughter could possibly take a look. Folks, note the Legion commit from Charlelie's message above: g08f0ce8.
Are you collecting the profile on the same machine as the one where you're trying to view it? This doesn't need to be the case, but the dump should ideally be generated by the same version of Legion as the one backing the profile viewer (although there should be a check at this point for that...).
This is a parse error. It happens when you use a different version of the profiler than what Legion the application was built with.
@eddy16112 I thought you had added a version compatibility check that would check for this situation? Am I misremembering?
It's blocked on me figuring out the best way to embed the version number in Rust.
@manopapad I am a little confused, 08f0ce8
is the hash of a Legate commit, not a Legion commit. What is even more surprising, is that I have tried to use the option --legate-branch
to specify a Legion commit when building legate.
So, the profiles were generated with a version of legate built with --legion-branch 04ee5be1dc3b742f195348c78458450f5dd35f44
and that seemed to work, but legate-issue
shows Legion: v23.11.00.dev-37-gac081ac
.
I am trying to visualize the profiles on another machine. I have tried to rebuild legate with --legion-branch 04ee5be1dc3b742f195348c78458450f5dd35f44
, but legate-issue
shows v23.11.00.dev-54-g40f6061
.
So I guess I misunderstood how the option --legion-branch
works, as it does not seem to be picking up a commit hash.
Is there a way to directly specify a commit number to the install script?
I also don't understand the commit hash shown by legate-issue
. Is there another way to check which Legion commit I am using?
I think the legate-issue
script is getting confused, and reporting the legate commit hash for both Legion and Legate (or possibly the legion version was mis-recorded at installation time). @bryevdv care to take a quick look? Here's the relevant bit from Charlelie's legate-issue
output.
Legion : v23.11.00.dev-37-g08f0ce8
Legate : 23.11.00.dev+37.g08f0ce8
legate-issue
just does this:
def try_version(module_name: str, attr: str) -> str:
module = import_module(module_name)
return getattr(module, attr) if module else None
And then:
print(f"Legion : {try_version('legion_info', '__version__')}")
print(f"Legate : {try_version('legate', '__version__')}")
So if the reported legion version is not correct I think that must mean it's not getting correctly put into install_info.py
(which happens at build time, I think?)
Edit: I just noticed that is legion_info
not install_info
I don't know what generates that file
In any case when I run on recent install from branch-24.01
I get a similar result:
Legion : v23.11.00.dev-55-g2bda9ab
Legate : 23.11.00.dev+55.g2bda9ab
But these seem to actually match the installed modules (editable install into env old311
) :
In [1]: import legion_info; import legate
In [2]: legion_info
Out[2]: <module 'legion_info' from '/home/bryan/anaconda3/envs/old311/lib/python3.11/site-packages/legion_info.py'>
In [3]: legate
Out[3]: <module 'legate' from '/home/bryan/work/legate.core/legate/__init__.py'>
In [4]: legion_info.__version__
Out[4]: 'v23.11.00.dev-55-g2bda9ab'
In [5]: legate.__version__
Out[5]: '23.11.00.dev+55.g2bda9ab'
So the issue, whatever it is, does not seem to be in legate-issue
specifically.
Well, legion_info_build.py
does this:
def build_legion_info():
try:
long_version = subprocess.run(["git", "describe", "--tags"], capture_output=True).stdout.decode().strip()
except Exception:
long_version = "<unknown>"
with open("legion_info.py", "w") as f:
f.write(f"__version__ = {long_version!r}\n")
which presumably reports the legate
tags when the build downloads legion
itself and builds under _skbuild
~/work/legate.core branch-24.01
old311 ❯ git describe --tags
v23.11.00.dev-55-g2bda9ab
~/work/legate.core branch-24.01
old311 ❯ cd ./_skbuild/linux-x86_64-3.11/cmake-build/_deps/legion-src
~/work/legate.core/_skbuild/linux-x86_64-3.11/cmake-build/_deps/legion-src branch-24.01
old311 ❯ git describe --tags
v23.11.00.dev-55-g2bda9ab
Not really sure how this might be fixed for this build scenario that all happens inside one repo.
OK, I think the situation will become both simpler and more complicated once we switch to rolling our own Legion python bindings. We would likely not be generating a legion_info.py anymore, so we'd need to parse a Legion header to get this info, or query some utility Legion binary or something to that effect. And hopefully those are recording the Legion version correctly.
Just to slightly clarify, legion_info_build.py
is part of legion and executed by the legion build, but it's just not really useful unless the legion build is actually happening from a real checkout (I added the file two years ago, I either didn't appreciate that fact or maybe the build worked differently before scikit-build
)
@manopapad I just wanted to confirm something about the profiler: the legate build only installs the legacy python profiler, and not the rust profiler, is that correct?
install.py
should build (and hopefully install) the Rust profiler if you pass --profiler
.
Right, I forgot this option.
Now I have the legion_prof from the same Legion commit that was used to generate the profiles. But when I run legion_prof --view legate_0.prof
, I am getting:
Reading log file "legate_0.prof"...
Matched 1455809 objects
No Legion Spy data, skipping postprocess step
Sorting time ranges
thread 'main' panicked at src/backend/data_source.rs:382:75:
called `Option::unwrap()` on a `None` value
stack backtrace:
0: 0x104943678 - std::backtrace_rs::backtrace::libunwind::trace::h2966c6fbfac9d426
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
1: 0x104943678 - std::backtrace_rs::backtrace::trace_unsynchronized::h8a5f4aefe890b7c5
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x104943678 - std::sys_common::backtrace::_print_fmt::h7574dd98fd39c257
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:67:5
3: 0x104943678 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h298c9ab285ff3934
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:44:22
4: 0x1049629d8 - core::fmt::rt::Argument::fmt::hf9661447f7b99899
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/fmt/rt.rs:142:9
5: 0x1049629d8 - core::fmt::write::h4e276abdb6d0c2a1
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/fmt/mod.rs:1120:17
6: 0x104940ba4 - std::io::Write::write_fmt::hd421848f5f0bf9d0
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/io/mod.rs:1762:15
7: 0x1049434b4 - std::sys_common::backtrace::_print::h09e653c6686dbd70
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:47:5
8: 0x1049434b4 - std::sys_common::backtrace::print::hd8bd9ecab1f94b94
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:34:9
9: 0x1049449c0 - std::panicking::default_hook::{{closure}}::h520eeb743fc98fb4
10: 0x104944708 - std::panicking::default_hook::ha6550ffe49b63df1
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:292:9
11: 0x104944de8 - std::panicking::rust_panic_with_hook::hddb0e884a202de7c
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:779:13
12: 0x104944cc4 - std::panicking::begin_panic_handler::{{closure}}::hd2798398a2fd9077
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:649:13
13: 0x104943ae0 - std::sys_common::backtrace::__rust_end_short_backtrace::h9201cc364dbb8a23
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:170:18
14: 0x104944a84 - rust_begin_unwind
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
15: 0x1049a9540 - core::panicking::panic_fmt::h4d5168028d4c43c7
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
16: 0x1049a95b4 - core::panicking::panic::h40561ff494e2b577
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:127:5
17: 0x104304bb8 - legion_prof::backend::data_source::StateDataSource::new::ha8f8570daf8d31b5
18: 0x104326704 - legion_prof::backend::viewer::start::h1e032833108cf662
19: 0x1042577f8 - legion_prof::main::h8ab82f02e6baba83
20: 0x104290638 - std::sys_common::backtrace::__rust_begin_short_backtrace::h52bd5ed120afe124
21: 0x104290bfc - std::rt::lang_start::{{closure}}::h71d45874bc5edc4e
22: 0x10493b5cc - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h1a7c0e059d971da5
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:284:13
23: 0x10493b5cc - std::panicking::try::do_call::h07a34a23e615022b
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
24: 0x10493b5cc - std::panicking::try::h1111644420b4cc09
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
25: 0x10493b5cc - std::panic::catch_unwind::h31a3b9d6e2ef9973
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
26: 0x10493b5cc - std::rt::lang_start_internal::{{closure}}::h63c3452500a36531
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:48
27: 0x10493b5cc - std::panicking::try::do_call::h9c5c8a2a0a297bb7
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
28: 0x10493b5cc - std::panicking::try::h424cfcafca1bde97
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
29: 0x10493b5cc - std::panic::catch_unwind::h345d3d448041017f
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
30: 0x10493b5cc - std::rt::lang_start_internal::h5b246d44f1526226
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:20
31: 0x104290bd8 - std::rt::lang_start::h87d51f1055fa6b3b
32: 0x10425dfc0 - _main
On the local machine where I am trying to visualize the profiles, legate-issue
shows Legion : legion-23.09.0-4871-g04ee5be1d
, and on the machine where I generated the profiles it shows Legion : legion-23.09.0-4871-g04ee5be1d
, so same Legion commit on both machines.
@elliottslaughter is this still a version error?
@CharlelieLrt could you please share the legate_0.prof
file, and maybe @eddy16112 or @bryevdv can take a look as to why this is failing?
Here it is legate_0.txt
I tried the log file, and I am pretty sure that It is a version error. We have just added two changes to the profiler, one is the mapper call changes, the other is for the nsys folks. I tried to revert the second change, the log file still crashes. I am not able to revert the first change due to conflicts. Thus, I think the log file is generated with the pretty old version of Legion. @CharlelieLrt could you please figure out the commit of the Legion you are using, and go back to compile the Legion_prof using the old Legion?
The profiles were generated with Legion commit 04ee5be1dc3b742f195348c78458450f5dd35f44
from Fri Dec 8.
Legate was installed with ./install.py --profiler --max-dim 5 --openmp --legion-src-dir ../LEGION/legion_08122023/
where ../LEGION/legion_08122023/
is the same Legion commit (04ee5be1dc3b742f195348c78458450f5dd35f44
from Fri Dec 8.)
I am not sure what the --profiler
option does. Does it pull the profiler from the same Legion source dir given by --legion-src-dir
? If so, then they are using the same profiler version.
EDIT: I have tried to build separately the profiler (without using the legate install script), and it is giving me the same error.
After switching to 04ee5be1dc3b742f195348c78458450f5dd35f44, both python and rust profiler works for me. @manopapad I do not know what install.py
do, I think this will be a question for the legate team.
./install.py --profiler --legion-src-dir <LEGION_DIR>
will build the Rust profiler at the same commit as the overall Legion build (whatever version was checked out at <LEGION_DIR>
). Therefore if both the profile-generating machine and the profile-viewing machine were using a --legion-src-dir
at the same commit, the profile should be viewable.
I confirmed this behavior on my local machine. A Rust legion profiler fronted built at 04ee5be1dc3b742f195348c78458450f5dd35f44 can process the files just fine:
~/Downloads> ~/legate.core/_skbuild/macosx-14.0-x86_64-3.10/cmake-install/bin/legion_prof 930.txt
Reading log file "930.txt"...
Matched 1455809 objects
No Legion Spy data, skipping postprocess step
Sorting time ranges
Generating interactive visualization files in directory "legion_prof"
but when you pass --view
it fails:
~/Downloads> ~/legate.core/_skbuild/macosx-14.0-x86_64-3.10/cmake-install/bin/legion_prof --view 930.txt
Reading log file "930.txt"...
Matched 1455809 objects
No Legion Spy data, skipping postprocess step
Sorting time ranges
thread 'main' panicked at src/backend/data_source.rs:382:75:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Therefore, this is not a version issue, but a bug in the viewer.
It was a bug that I fixed in December https://gitlab.com/StanfordLegion/legion/-/commit/157aefb5dad298826f3850668f91ffec73bfbd74. @CharlelieLrt please update Legion to the latest control replication branch, or you can just cherry-pick the commit and re-compile the legion_prof, it should fix your problem.
I confirm that it works. Thanks for the help!
Software versions
Python : 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:09:17) [Clang 16.0.6 ] Platform : macOS-12.6.7-arm64-arm-64bit Legion : v23.11.00.dev-37-g08f0ce8 Legate : 23.11.00.dev+37.g08f0ce8 [0 - 102df8580] 0.000062 {4}{threads}: reservation ('CPU proc 1d00000000000003') cannot be satisfied Cunumeric : 23.11.00.dev+28.gfd46f051.dirty Numpy : 1.26.2 Scipy : 1.11.4 Numba : 0.58.1 CTK package : (failed to detect) GPU driver : (nvidia-smi missing) GPU devices : (nvidia-smi missing)
Observed behavior
After generating profiles I try to view them with
legion_prof --view legate_*.prof
. This prints a bunch of numbers, but does not actually start the profiler GUI.