nv-legate / legate.core

The Foundation for All Legate Libraries
https://docs.nvidia.com/legate/24.06/
Apache License 2.0
186 stars 61 forks source link

[BUG] legion_prof not showing #930

Closed CharlelieLrt closed 6 months ago

CharlelieLrt commented 8 months ago

Software versions

Python : 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:09:17) [Clang 16.0.6 ] Platform : macOS-12.6.7-arm64-arm-64bit Legion : v23.11.00.dev-37-g08f0ce8 Legate : 23.11.00.dev+37.g08f0ce8 [0 - 102df8580] 0.000062 {4}{threads}: reservation ('CPU proc 1d00000000000003') cannot be satisfied Cunumeric : 23.11.00.dev+28.gfd46f051.dirty Numpy : 1.26.2 Scipy : 1.11.4 Numba : 0.58.1 CTK package : (failed to detect) GPU driver : (nvidia-smi missing) GPU devices : (nvidia-smi missing)

Observed behavior

After generating profiles I try to view them with legion_prof --view legate_*.prof. This prints a bunch of numbers, but does not actually start the profiler GUI.

manopapad commented 8 months ago

Could you please share the console output you're seeing (or a screenshot)?

CharlelieLrt commented 8 months ago

Here it is (it's truncated):

Reading log file "legate_0.prof"...
Reading log file "legate_1.prof"...
thread '<unnamed>' panicked at /Users/charlelie/Documents/RESEARCH/CODES/LEGATE/legate.core/_skbuild/macosx-12.0-arm64-3.10/cmake-build/_deps/legion-src/tools/legion_prof_rs/src/serialize.rs:1064:66:
called `Result::unwrap()` on an `Err` value: Error(Error { input: [90, 101, 114, 111, 84, 105, 109, 101, 32, 123, 105, 100, 58, 52, 54, 44, 32, 122, 101, 114, 111, 95, 116, 105, 109, 101, 58, 108, 111, 110, 103, 32, 108, 111, 110, 103, 58, 56, 125, 10, 77, 101, 109, 68, 
[...]
 2, 62, 215, 2, 0, 0, 0, 45, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 29, 85, 27, 0, 0, 0, 0, 0, 0, 207, 198, 62, 215, 2, 0, 0, 0, 55, 200, 62, 215, 2, 0, 0, 0], code: Tag })
manopapad commented 8 months ago

Do you mind sharing your profiler dump file? This sounds like an issue in the profiler script. @eddy16112 @bryevdv or @elliottslaughter could possibly take a look. Folks, note the Legion commit from Charlelie's message above: g08f0ce8.

Are you collecting the profile on the same machine as the one where you're trying to view it? This doesn't need to be the case, but the dump should ideally be generated by the same version of Legion as the one backing the profile viewer (although there should be a check at this point for that...).

elliottslaughter commented 8 months ago

This is a parse error. It happens when you use a different version of the profiler than what Legion the application was built with.

manopapad commented 8 months ago

@eddy16112 I thought you had added a version compatibility check that would check for this situation? Am I misremembering?

elliottslaughter commented 8 months ago

It's blocked on me figuring out the best way to embed the version number in Rust.

CharlelieLrt commented 7 months ago

@manopapad I am a little confused, 08f0ce8 is the hash of a Legate commit, not a Legion commit. What is even more surprising, is that I have tried to use the option --legate-branch to specify a Legion commit when building legate.

So, the profiles were generated with a version of legate built with --legion-branch 04ee5be1dc3b742f195348c78458450f5dd35f44 and that seemed to work, but legate-issue shows Legion: v23.11.00.dev-37-gac081ac.

I am trying to visualize the profiles on another machine. I have tried to rebuild legate with --legion-branch 04ee5be1dc3b742f195348c78458450f5dd35f44, but legate-issue shows v23.11.00.dev-54-g40f6061. So I guess I misunderstood how the option --legion-branch works, as it does not seem to be picking up a commit hash. Is there a way to directly specify a commit number to the install script?

I also don't understand the commit hash shown by legate-issue. Is there another way to check which Legion commit I am using?

manopapad commented 7 months ago

I think the legate-issue script is getting confused, and reporting the legate commit hash for both Legion and Legate (or possibly the legion version was mis-recorded at installation time). @bryevdv care to take a quick look? Here's the relevant bit from Charlelie's legate-issue output.

Legion : v23.11.00.dev-37-g08f0ce8
Legate : 23.11.00.dev+37.g08f0ce8
bryevdv commented 7 months ago

legate-issue just does this:

def try_version(module_name: str, attr: str) -> str:
    module = import_module(module_name)
    return getattr(module, attr) if module else None

And then:

    print(f"Legion      :  {try_version('legion_info', '__version__')}")
    print(f"Legate      :  {try_version('legate', '__version__')}")

So if the reported legion version is not correct I think that must mean it's not getting correctly put into install_info.py (which happens at build time, I think?)

Edit: I just noticed that is legion_info not install_info I don't know what generates that file

bryevdv commented 7 months ago

In any case when I run on recent install from branch-24.01 I get a similar result:

Legion      :  v23.11.00.dev-55-g2bda9ab
Legate      :  23.11.00.dev+55.g2bda9ab

But these seem to actually match the installed modules (editable install into env old311) :

In [1]: import legion_info; import legate

In [2]: legion_info
Out[2]: <module 'legion_info' from '/home/bryan/anaconda3/envs/old311/lib/python3.11/site-packages/legion_info.py'>

In [3]: legate
Out[3]: <module 'legate' from '/home/bryan/work/legate.core/legate/__init__.py'>

In [4]: legion_info.__version__
Out[4]: 'v23.11.00.dev-55-g2bda9ab'

In [5]: legate.__version__
Out[5]: '23.11.00.dev+55.g2bda9ab'

So the issue, whatever it is, does not seem to be in legate-issue specifically.

bryevdv commented 7 months ago

Well, legion_info_build.py does this:

def build_legion_info():
    try:
        long_version = subprocess.run(["git", "describe", "--tags"], capture_output=True).stdout.decode().strip()
    except Exception:
        long_version = "<unknown>"

    with open("legion_info.py", "w") as f:
        f.write(f"__version__ = {long_version!r}\n")

which presumably reports the legate tags when the build downloads legion itself and builds under _skbuild

~/work/legate.core branch-24.01
old311 ❯ git describe --tags
v23.11.00.dev-55-g2bda9ab

~/work/legate.core branch-24.01
old311 ❯ cd ./_skbuild/linux-x86_64-3.11/cmake-build/_deps/legion-src

~/work/legate.core/_skbuild/linux-x86_64-3.11/cmake-build/_deps/legion-src branch-24.01
old311 ❯ git describe --tags                                         
v23.11.00.dev-55-g2bda9ab

Not really sure how this might be fixed for this build scenario that all happens inside one repo.

manopapad commented 7 months ago

OK, I think the situation will become both simpler and more complicated once we switch to rolling our own Legion python bindings. We would likely not be generating a legion_info.py anymore, so we'd need to parse a Legion header to get this info, or query some utility Legion binary or something to that effect. And hopefully those are recording the Legion version correctly.

bryevdv commented 7 months ago

Just to slightly clarify, legion_info_build.py is part of legion and executed by the legion build, but it's just not really useful unless the legion build is actually happening from a real checkout (I added the file two years ago, I either didn't appreciate that fact or maybe the build worked differently before scikit-build)

CharlelieLrt commented 7 months ago

@manopapad I just wanted to confirm something about the profiler: the legate build only installs the legacy python profiler, and not the rust profiler, is that correct?

manopapad commented 7 months ago

install.py should build (and hopefully install) the Rust profiler if you pass --profiler.

CharlelieLrt commented 7 months ago

Right, I forgot this option.

Now I have the legion_prof from the same Legion commit that was used to generate the profiles. But when I run legion_prof --view legate_0.prof, I am getting:

Reading log file "legate_0.prof"...
Matched 1455809 objects
No Legion Spy data, skipping postprocess step
Sorting time ranges
thread 'main' panicked at src/backend/data_source.rs:382:75:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0:        0x104943678 - std::backtrace_rs::backtrace::libunwind::trace::h2966c6fbfac9d426
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1:        0x104943678 - std::backtrace_rs::backtrace::trace_unsynchronized::h8a5f4aefe890b7c5
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x104943678 - std::sys_common::backtrace::_print_fmt::h7574dd98fd39c257
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:67:5
   3:        0x104943678 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h298c9ab285ff3934
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:44:22
   4:        0x1049629d8 - core::fmt::rt::Argument::fmt::hf9661447f7b99899
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/fmt/rt.rs:142:9
   5:        0x1049629d8 - core::fmt::write::h4e276abdb6d0c2a1
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/fmt/mod.rs:1120:17
   6:        0x104940ba4 - std::io::Write::write_fmt::hd421848f5f0bf9d0
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/io/mod.rs:1762:15
   7:        0x1049434b4 - std::sys_common::backtrace::_print::h09e653c6686dbd70
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:47:5
   8:        0x1049434b4 - std::sys_common::backtrace::print::hd8bd9ecab1f94b94
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:34:9
   9:        0x1049449c0 - std::panicking::default_hook::{{closure}}::h520eeb743fc98fb4
  10:        0x104944708 - std::panicking::default_hook::ha6550ffe49b63df1
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:292:9
  11:        0x104944de8 - std::panicking::rust_panic_with_hook::hddb0e884a202de7c
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:779:13
  12:        0x104944cc4 - std::panicking::begin_panic_handler::{{closure}}::hd2798398a2fd9077
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:649:13
  13:        0x104943ae0 - std::sys_common::backtrace::__rust_end_short_backtrace::h9201cc364dbb8a23
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:170:18
  14:        0x104944a84 - rust_begin_unwind
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
  15:        0x1049a9540 - core::panicking::panic_fmt::h4d5168028d4c43c7
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
  16:        0x1049a95b4 - core::panicking::panic::h40561ff494e2b577
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:127:5
  17:        0x104304bb8 - legion_prof::backend::data_source::StateDataSource::new::ha8f8570daf8d31b5
  18:        0x104326704 - legion_prof::backend::viewer::start::h1e032833108cf662
  19:        0x1042577f8 - legion_prof::main::h8ab82f02e6baba83
  20:        0x104290638 - std::sys_common::backtrace::__rust_begin_short_backtrace::h52bd5ed120afe124
  21:        0x104290bfc - std::rt::lang_start::{{closure}}::h71d45874bc5edc4e
  22:        0x10493b5cc - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h1a7c0e059d971da5
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:284:13
  23:        0x10493b5cc - std::panicking::try::do_call::h07a34a23e615022b
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
  24:        0x10493b5cc - std::panicking::try::h1111644420b4cc09
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
  25:        0x10493b5cc - std::panic::catch_unwind::h31a3b9d6e2ef9973
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
  26:        0x10493b5cc - std::rt::lang_start_internal::{{closure}}::h63c3452500a36531
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:48
  27:        0x10493b5cc - std::panicking::try::do_call::h9c5c8a2a0a297bb7
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
  28:        0x10493b5cc - std::panicking::try::h424cfcafca1bde97
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
  29:        0x10493b5cc - std::panic::catch_unwind::h345d3d448041017f
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
  30:        0x10493b5cc - std::rt::lang_start_internal::h5b246d44f1526226
                               at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:20
  31:        0x104290bd8 - std::rt::lang_start::h87d51f1055fa6b3b
  32:        0x10425dfc0 - _main

On the local machine where I am trying to visualize the profiles, legate-issue shows Legion : legion-23.09.0-4871-g04ee5be1d, and on the machine where I generated the profiles it shows Legion : legion-23.09.0-4871-g04ee5be1d, so same Legion commit on both machines.

manopapad commented 7 months ago

@elliottslaughter is this still a version error?

@CharlelieLrt could you please share the legate_0.prof file, and maybe @eddy16112 or @bryevdv can take a look as to why this is failing?

CharlelieLrt commented 7 months ago

Here it is legate_0.txt

eddy16112 commented 7 months ago

I tried the log file, and I am pretty sure that It is a version error. We have just added two changes to the profiler, one is the mapper call changes, the other is for the nsys folks. I tried to revert the second change, the log file still crashes. I am not able to revert the first change due to conflicts. Thus, I think the log file is generated with the pretty old version of Legion. @CharlelieLrt could you please figure out the commit of the Legion you are using, and go back to compile the Legion_prof using the old Legion?

CharlelieLrt commented 7 months ago

The profiles were generated with Legion commit 04ee5be1dc3b742f195348c78458450f5dd35f44 from Fri Dec 8. Legate was installed with ./install.py --profiler --max-dim 5 --openmp --legion-src-dir ../LEGION/legion_08122023/ where ../LEGION/legion_08122023/ is the same Legion commit (04ee5be1dc3b742f195348c78458450f5dd35f44 from Fri Dec 8.)

I am not sure what the --profiler option does. Does it pull the profiler from the same Legion source dir given by --legion-src-dir? If so, then they are using the same profiler version.

EDIT: I have tried to build separately the profiler (without using the legate install script), and it is giving me the same error.

eddy16112 commented 7 months ago

After switching to 04ee5be1dc3b742f195348c78458450f5dd35f44, both python and rust profiler works for me. @manopapad I do not know what install.py do, I think this will be a question for the legate team.

manopapad commented 7 months ago

./install.py --profiler --legion-src-dir <LEGION_DIR> will build the Rust profiler at the same commit as the overall Legion build (whatever version was checked out at <LEGION_DIR>). Therefore if both the profile-generating machine and the profile-viewing machine were using a --legion-src-dir at the same commit, the profile should be viewable.

I confirmed this behavior on my local machine. A Rust legion profiler fronted built at 04ee5be1dc3b742f195348c78458450f5dd35f44 can process the files just fine:

~/Downloads> ~/legate.core/_skbuild/macosx-14.0-x86_64-3.10/cmake-install/bin/legion_prof 930.txt
Reading log file "930.txt"...
Matched 1455809 objects
No Legion Spy data, skipping postprocess step
Sorting time ranges
Generating interactive visualization files in directory "legion_prof"

but when you pass --view it fails:

~/Downloads> ~/legate.core/_skbuild/macosx-14.0-x86_64-3.10/cmake-install/bin/legion_prof --view 930.txt
Reading log file "930.txt"...
Matched 1455809 objects
No Legion Spy data, skipping postprocess step
Sorting time ranges
thread 'main' panicked at src/backend/data_source.rs:382:75:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Therefore, this is not a version issue, but a bug in the viewer.

eddy16112 commented 7 months ago

It was a bug that I fixed in December https://gitlab.com/StanfordLegion/legion/-/commit/157aefb5dad298826f3850668f91ffec73bfbd74. @CharlelieLrt please update Legion to the latest control replication branch, or you can just cherry-pick the commit and re-compile the legion_prof, it should fix your problem.

CharlelieLrt commented 6 months ago

I confirm that it works. Thanks for the help!