Simulation becomes slow due to Legion warning1097

mw-k commented 2 years ago

Hello, I am a user of HTR-solver using an HPC system (unfortunately, a CPU-only system). I have a problem when I run the test case named "Franko".

While the simulation runs, the output shows the following message.

[21 - 2aab1f891880] 116.356580 {4}{runtime}: [warning 1097] LEGION WARNING: WARNING: The runtime has failed to memoize the trace more than 5 times, due to the absence of a replayable template. It is highly likely that trace 0 will not be memoized for the rest of execution. The most recent template was not replayable for the following reason: Remote shard not replyable. Please change the mapper to stop making memoization requests. (from file /home01/x2242a06/legion/runtime/legion/legion_trace.cc:2019) For more information see: http://legion.stanford.edu/messages/warning_code.html#warning_code_1097
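To see how often the runtime emits this warning over a long run, one option is to scan the job output for the `[warning NNNN]` tag. The sketch below is illustrative only: the function name `count_legion_warnings` is hypothetical, the sample line is abbreviated from the message above, and it assumes the tag always appears in the same bracketed form.

```python
import re
from collections import Counter

# Matches the "[warning NNNN]" tag as it appears in the message quoted above.
WARNING_TAG = re.compile(r"\[warning (\d+)\]")


def count_legion_warnings(lines):
    """Return a Counter mapping warning code -> number of occurrences."""
    counts = Counter()
    for line in lines:
        for code in WARNING_TAG.findall(line):
            counts[int(code)] += 1
    return counts


if __name__ == "__main__":
    # Abbreviated sample; a real run would iterate over the job's output file.
    sample = [
        "[21 - 2aab1f891880] 116.356580 {4}{runtime}: [warning 1097] LEGION WARNING: ...",
        "ordinary solver output",
    ]
    print(count_legion_warnings(sample))
```

Passing an open file object instead of a list works the same way, since the function only iterates over lines.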

The warning does not stop the simulation, but the simulation becomes slower and slower. The initial wall time per step is 3 sec, and it increases continuously.

[Figure: Franko_wall_time]

As shown in the figure, even after only 5000 steps the wall time has already increased to almost 10-15 sec. If I run the simulation to 50000 iterations, it would reach 100-150 sec at this rate.
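The 100-150 sec estimate scales the 5000-step value by ten. As a rough cross-check, the sketch below fits a straight line through the two reported points instead. It assumes the 10-15 sec figure is the per-step wall time at step 5000, that growth stays linear (which the thread does not establish), and uses a hypothetical helper name, `extrapolate_wall_time`.

```python
def extrapolate_wall_time(t0, t1, n1, n_target):
    """Linearly extrapolate per-step wall time.

    t0: wall time per step at step 0 (seconds)
    t1: wall time per step at step n1 (seconds)
    n_target: step at which to estimate the wall time
    """
    return t0 + (t1 - t0) * n_target / n1


if __name__ == "__main__":
    # Values reported in the post: 3 sec initially, 10-15 sec after 5000 steps.
    for t_5000 in (10.0, 15.0):
        estimate = extrapolate_wall_time(3.0, t_5000, 5000, 50000)
        print(f"{t_5000:.0f} sec at step 5000 -> about {estimate:.0f} sec at step 50000")
```

Under these assumptions the line gives roughly 73-123 sec per step at 50000 iterations, the same order of magnitude as the estimate in the post.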

Here is my first question: is this trend normal for the Franko case? My guess is that periodic variation in the wall time can be normal, while a continuous increase is unusual. My second question: if it is not normal, have you experienced this before? I would like to know how I can solve this problem.

mariodirenzo commented 2 years ago

Hi, I am sorry for the late reply. No, this is not the expected trend for the wall time. It should be constant after the first few steps. However, I am not able to replicate this issue. Could you provide some details about the system that you are using? To temporarily get rid of the problem you could recompile the code after commenting out the lines https://github.com/stanfordhpccenter/HTR-solver/blob/87945711619e3a19ab01ab080ad7802ea4e2a26a/src/prometeo.rg#L1740 and https://github.com/stanfordhpccenter/HTR-solver/blob/87945711619e3a19ab01ab080ad7802ea4e2a26a/src/prometeo.rg#L1746.

mw-k commented 2 years ago

I appreciate your kind reply :)

The following is the specification of the system.

There are two types of compute nodes.

One type consists of Intel Xeon Phi 7250 1.4GHz (Knights Landing). It has 1 CPU (68 cores) and 96GB of memory on each node, and the total number of nodes is 8,305.
The other consists of Intel Xeon 6148 2.4GHz (Skylake). It has 2 CPUs (40 cores) and 192GB of memory on each node, and the total number of nodes is 132.
While the network is OPA (Omni-Path), I used IBV (InfiniBand) for the conduit option.
The PBS job scheduler is used. If you need any more information, please let me know and I will give more details.

I will try your temporary solution and share the results.

Additionally, could I ask one more question?

I have a GPU workstation with four RTX3080 cards (I turned off the GASNET option since there is no network connecting the GPUs; they are all installed in one machine). When I tried to run HTR-solver with GPUs on this workstation, I noticed that the current Legion in the HTR-solver GitHub supports only older NVIDIA architectures.

So, I used the latest Legion provided on the Stanford GitLab, and HTR-solver was successfully installed with GPU_ARCH = AMPERE. However, there is an error during the simulation; the message is shown below. I guess that the error occurs because the RTX3080 has the sm_86 architecture, not sm_80.

prometeo_ConstPropMix.exec: /home/mwkim/htr_stanfrod/src/Utils/task_helper.hpp:136: void TaskHelper::base_gpu_wrapper(const Legion::Task, const std::vector<Legion::PhysicalRegion>&, Legion::Context, Legion::Runtime) [with T = LoadMixtureTask; Legion::Context = Legion::Internal::TaskContext*]: Assertion `task->arglen == 0' failed.

Another approach is to use the older architecture, so I set GPU_ARCH = VOLTA. It also installs, but the simulation does not work. In this case, the warning message says "the program installed in lower architecture than sm_80". Do you have any plan to update Legion and HTR-solver to be compatible with the RTX3080?
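For context on the sm_80 versus sm_86 guess, the CUDA binary-compatibility rule says that a cubin built for compute capability X.y runs on devices of capability X.z with z >= y. The sketch below encodes only that rule. The mapping of GPU_ARCH = AMPERE to 8.0 and GPU_ARCH = VOLTA to 7.0 is an assumption, the helper name `cubin_runs_on` is hypothetical, and PTX just-in-time compilation (which can bridge major versions when PTX is embedded) is deliberately left out.

```python
# Compute capabilities as (major, minor) tuples.
RTX_3080 = (8, 6)       # sm_86
AMPERE_TARGET = (8, 0)  # assumed target for GPU_ARCH = AMPERE (sm_80)
VOLTA_TARGET = (7, 0)   # assumed target for GPU_ARCH = VOLTA (sm_70)


def cubin_runs_on(built_for, device):
    """Binary compatibility: same major version, device minor >= build minor."""
    return built_for[0] == device[0] and device[1] >= built_for[1]


if __name__ == "__main__":
    print("sm_80 binary on RTX 3080:", cubin_runs_on(AMPERE_TARGET, RTX_3080))
    print("sm_70 binary on RTX 3080:", cubin_runs_on(VOLTA_TARGET, RTX_3080))
```

Under this rule alone, an sm_80 binary would be expected to load on an sm_86 device, while an sm_70 binary would not run natively, which is consistent with the VOLTA build failing.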

I hope I am not bothering you. Thank you

mariodirenzo commented 2 years ago

> While the network is OPA (Omni-Path), I used IBV (InfiniBand) for the conduit option.

This choice could be suboptimal in my experience. I would rather use the psm conduit as done for the HPC Kraken @ CERFACS.

> prometeo_ConstPropMix.exec: /home/mwkim/htr_stanfrod/src/Utils/task_helper.hpp:136: void TaskHelper::base_gpu_wrapper(const Legion::Task, const std::vector<Legion::PhysicalRegion>&, Legion::Context, Legion::Runtime) [with T = LoadMixtureTask; Legion::Context = Legion::Internal::TaskContext*]: Assertion `task->arglen == 0' failed.

Are you sure that you are starting from a clean build? This error usually happens when objects built with different setups are mixed in the same executable.

mw-k commented 2 years ago

On the first point: I tried psm first, but I got this error:

"configure error: Requested PMI support could not be found".

So I changed psm to ibv. I did not try to fix it at the time, but I should do so now...

On the second point: yes, I am quite sure, because I always remove everything and re-install.
