lanl / PPT

Performance Prediction Toolkit

PPT-GPU - faulty input or simulation bug? #1

Closed lorenzbraun closed 4 years ago

lorenzbraun commented 4 years ago

Hi, I am currently trying to use PPT-GPU to make predictions on some workloads, and many simulation files do not run as expected. I am not sure whether my faulty input is to blame or something else is not working.

The output when running one of my simulations is the following:

python simulations/V100_3mm.exe_s1__Z11mm3_kernel3PfS_S_m_16_32_0.py
===========================================
----------SIMIAN-PIE PDES ENGINE-----------
===========================================
MPI: OFF
('[ERROR] with instruction: ', ['iALU', 4, 120], 'at index', 120)
Traceback (most recent call last):
  File "simulations/V100_3mm.exe_s1__Z11mm3_kernel3PfS_S_m_16_32_0.py", line 94, in <module>
    simianEngine.run()
  File "PPT/code/simian/simian-master/SimianPie/simian.py", line 127, in run
    service(event["data"], event["tx"], event["txId"]) #Receive
  File "simulations/V100_3mm.exe_s1__Z11mm3_kernel3PfS_S_m_16_32_0.py", line 59, in GPU_APP_Handler
    self.startProcess("app", self)
  File "PPT/code/simian/simian-master/SimianPie/entity.py", line 101, in startProcess
    return proc.wake(proc, *args)
  File "PPT/code/simian/simian-master/SimianPie/process.py", line 47, in wake
    return co.switch(*args)
  File "simulations/V100_3mm.exe_s1__Z11mm3_kernel3PfS_S_m_16_32_0.py", line 48, in app
    core.time_compute(GPU_tasklist, simianEngine.now, True)
  File "PPT/code/hardware/processors_new.py", line 2480, in time_compute
    self.node.accelerators[item[1]].kernel_call(item[2], item[3], item[4], item[5], item[6], item[7], start)
  File "PPT/code/hardware/accelerators.py", line 107, in kernel_call
    self.step(block_list, cycles, stats)
  File "PPT/code/hardware/accelerators.py", line 208, in step
    warps_issued+=block.step(warps_issued, cycles)
  File "PPT/code/hardware/accelerators.py", line 249, in step
    if warp.step(cycles):
  File "PPT/code/hardware/accelerators.py", line 297, in step
    if not self.process_inst(cycles):
  File "PPT/code/hardware/accelerators.py", line 316, in process_inst
    if max_dep < self.completions[i]:
IndexError: list index out of range
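Reading the traceback, the failure looks like an instruction carrying a dependency index (120 here) that is larger than the number of completion times recorded so far. A minimal sketch of that pattern (the name `completions` comes from the traceback; the surrounding logic is an assumption, not PPT's actual code):

```python
# Minimal reconstruction of the failing pattern from the traceback.
# An instruction entry carries optional dependency indices after its
# opcode and latency, e.g. ['iALU', 4, 120] depends on instruction 120.

def max_dependency_ready(inst, completions):
    """Return the latest completion cycle among inst's dependencies."""
    max_dep = 0
    for i in inst[2:]:                # dependency indices, if any
        if max_dep < completions[i]:  # IndexError if i >= len(completions)
            max_dep = completions[i]
    return max_dep

completions = [3, 5, 7]               # only 3 instructions completed so far
inst = ['iALU', 4, 120]               # depends on instruction 120
try:
    max_dependency_ready(inst, completions)
except IndexError:
    print("dependency index out of range")  # what the simulator hits
```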

You will find the simulation file here: https://gist.github.com/lorenzbraun/016b8c9bba4fcce5673f7aa5108a042f

The PPT repository is expected to be in the same folder as the script (I changed the path.append calls in lines 11 and 13).

Is the input faulty, or does the problem lie somewhere else? I appreciate any help.

Best regards Lorenz

chennupati commented 4 years ago

Hi Lorenz,

Did you check whether you have the package "greenlet" installed?

pip install greenlet

lorenzbraun commented 4 years ago

Hi chennupati, thanks for the tip. greenlet is already installed. I am using Anaconda 2, by the way. Any other hints?

yehiaArafa commented 4 years ago

Hi Lorenz,

There is a problem with a dependency in your instruction tasklist. If you can upload your PTX file, I can try to regenerate the tasklist with the PTXParser file and see exactly what is going wrong.

lorenzbraun commented 4 years ago

hi yehia,

Here is my PTX file: https://gist.github.com/lorenzbraun/83ca2f0693f397d53b24f00c31a0f954 The kernel I used for the simulation was _Z11mm3_kernel3PfS_S_m.

yehiaArafa commented 4 years ago

Hi Lorenz,

I have generated the tasklist and it worked fine. How did you get your tasklist? This is the command I used: python PTXParser.py 3mm.ptx _Z11mm3_kernel3PfS_S_m 120 Volta > tasklist.txt

For example, I am assuming that the last BB has a loop count of 120, because there is only one loop in this kernel.

lorenzbraun commented 4 years ago

I used a loop count of 512. I also have other workloads with even higher loop counts. The tasklists can get really large, which I guess might be a problem. Is there an upper limit on the tasklist length?

yehiaArafa commented 4 years ago

No, I have tried up to 12k loop count and it worked fine. It will take a little more time of course.

Here is the tasklist for 512, I have generated it the same way I told you in the above comment.

https://gist.github.com/yehiaArafa/3dfbf3aa9627370e740b52b5c66b83bf

Perhaps the problem is with your python version. I am using python 2.7 here.

lorenzbraun commented 4 years ago

Thanks a lot! Your tasklist works. I guess I just need to find out why my tasklists are broken.

Here is an extract of the broken tasklist:

[['PARAM_MEM_ACCESS', 'LOAD'],
 ['PARAM_MEM_ACCESS', 'LOAD'],
 ['PARAM_MEM_ACCESS', 'LOAD'],
 ['PARAM_MEM_ACCESS', 'LOAD'],
 ['iALU', 4],
 ['iALU', 4],
 ['iALU', 4],
 ['iALU', 4],
 ['iALU', 4],
 ['iALU', 4, 120],
 ['iALU', 4],
...

The additional number is a dependency, right? I think the error comes from dependencies that refer to instructions which do not exist yet.
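A simple pre-flight check can catch such broken tasklists before running the simulation. A minimal sketch, assuming the entry layout shown in the extract above ([opcode, latency, dep0, dep1, ...], with memory-access entries carrying no dependencies):

```python
def validate_tasklist(task_list):
    """Return (index, instruction, dep) triples for every dependency
    that refers to the current or a later instruction."""
    errors = []
    for idx, inst in enumerate(task_list):
        for dep in inst[2:]:              # dependency indices, if any
            if isinstance(dep, int) and dep >= idx:
                errors.append((idx, inst, dep))
    return errors

broken = [
    ['PARAM_MEM_ACCESS', 'LOAD'],
    ['iALU', 4],
    ['iALU', 4, 120],   # depends on instruction 120, which does not exist yet
]
print(validate_tasklist(broken))   # -> [(2, ['iALU', 4, 120], 120)]
```

An empty result means every dependency points backwards, as the simulator expects.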

yehiaArafa commented 4 years ago

Exactly, the dependency is broken. I am not exactly sure how or why this issue is happening. Please let us know if you figured it out.

lorenzbraun commented 4 years ago

Hi Yehia, I found where the broken dependencies came from. Since I have many benchmarks, I was automating the tasklist generation: the PTXParser is imported by another Python script that feeds all the PTX files and kernel names I want to simulate into the parser. PTXParser.py uses global variables to manage its internal state, and when it is called multiple times, those globals are only initialized once at the very beginning. I have fixed it. It's ugly, but it was the shortest fix that should work.

Added this function:

+def init_globals():
+    global out_inst_ctr
+    global dependencies
+    global num_of_loops
+    global provided_loops
+    # For dependencies 
+    out_inst_ctr = 0
+    dependencies = {}
+
+    #For error checking
+    num_of_loops = 0
+    provided_loops = 0
+

and call it every time get_tasklist is invoked:

 def get_tasklist(structure, alu_latencies):
+    init_globals()
     task_list = []
     for blocks in structure:
         task_list += get_block_tasklist(blocks, alu_latencies)
...
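Resetting the globals per call works, but a less fragile alternative would be to hold the parser state in an object so that each run starts clean by construction. A sketch of the idea (the class and the stubbed `get_block_tasklist` are illustrative, not PPT's actual code):

```python
class TasklistState(object):
    """Per-run parser state, replacing the module-level globals."""

    def __init__(self):
        # For dependencies
        self.out_inst_ctr = 0
        self.dependencies = {}
        # For error checking
        self.num_of_loops = 0
        self.provided_loops = 0


def get_block_tasklist(block, alu_latencies, state):
    # Stand-in for the real per-block parser: emit one instruction and
    # advance the running counter held in the state object.
    state.out_inst_ctr += 1
    return [['iALU', alu_latencies.get('iALU', 4)]]


def get_tasklist(structure, alu_latencies):
    state = TasklistState()          # fresh state on every invocation
    task_list = []
    for blocks in structure:
        task_list += get_block_tasklist(blocks, alu_latencies, state)
    return task_list


# Repeated invocations (e.g. batch-processing many PTX files) each start
# from a clean counter, so no state leaks between kernels.
print(len(get_tasklist([['bb0'], ['bb1']], {'iALU': 4})))  # -> 2
print(len(get_tasklist([['bb0']], {'iALU': 4})))           # -> 1
```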