openPMD / openPMD-api

:floppy_disk: C++ & Python API for Scientific I/O
https://openpmd-api.readthedocs.io
GNU Lesser General Public License v3.0

reloading mesh data in jupyter produces wrong data #1485

Closed PrometheusPi closed 1 year ago

PrometheusPi commented 1 year ago

Describe the bug
I am using openPMD-api from a Jupyter notebook on Summit (ORNL). When loading mesh data, I get wrong data in my numpy array that is not in the raw ADIOS2 data. This strange data only shows up randomly, and not on the first read. It looks like some data is not correctly freed. Deleting the io.Series and series.iterations[i] objects beforehand did not solve the problem. I only observed this strange behavior in data sets that mainly (but not necessarily only) contained zero values.

To Reproduce
Compile-able/executable code example to reproduce the problem:

import numpy as np
import openpmd_api as io
print(io.__version__)

sim_path_01 = "../runs_PWFA/001_PWFA_trojan_testRun/"

# for debugging I removed the objects - did not help
if "series" in locals():
    del series

if "it" in locals():
    del it

series = io.Series(sim_path_01 + "/simOutput/openPMD/simOutput_%T.bp", access=io.Access_Type.read_only)

it = series.iterations[0]

for c in range(10):
    h = it.meshes["E"]["y"]
    # slicing only registers the load; the data is valid after series.flush()
    E_y = h[:, :, :]
    #E_y_SI = h.unit_SI

    series.flush()
    print(np.sum(E_y))

The first run after a kernel restart produced:

0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0

The second run resulted in:

nan
0.0
nan
0.0
nan
0.0
nan
0.0
nan
0.0

(This is more or less random; the values can also be non-zero numbers, but the modulo-2 pattern persists.)

Expected behavior
I would expect the same data to be loaded every time (in my case, all zeros). That the data really is all zeros has been confirmed by running bpls simOutput_000000.bp/ --dump /data/0/fields/E/y | grep -v "0 0 0 0 0 0".

Software Environment

Additional context
Strangely, it seems to only affect the first iteration of PIConGPU output. Compression was not used.

PrometheusPi commented 1 year ago

Plot of data after first execution: [image]

Plot of data after second execution (used range(9) instead of range(10) to trigger data with nan): [image]

PrometheusPi commented 1 year ago

I could reproduce the same behavior in Python without Jupyter, by executing `python test.py` with `test.py` being:

import numpy as np
import openpmd_api as io
print(io.__version__)

sim_path_01 = "../runs_PWFA/001_PWFA_trojan_testRun/"

# for debugging I removed the objects - did not help
if "series" in locals():
    del series

if "it" in locals():
    del it

series = io.Series(sim_path_01 + "/simOutput/openPMD/simOutput_%T.bp", access=io.Access_Type.read_only)

it = series.iterations[0]

for c in range(10):
    h = it.meshes["E"]["y"]
    E_y = h[:, :, :]
    #E_y_SI = h.unit_SI

    series.flush()
    print(np.sum(E_y))

# for debugging I removed the objects - did not help
if "series" in locals():
    del series

if "it" in locals():
    del it

series = io.Series(sim_path_01 + "/simOutput/openPMD/simOutput_%T.bp", access=io.Access_Type.read_only)

it = series.iterations[0]

for c in range(10):
    h = it.meshes["E"]["y"]
    E_y = h[:, :, :]
    #E_y_SI = h.unit_SI

    series.flush()
    print(np.sum(E_y))

I get the following output:

0.15.1
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
/ccs/proj/......./modelling/lib/python3.10/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: invalid value encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
nan
0.0
nan
0.0
nan
0.0
nan
0.0
nan
0.0

franzpoeschel commented 1 year ago

This sounds... interesting.

What is the output of bpls -D simData_000000.bp? (This would show if there are undefined regions in the file; in the case of undefined regions, bpls will print all zeros, while the APIs will leave the regions uninitialized.) Does the behavior change when calling series.close() instead of del series (a new API in openPMD-api 0.15, better suited for Python than del)?
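For reference, a minimal sketch of what the close() variant could look like (untested illustration; path as in your reproducer):

import numpy as np
import openpmd_api as io

sim_path_01 = "../runs_PWFA/001_PWFA_trojan_testRun/"
series = io.Series(sim_path_01 + "/simOutput/openPMD/simOutput_%T.bp",
                   access=io.Access_Type.read_only)
it = series.iterations[0]

E_y = it.meshes["E"]["y"][:, :, :]
series.flush()
print(np.sum(E_y))

# close() releases the file deterministically, instead of relying on
# Python garbage collection as `del series` does
series.close()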

franzpoeschel commented 1 year ago

You can also try whether this behavior can be reproduced without openPMD-api, using ADIOS2 directly: https://adios2.readthedocs.io/en/latest/api_high/api_high.html#python-read-step-by-step-example. Depending on how much effort you want to put into debugging...
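Roughly along these lines (untested sketch; the exact high-level API differs between ADIOS2 versions, this assumes the 2.x adios2.open interface, and the file name stands in for one of your outputs):

import numpy as np
import adios2

# open one ADIOS2 file directly, bypassing openPMD-api
with adios2.open("simOutput_000000.bp", "r") as fh:
    for step in fh:
        # variable name as shown by bpls
        E_y = step.read("/data/0/fields/E/y")
        print(np.sum(E_y))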

PrometheusPi commented 1 year ago

The output of bpls -D simData_000000.bp is attached: openPMD.debug.txt

PrometheusPi commented 1 year ago

Using series.close() instead of del series produced the same behavior.

franzpoeschel commented 1 year ago

Do I see this correctly: there are unwritten regions?

  float     /data/0/fields/E/y                                        {1, 960, 768}
        step 0: 
          block  0: [0:0,   0: 41,   0:127]
          block  1: [0:0,  49:112,   0:127]
…
          block 32: [0:0,   0: 41, 256:383]
          block 33: [0:0,  49:112, 256:383]

The region 41 < y < 49 is not written by any block. This looks like a PIConGPU indexing bug to me. Was there recently a change to the indexing of fields in PIConGPU, or are you running an arcanely-configured simulation?
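As a quick sanity check (a hypothetical helper, not part of bpls or openPMD-api), the inclusive per-block bounds printed above can be tested for coverage:

# returns the indices in range(extent) not covered by any inclusive interval
def uncovered(intervals, extent):
    covered = set()
    for lo, hi in intervals:
        covered.update(range(lo, hi + 1))
    return sorted(set(range(extent)) - covered)

# y-bounds of blocks 0 and 1, transcribed from the bpls output above
print(uncovered([(0, 41), (49, 112)], 113))  # -> [42, 43, 44, 45, 46, 47, 48]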

PrometheusPi commented 1 year ago

That is very interesting. I am using today's dev version ad1f7a27b2a7d4f833830ef115b302314c92124c.

franzpoeschel commented 1 year ago

@psychocoderHPC do you have an idea what could be causing this?

PrometheusPi commented 1 year ago

Just for completeness: the ADIOS2 example code you linked produces:

variable_name: /data/0/fields/E/y
    AvailableStepsCount: 1
    Max: 0
    Min: 0
    Shape: 1, 960, 768
    SingleValue: false
    Type: float

PrometheusPi commented 1 year ago

@franzpoeschel my openPMD-api call in PIConGPU looks as follows:

--openPMD.period 1000  --openPMD.file simOutput  --openPMD.ext bp  --openPMD.range :,:,64  

franzpoeschel commented 1 year ago

My guess is that it would need a bit of randomization, such as running the thing twice, to trigger the undefined behavior. (Alternatively, initializing the buffer with a default value before flushing could work too.)
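Roughly this kind of check (untested sketch, reusing the setup from your reproducer; the path is a placeholder):

import numpy as np
import openpmd_api as io

series = io.Series("simOutput_%T.bp", io.Access_Type.read_only)
it = series.iterations[0]

# read the same dataset twice; entries that differ between the two reads
# can only come from uninitialized (unwritten) regions of the file
a = it.meshes["E"]["y"][:, :, :]
series.flush()
b = it.meshes["E"]["y"][:, :, :]
series.flush()
print(np.argwhere(a != b))  # indices with non-deterministic values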

But, having found a likely cause for this, I don't think we need to go there now.

PrometheusPi commented 1 year ago

Since I observed problems on Hemera V100 last week showing empty regions in the output that aligned with GPU boundaries, I thought a node was defective. Checking this with bpls -D shows e.g.:

  float     /data/0/fields/E/y                                        {1, 1984, 128}
        step 0:
          block  0: [0:0,    0:  63,   0:127]
          block  1: [0:0,    1:  64,   0:127]
          block  2: [0:0,   91: 154,   0:127]
          block  3: [0:0,  132: 195,   0:127]
          block  4: [0:0,  206: 269,   0:127]
          block  5: [0:0,  299: 362,   0:127]
          block  6: [0:0,  384: 447,   0:127]
          block  7: [0:0,  395: 458,   0:127]
          block  8: [0:0,  501: 564,   0:127]
          block  9: [0:0,  519: 582,   0:127]
          block 10: [0:0,  638: 701,   0:127]
          block 11: [0:0,  644: 707,   0:127]
          block 12: [0:0,  726: 789,   0:127]
          block 13: [0:0,  775: 838,   0:127]
          block 14: [0:0,  873: 936,   0:127]
          block 15: [0:0,  900: 963,   0:127]
          block 16: [0:0,  996:1059,   0:127]
          block 17: [0:0, 1030:1093,   0:127]
          block 18: [0:0, 1152:1215,   0:127]
          block 19: [0:0, 1192:1255,   0:127]
          block 20: [0:0, 1225:1288,   0:127]
          block 21: [0:0, 1302:1365,   0:127]
          block 22: [0:0, 1366:1429,   0:127]
          block 23: [0:0, 1428:1491,   0:127]
          block 24: [0:0, 1509:1572,   0:127]
          block 25: [0:0, 1600:1663,   0:127]
          block 26: [0:0, 1631:1694,   0:127]
          block 27: [0:0, 1728:1791,   0:127]
          block 28: [0:0, 1741:1804,   0:127]
          block 29: [0:0, 1856:1919,   0:127]
          block 30: [0:0, 1865:1928,   0:127]
          block 31: [0:0, 1958:1983,   0:127]

There are 32 GPUs in the y direction.

The range between 64 and 91 is not covered, while blocks 2 and 3 both cover 132-154.
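With the same kind of coverage check as above (hypothetical helper, inclusive bounds transcribed from the bpls output):

# y-bounds of blocks 0-3 from the listing above
intervals = [(0, 63), (1, 64), (91, 154), (132, 195)]
covered = set()
for lo, hi in intervals:
    covered.update(range(lo, hi + 1))
print(sorted(set(range(196)) - covered))  # -> [65, 66, ..., 90]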

PrometheusPi commented 1 year ago

Since this clearly is a PIConGPU issue, I will close this issue here and open an issue in the PIConGPU repo.

PrometheusPi commented 1 year ago

Thanks @franzpoeschel for your quick help.