Plot of data after first execution: (plot not shown)

Plot of data after second execution (used `range(9)` instead of `range(10)` to trigger data with `nan`): (plot not shown)
I could reproduce the same behavior in Python without Jupyter, by executing `python test.py` with `test.py` being:
```python
import numpy as np
import openpmd_api as io

print(io.__version__)

sim_path_01 = "../runs_PWFA/001_PWFA_trojan_testRun/"

# for debugging I removed the objects - did not help
if "series" in locals():
    del series
if "it" in locals():
    del it

series = io.Series(sim_path_01 + "/simOutput/openPMD/simOutput_%T.bp",
                   access=io.Access_Type.read_only)
it = series.iterations[0]

# first pass: read E_y ten times and print the sum each time
for c in range(10):
    h = it.meshes["E"]["y"]
    E_y = h[:, :, :]
    #E_y_SI = h.unit_SI
    series.flush()
    print(np.sum(E_y))

# for debugging I removed the objects - did not help
if "series" in locals():
    del series
if "it" in locals():
    del it

# second, identical pass: re-open the series and read again
series = io.Series(sim_path_01 + "/simOutput/openPMD/simOutput_%T.bp",
                   access=io.Access_Type.read_only)
it = series.iterations[0]

for c in range(10):
    h = it.meshes["E"]["y"]
    E_y = h[:, :, :]
    #E_y_SI = h.unit_SI
    series.flush()
    print(np.sum(E_y))
```
I get the following output:
```
0.15.1
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
/ccs/proj/......./modelling/lib/python3.10/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: invalid value encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
nan
0.0
nan
0.0
nan
0.0
nan
0.0
nan
0.0
```
This sounds... interesting.

What is the output of `bpls -D simData_000000.bp`? (This would show whether there are undefined regions in the file.) In the case of undefined regions, `bpls` will print all zeros, while the APIs will leave those regions uninitialized.
Does the behavior change when calling `series.close()` instead of `del series` (a new API in openPMD-api 0.15, better suited for Python than `del`)?
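For reference, a minimal sketch of the `close()` variant (same read logic as the `test.py` above, only the teardown changes; path and dataset names are taken from this thread):

```python
import numpy as np
import openpmd_api as io

series = io.Series(
    "../runs_PWFA/001_PWFA_trojan_testRun/simOutput/openPMD/simOutput_%T.bp",
    io.Access_Type.read_only,
)
it = series.iterations[0]
E_y = it.meshes["E"]["y"][:, :, :]
series.flush()
print(np.sum(E_y))
# explicit close instead of relying on `del` / garbage collection
series.close()
```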
You can also check whether this behavior can be reproduced without openPMD-api, using ADIOS2 directly: https://adios2.readthedocs.io/en/latest/api_high/api_high.html#python-read-step-by-step-example — depending on how much effort you want to put into debugging.
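A rough sketch of such a direct read, following the linked step-by-step example (high-level ADIOS2 Python API as documented for pre-2.10 releases; the file name and variable path are the ones appearing elsewhere in this thread):

```python
import adios2

# iterate over the steps in the file and read the full E_y variable each step
with adios2.open("simOutput_000000.bp", "r") as fh:
    for fstep in fh:
        e_y = fstep.read("/data/0/fields/E/y")
        print("step", fstep.current_step(), "sum:", e_y.sum())
```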
The output of `bpls -D simData_000000.bp` is attached: openPMD.debug.txt
Using `series.close()` instead of `del series` produced the same behavior.
Do I see it correctly that there are unwritten regions?

```
float /data/0/fields/E/y {1, 960, 768}
  step 0:
    block  0: [0:0,  0: 41,   0:127]
    block  1: [0:0, 49:112,   0:127]
    …
    block 32: [0:0,  0: 41, 256:383]
    block 33: [0:0, 49:112, 256:383]
```
The blocks for 41 < y < 49 are not written. This looks like a PIConGPU indexing bug to me. Was there recently a change in the indexing of fields in PIConGPU, or are you running an arcanely configured simulation?
That is very interesting. I am using today's dev version ad1f7a27b2a7d4f833830ef115b302314c92124c.

@psychocoderHPC do you have an idea what could be causing this?
Just for completeness: the ADIOS2 example code you linked produces:

```
variable_name: /data/0/fields/E/y
    AvailableStepsCount: 1
    Max: 0
    Min: 0
    Shape: 1, 960, 768
    SingleValue: false
    Type: float
```
@franzpoeschel my openPMD-api call in PIConGPU looks as follows:

```
--openPMD.period 1000 --openPMD.file simOutput --openPMD.ext bp --openPMD.range :,:,64
```
My guess is that it would need a bit of randomization, such as "run the thing twice", to trigger the undefined behavior. (Alternatively, initializing the buffer with a default value before flushing could work too.)
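If one did want to test that guess, a hypothetical sketch (not code from this thread): read the dataset twice and compare element-wise. Regions that were actually written agree bit-for-bit, while uninitialized regions generally do not. Note that within a single process the allocator may hand back the same pages both times, so two separate runs are the more reliable test:

```python
import numpy as np
import openpmd_api as io

def read_e_y(path):
    series = io.Series(path, io.Access_Type.read_only)
    data = series.iterations[0].meshes["E"]["y"][:, :, :]
    series.flush()
    series.close()
    return data

a = read_e_y("simOutput_%T.bp")
b = read_e_y("simOutput_%T.bp")
# NaN != NaN, so NaN-filled uninitialized regions also count as differing
print("differing elements:", np.count_nonzero(a != b))
```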
But, having found a likely cause for this, I don't think we need to go there now.
Since I observed problems on hemera v100 last week showing empty regions in the output that aligned with GPU boundaries, I thought a node was defective. Checking this with `bpls -D` shows e.g.:
```
float /data/0/fields/E/y {1, 1984, 128}
  step 0:
    block  0: [0:0,    0:  63, 0:127]
    block  1: [0:0,    1:  64, 0:127]
    block  2: [0:0,   91: 154, 0:127]
    block  3: [0:0,  132: 195, 0:127]
    block  4: [0:0,  206: 269, 0:127]
    block  5: [0:0,  299: 362, 0:127]
    block  6: [0:0,  384: 447, 0:127]
    block  7: [0:0,  395: 458, 0:127]
    block  8: [0:0,  501: 564, 0:127]
    block  9: [0:0,  519: 582, 0:127]
    block 10: [0:0,  638: 701, 0:127]
    block 11: [0:0,  644: 707, 0:127]
    block 12: [0:0,  726: 789, 0:127]
    block 13: [0:0,  775: 838, 0:127]
    block 14: [0:0,  873: 936, 0:127]
    block 15: [0:0,  900: 963, 0:127]
    block 16: [0:0,  996:1059, 0:127]
    block 17: [0:0, 1030:1093, 0:127]
    block 18: [0:0, 1152:1215, 0:127]
    block 19: [0:0, 1192:1255, 0:127]
    block 20: [0:0, 1225:1288, 0:127]
    block 21: [0:0, 1302:1365, 0:127]
    block 22: [0:0, 1366:1429, 0:127]
    block 23: [0:0, 1428:1491, 0:127]
    block 24: [0:0, 1509:1572, 0:127]
    block 25: [0:0, 1600:1663, 0:127]
    block 26: [0:0, 1631:1694, 0:127]
    block 27: [0:0, 1728:1791, 0:127]
    block 28: [0:0, 1741:1804, 0:127]
    block 29: [0:0, 1856:1919, 0:127]
    block 30: [0:0, 1865:1928, 0:127]
    block 31: [0:0, 1958:1983, 0:127]
```
32 GPUs in y direction.
y indices 65 to 90 are not covered (block 1 ends at 64, block 2 starts at 91), while blocks 2 and 3 overlap: both cover 132-154.
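To make such gaps and overlaps explicit, here is a small hypothetical helper (not part of this thread) that takes the inclusive y ranges printed by `bpls -D` and reports uncovered intervals:

```python
def find_y_gaps(blocks, extent):
    """blocks: list of inclusive (y_start, y_end) ranges; extent: total y size."""
    covered = [False] * extent
    for lo, hi in blocks:
        for y in range(lo, hi + 1):
            covered[y] = True
    gaps, start = [], None
    # sentinel True closes a gap that runs to the end of the extent
    for y, c in enumerate(covered + [True]):
        if not c and start is None:
            start = y
        elif c and start is not None:
            gaps.append((start, y - 1))
            start = None
    return gaps

# y ranges of blocks 0-3 from the listing above:
print(find_y_gaps([(0, 63), (1, 64), (91, 154), (132, 195)], 1984))
# -> [(65, 90), (196, 1983)]   (196+ only because blocks 4-31 are omitted here)
```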
Since this clearly is a PIConGPU issue, I will close this issue here and open an issue in the PIConGPU repo.
Thanks @franzpoeschel for your quick help.
**Describe the bug**
I am using openPMD-api from a Jupyter notebook on Summit (ORNL). When loading mesh data, I get wrong data in my numpy array that is not present in the raw ADIOS2 data. This strange data only shows up randomly, and not on the first read. It looks like some data is not correctly freed. Deleting the `io.Series` and `series.iterations[i]` objects beforehand did not solve the problem. I only observed that strange behavior in data sets that mainly (but not necessarily only) contained zero values.

**To Reproduce**
Compile-able/executable code example to reproduce the problem: see `test.py` above.

The first run after a kernel restart produced: (output not shown)

The second run resulted in: (output not shown)

(This is more or less random and can also be finite numbers, but the modulo-2 pattern persists.)

**Expected behavior**
I would expect the same data to be loaded every time (in my case, all zeros). All zeros has been confirmed by running `bpls simOutput_000000.bp/ --dump /data/0/fields/E/y | grep -v "0 0 0 0 0 0"`.

**Software Environment**
openPMD-api 0.15.1 (as printed by `io.__version__` above).

**Additional context**
Strangely, it seems to affect only the first iteration of PIConGPU output. Compression was not used.