ornladios / ADIOS

The old ADIOS 1.x code repository. Look for ADIOS2 for new repo
https://csmd.ornl.gov/adios

NumPy wrapper: sliced reading from file crashes #161

Closed n01r closed 6 years ago

n01r commented 6 years ago

Hey there, I get a segmentation fault when reading a slice from a 1D AdiosVar array if the slice contains a certain index. The index differs between files (I checked several). Each value can be read on its own, but reading a slice that spans the index crashes.

The ADIOS 1.13.0 I'm using for post-processing has been built with the same flags as the one I used for the creation of the data - just without mpi. I built the numpy wrapper from source using python setup.py. I am using blosc with zstd for compression, full parameters: threshold=2048,shuffle=bit,lvl=1,threads=10,compressor=zstd.

$ adios_config -s
DIR=/users/<USER>/lib/adios-1.13.0_nompi
CFLAGS=-I/users/<USER>/lib/adios-1.13.0_nompi/include -D_NOMPI -DZLIB -I/users/<USER>/lib/zlib-1.2.11/include -DBLOSC -I/users/<USER>/lib/blosc-1.12.1/include -I/users/<USER>/lib/blosc-1.12.1/include
LDFLAGS=-L/users/<USER>/lib/adios-1.13.0_nompi/lib -ladios_nompi -L/users/<USER>/lib/zlib-1.2.11/lib64 -L/users/<USER>/lib/blosc-1.12.1/lib -libverbs -lz -lblosc
Available write methods (in XML <method> element or in adios_select_method()):
    "POSIX"
Available read methods (constants after #include "adios_read.h"):
    ADIOS_READ_METHOD_BP (=0)
Available data transformation methods (in XML transform tags in <var> elements):
    "none"  : No data transform
    "identity"  : Identity transform
    "zlib"  : zlib compression
    "zfp"   : zfp compression
    "blosc" : blosc compression
Available query methods (in adios_query_set_method()):
    ADIOS_QUERY_METHOD_MINMAX (=0)

This is what I observe:

In [1]: import numpy as np
In [2]: import adios as ad
In [3]: path = "014_0060gpus2DCopper30nmLeadingEdge1E-3/simOutput/bp/simData_87040.bp"
In [4]: f = ad.File(path)
In [5]: px = f['/data/87040/particles/H_all/momentum/x']
In [6]: px[3740]
Out[6]: 0.04298079386353493
In [7]: px[3741]
Out[7]: 0.08291889727115631
In [8]: px[3742]
Out[8]: -0.13517844676971436
In [9]: px[3740:3741]
Out[9]: array([ 0.04298079], dtype=float32)
In [10]: px[3741:3742]
Out[10]: array([ 0.0829189], dtype=float32)
In [11]: px[3740:3742]
Segmentation fault (core dumped)

A run with gdb shows

(gdb) run test.py 
Starting program: /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/bin/python test.py
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-61.3.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00002aaab0977f54 in adios_transform_blosc_pg_reqgroup_completed () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
Missing separate debuginfos, use: zypper install libibverbs1-debuginfo-1.2.0-17.1.x86_64 libnl3-200-debuginfo-3.2.23-2.21.x86_64
(gdb) backtrace
#0  0x00002aaab0977f54 in adios_transform_blosc_pg_reqgroup_completed ()
   from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#1  0x00002aaab09732b7 in adios_transform_pg_reqgroup_completed () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#2  0x00002aaab0972aba in adios_transform_process_all_reads () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#3  0x00002aaab0944b24 in common_read_perform_reads () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#4  0x00002aaab0937157 in adios_perform_reads () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#5  0x00002aaab08b321d in __pyx_f_5adios_3var_read (__pyx_v_self=0x2aaab0bf9688, __pyx_skip_dispatch=<optimized out>, __pyx_optional_args=<optimized out>) at adios.cpp:24203
#6  0x00002aaab089ac1e in __pyx_pf_5adios_3var_12read (__pyx_v_step_scalar=<optimized out>, __pyx_v_fill=<optimized out>, __pyx_v_nsteps=<optimized out>, __pyx_v_from_steps=<optimized out>, 
    __pyx_v_scalar=<optimized out>, __pyx_v_count=<optimized out>, __pyx_v_offset=<optimized out>, __pyx_v_self=0x2aaab0bf9688) at adios.cpp:24495
#7  __pyx_pw_5adios_3var_13read (__pyx_v_self=0x2aaab0bf9688, __pyx_args=<optimized out>, __pyx_kwds=0x2aaaabc60288) at adios.cpp:24461
#8  0x0000555555660364 in _PyCFunction_FastCallDict ()
#9  0x000055555568ef30 in _PyCFunction_FastCallKeywords ()
#10 0x00005555556f2ebc in call_function ()
#11 0x00005555557153e7 in _PyEval_EvalFrameDefault ()
#12 0x00005555556ed8d9 in PyEval_EvalCodeEx ()
#13 0x00005555556ee67c in PyEval_EvalCode ()
#14 0x0000555555768ce4 in run_mod ()
#15 0x00005555557690e1 in PyRun_FileExFlags ()
#16 0x00005555557692e4 in PyRun_SimpleFileExFlags ()
#17 0x000055555576cdaf in Py_Main ()
#18 0x00005555556338be in main ()

What could I specifically look into?

ax3l commented 6 years ago

The data set that we read is a 1D array written by several process groups. At the offset of concern, a process group wrote zero entries. This is an issue we encountered (& fixed) before, e.g. with zlib transforms.

n01r commented 6 years ago

The numpy wrapper version is:

import adios as ad
ad.__version__
'1.13.0'

The blockinfo from the file shows

In[9]: px.blockinfo
Out[9]:
[[AdiosBlockinfo (process_id=0, time_index=1, start=(0,), count=(19,)),
  AdiosBlockinfo (process_id=1, time_index=1, start=(19,), count=(45,)),
  AdiosBlockinfo (process_id=2, time_index=1, start=(64,), count=(1477,)),
  AdiosBlockinfo (process_id=3, time_index=1, start=(1541,), count=(61,)),
  AdiosBlockinfo (process_id=4, time_index=1, start=(1602,), count=(82,)),
  AdiosBlockinfo (process_id=5, time_index=1, start=(1684,), count=(1154,)),
  AdiosBlockinfo (process_id=6, time_index=1, start=(2838,), count=(22,)),
  AdiosBlockinfo (process_id=7, time_index=1, start=(2860,), count=(46,)),
  AdiosBlockinfo (process_id=8, time_index=1, start=(2906,), count=(570,)),
  AdiosBlockinfo (process_id=9, time_index=1, start=(3476,), count=(18,)),
  AdiosBlockinfo (process_id=10, time_index=1, start=(3494,), count=(18,)),
  AdiosBlockinfo (process_id=11, time_index=1, start=(3512,), count=(198,)),
  AdiosBlockinfo (process_id=12, time_index=1, start=(3710,), count=(16,)),
  AdiosBlockinfo (process_id=13, time_index=1, start=(3726,), count=(4,)),
  AdiosBlockinfo (process_id=14, time_index=1, start=(3730,), count=(2,)),
  AdiosBlockinfo (process_id=15, time_index=1, start=(3732,), count=(4,)),
  AdiosBlockinfo (process_id=16, time_index=1, start=(3736,), count=(5,)),
  AdiosBlockinfo (process_id=17, time_index=1, start=(3741,), count=(0,)),
  AdiosBlockinfo (process_id=18, time_index=1, start=(3741,), count=(2,)),
  AdiosBlockinfo (process_id=19, time_index=1, start=(3743,), count=(1,)),
  AdiosBlockinfo (process_id=20, time_index=1, start=(3744,), count=(1,)),
  AdiosBlockinfo (process_id=21, time_index=1, start=(3745,), count=(0,)),
  AdiosBlockinfo (process_id=22, time_index=1, start=(3745,), count=(2,)),
  AdiosBlockinfo (process_id=23, time_index=1, start=(3747,), count=(2,)),
  AdiosBlockinfo (process_id=24, time_index=1, start=(3749,), count=(1,)),
  AdiosBlockinfo (process_id=25, time_index=1, start=(3750,), count=(1,)),
  AdiosBlockinfo (process_id=26, time_index=1, start=(3751,), count=(2,)),
  AdiosBlockinfo (process_id=27, time_index=1, start=(3753,), count=(0,)),
  AdiosBlockinfo (process_id=28, time_index=1, start=(3753,), count=(0,)),
  AdiosBlockinfo (process_id=29, time_index=1, start=(3753,), count=(2,)),
  AdiosBlockinfo (process_id=30, time_index=1, start=(3755,), count=(0,)),
  AdiosBlockinfo (process_id=31, time_index=1, start=(3755,), count=(1,)),
  AdiosBlockinfo (process_id=32, time_index=1, start=(3756,), count=(1,)),
  AdiosBlockinfo (process_id=33, time_index=1, start=(3757,), count=(0,)),
  AdiosBlockinfo (process_id=34, time_index=1, start=(3757,), count=(2,)),
  AdiosBlockinfo (process_id=35, time_index=1, start=(3759,), count=(1,)),
  AdiosBlockinfo (process_id=36, time_index=1, start=(3760,), count=(2,)),
  AdiosBlockinfo (process_id=37, time_index=1, start=(3762,), count=(6,)),
  AdiosBlockinfo (process_id=38, time_index=1, start=(3768,), count=(3,)),
  AdiosBlockinfo (process_id=39, time_index=1, start=(3771,), count=(3,)),
  AdiosBlockinfo (process_id=40, time_index=1, start=(3774,), count=(3,)),
  AdiosBlockinfo (process_id=41, time_index=1, start=(3777,), count=(0,)),
  AdiosBlockinfo (process_id=42, time_index=1, start=(3777,), count=(7,)),
  AdiosBlockinfo (process_id=43, time_index=1, start=(3784,), count=(3,)),
  AdiosBlockinfo (process_id=44, time_index=1, start=(3787,), count=(0,)),
  AdiosBlockinfo (process_id=45, time_index=1, start=(3787,), count=(10,)),
  AdiosBlockinfo (process_id=46, time_index=1, start=(3797,), count=(15,)),
  AdiosBlockinfo (process_id=47, time_index=1, start=(3812,), count=(0,)),
  AdiosBlockinfo (process_id=48, time_index=1, start=(3812,), count=(7,)),
  AdiosBlockinfo (process_id=49, time_index=1, start=(3819,), count=(21,)),
  AdiosBlockinfo (process_id=50, time_index=1, start=(3840,), count=(171,)),
  AdiosBlockinfo (process_id=51, time_index=1, start=(4011,), count=(7,)),
  AdiosBlockinfo (process_id=52, time_index=1, start=(4018,), count=(50,)),
  AdiosBlockinfo (process_id=53, time_index=1, start=(4068,), count=(593,)),
  AdiosBlockinfo (process_id=54, time_index=1, start=(4661,), count=(34,)),
  AdiosBlockinfo (process_id=55, time_index=1, start=(4695,), count=(48,)),
  AdiosBlockinfo (process_id=56, time_index=1, start=(4743,), count=(1264,)),
  AdiosBlockinfo (process_id=57, time_index=1, start=(6007,), count=(16,)),
  AdiosBlockinfo (process_id=58, time_index=1, start=(6023,), count=(28,)),
  AdiosBlockinfo (process_id=59, time_index=1, start=(6051,), count=(1537,))]]
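
To see why only `px[3740:3742]` crashes, it can help to map a slice onto the blocks listed above. The following is a minimal sketch in plain Python, not part of the ADIOS wrapper: the helper name `blocks_touched` and the "strictly inside" rule for empty blocks are assumptions chosen to match the three reads shown in this thread, not a description of the ADIOS internals.

```python
from typing import List, Tuple

# (process_id, start, count) copied from the px.blockinfo output for blocks 15-19.
BLOCKS: List[Tuple[int, int, int]] = [
    (15, 3732, 4),
    (16, 3736, 5),
    (17, 3741, 0),  # process group that wrote zero entries
    (18, 3741, 2),
    (19, 3743, 1),
]


def blocks_touched(blocks, lo, hi):
    """Classify blocks against the half-open slice [lo, hi).

    Returns (overlapping, empty_inside): the process ids of non-empty
    blocks that overlap the slice, and of zero-length blocks whose start
    lies strictly between lo and hi (hypothetical rule, see lead-in).
    """
    overlapping, empty_inside = [], []
    for pid, start, count in blocks:
        if count == 0:
            if lo < start < hi:
                empty_inside.append(pid)
        elif start < hi and start + count > lo:
            overlapping.append(pid)
    return overlapping, empty_inside


for lo, hi in [(3740, 3741), (3741, 3742), (3740, 3742)]:
    print((lo, hi), blocks_touched(BLOCKS, lo, hi))
```

Under this rule, only the slice `3740:3742` spans two non-empty blocks with the empty block 17 in between, which is the one read that segfaults.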

pnorbert commented 6 years ago

I am looking into this.

pnorbert commented 6 years ago

First question: I cannot even write a zero-length block with zlib transformation into the output because the write segfaults. How do you produce the file? Do you turn off zlib for the zero blocks?

ax3l commented 6 years ago

Hi @pnorbert,

@psychocoderHPC just found the root of the issue and will provide a fix in a few minutes. Affects about half of the transforms: blosc, zlib, bzip2, lz4.

Writing zero-length blocks with transformations has been possible for a long time (I think we fixed that together in 1.10 or so) and is an important use case for unstructured, domain-decomposed data. We were writing with blosc (not zlib), where we skip compression on zero-size input in the write transform. Maybe the zlib transform has a bug if it does not do the same, but I seem to remember it worked for us in the past.

Or maybe it's just a misunderstanding of what we do: our overall variable is not zero-sized, it's just individual process groups that contribute zero in parallel writes.

It's just a missing meta-data check on read that is causing the crash right now: see #162
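
The guard pattern described here (skip the transform for zero-size input on write, and check the metadata before decompressing on read) can be illustrated with a small sketch. This is not the ADIOS C code or the fix in #162; it uses Python's standard `zlib` module, and `write_block` / `read_block` are hypothetical names.

```python
import zlib


def write_block(raw: bytes):
    """Return (payload, compressed_flag); skip compression for empty input."""
    if len(raw) == 0:
        return b"", False
    return zlib.compress(raw, 1), True


def read_block(payload: bytes, compressed: bool, raw_len: int) -> bytes:
    """Restore a block, checking the recorded length before decompressing."""
    if raw_len == 0:
        # Metadata check: an empty block has nothing to decompress.
        return b""
    if not compressed:
        return payload
    out = zlib.decompress(payload)
    if len(out) != raw_len:
        raise ValueError("decompressed size does not match metadata")
    return out


empty_payload, empty_flag = write_block(b"")
print(read_block(empty_payload, empty_flag, 0))
```

Without the `raw_len == 0` check, the reader would hand an empty buffer to the decompressor, which is the kind of case a read-side metadata check is meant to catch.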

psychocoderHPC commented 6 years ago

@pnorbert The zero-length write with zlib is fixed in #165