ornladios / ADIOS

The old ADIOS 1.x code repository. See ADIOS2 for the new repo.
https://csmd.ornl.gov/adios

Mix of compressed + uncompressed data causing read error #170

Closed n01r closed 6 years ago

n01r commented 6 years ago

I am currently running large simulation jobs on 2400 GPU nodes of a SLURM-managed cluster. I noticed that sometimes I could not read my particle data with the ADIOS NumPy wrappers. This only happens in the simulations that contain the most particles.

I am using ADIOS 1.13.0 and blosc with zstd for compression. The full parameters I use for production are: threshold=2048,shuffle=bit,lvl=1,threads=10,compressor=zstd.
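For reference, the parameter string above is a plain comma-separated key=value list; a minimal sketch of parsing it (generic Python, not part of ADIOS — ADIOS parses this internally):

```python
# Parse the blosc parameter string quoted above into a dict.
params = "threshold=2048,shuffle=bit,lvl=1,threads=10,compressor=zstd"
config = dict(kv.split("=", 1) for kv in params.split(","))
print(config["compressor"])      # zstd
print(int(config["threshold"]))  # 2048
```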

import adios as ad

path = "simData_16600.bp"
f = ad.file(path)

w = f["/data/16600/particles/H_all/weighting"]
w_unit = w.attrs['unitSI'].value
print(w.dims[0])
[Out]: 147455991

print(w[0])
[Out]: 3.479555831394943e-36
print(w[100])
[Out]: 2.3418408851396162e-38
print(w[1000])
[Out]: 0.0

and this error:

ERROR: Error: Variable (id=36) out of bound 1(the data in dimension 1 to read is 1 
elements from index 1000 but the actual data is [0,828])

@psychocoderHPC and I found out that this happens for datasets that contain a mix of compressed and uncompressed particle data. Unfortunately, the datasets where this happens are quite large (~1-2 TB), and I do not know how to reproduce the problem at a smaller scale. Therefore, I uploaded the output of bpdump and the verbose output of bpmeta (links expire on July 14, 2018).

bpdump - password: bpdump https://owncloud.hzdr.de/index.php/s/MawIFsaGTkSI9R1

bpmeta - password: bpmeta https://owncloud.hzdr.de/index.php/s/slZ416U8VAKLq3e

The error suggests that a data chunk size in data/16600/particles/H_all/weighting is being misinterpreted.

Lines 98556f in the bpdump output show:

Offset(985682125)   Payload Offset(985682241)   File Index(2)   Time Index(1)   Dims (l:g:o): (0:147455991:0)
Offset(1634343392)  Payload Offset(1634343561)  File Index(2)   Time Index(1)   Dims (l:g:o): (829) Transform type: blosc compression (ID = 11) Pre-transform datatype: real    Pre-transform dims (l:g:o): (184320:147455991:0)

Here the index switches from an uncompressed block to a compressed one. We suppose that the (829) is interpreted as an array size, which is why the error message says that

the actual data is [0,828]
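In other words, the compressed block's on-disk byte count is being read as the logical array dimension. A minimal sketch of the resulting bound check, with the numbers taken from the error above:

```python
# Numbers from the error above: the transformed block stores 184320 real
# values compressed into 829 bytes. With the transform info lost, the
# 829-byte payload size is taken as the element count, so only indices
# 0..828 pass the bound check.
compressed_bytes = 829

def in_bounds(index, nelems=compressed_bytes):
    # Mimics the reader's range check on the misread dimension.
    return 0 <= index < nelems

print(in_bounds(100))   # True
print(in_bounds(1000))  # False -> "actual data is [0,828]"
```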

psychocoderHPC commented 6 years ago

Note: we are using MPI aggregators and running bpmeta by hand after the simulation.

@n01r: could you please also post (as a download) the output of bpmeta -v -v -v DIR?

@adiosteam: in the bpmeta output it looks like the datasets of the group H_all are handled as non-transformed data, but later in bpdump we can see that it is a mix of transformed and non-transformed data. We wrote all our data as transformed data; it looks like ADIOS creates the transformed and non-transformed groups internally.

pnorbert commented 6 years ago

@psychocoderHPC, please clarify this. So even though you set the compression for this variable on all processes, ADIOS seems to not do that on every process? Thanks Norbert


psychocoderHPC commented 6 years ago

All processes use the same calls to ADIOS and set compression. The only difference is that some MPI ranks write zero elements (some ranks have no particles), but even then ADIOS is called.

My assumption is that, within an ADIOS aggregator group, a few ranks wrote particles and some did not. If the first data block in such a group contains zero data, bpmeta will later interpret it as if none of the ranks in the MPI group used the transformation layer.

Sorry if I used the wrong names; I need to read the file specification first to get an overview of ADIOS's internal metadata and data structures.

n01r commented 6 years ago

@psychocoderHPC Both the bpmeta -vvv and the bpdump output are already in the description as password-protected download links.

Edit: I tried the links myself; they work for you as well, right?

pnorbert commented 6 years ago

@psychocoderHPC Your assumption is correct. ADIOS derives from the very first block that this is a 1D real array without transformation and then it assumes that's the same for all other blocks. In general, dimensionality is expected to be the same across all blocks, and this expectation was extended to the transformation as well.

To fix this problem, we need to filter out the zero blocks when deriving the basic information for a given variable in the read process. We should be able to do this.
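The proposed read-side fix can be sketched as follows (a pure-Python illustration; the dict field names are assumptions, and the real logic lives in ADIOS's C read path):

```python
def derive_var_info(blocks):
    """Derive a variable's type and transform from its block index,
    skipping zero-sized blocks so an empty first block cannot hide
    the transformation used by the non-empty blocks."""
    for b in blocks:
        if b["nelems"] > 0:
            return b["dtype"], b.get("transform")
    # Every block is empty: fall back to the first entry.
    return blocks[0]["dtype"], blocks[0].get("transform")

# A zero-sized, untransformed first block followed by a compressed block,
# as in the index excerpt above:
index = [
    {"nelems": 0, "dtype": "real"},
    {"nelems": 184320, "dtype": "byte", "transform": "blosc"},
]
print(derive_var_info(index))  # ('byte', 'blosc')
```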


pnorbert commented 6 years ago

Quick note: @psychocoderHPC, you already fixed this problem in a different way in issue #161 in February. There you allowed writing zero blocks with transformation, so this problem cannot occur. For that, one has to run the simulation (the data writer) with ADIOS master instead of 1.13.

It will take a while to find the place in the adios code where the fix for reading data files produced by adios 1.13 can be made.


psychocoderHPC commented 6 years ago

@n01r forgot to write that he is using the latest ADIOS with a cherry-pick of my fix #161. I also tried to find the place in the code where the block dimensions get derived, but I can't find it. Could it be that the data produced by bpmeta is also wrong? It looks to me like ADIOS stores, somewhere in the root of the tree that describes a variable, whether a group contains transformed data, but from the verbose output the dataset H_all appears to be marked as not transformed.

n01r commented 6 years ago

That's right, I'm using the patched version. Sorry, forgot to add that.

pnorbert commented 6 years ago

I believe that when the data was written, the simulation/adios did not write transformed zero-length blocks but plain uncompressed zero blocks (zero blocks written as REAL arrays with no transformation). Either the simulation did not turn on compression for the zero blocks, or the adios version you were using somehow did not.

In the bpdump output, from line 98514

Var (Group) [ID]: //data/16600/particles/H_all/weighting (data) [237]
        Datatype: real
        Vars Characteristics: 2400
        Offset(482590130)       Payload Offset(482590246)       File Index(0)   Time Index(1)   Dims (l:g:o): (0:147455991:0)

The Datatype: real line indicates (as does the first entry in the characteristics list) that the first block was written as non-transformed.

bpmeta has nothing to do with this. It does not interpret the index; it simply merges the indices from the subfiles (you will find the same info in subfile 0). The type of a variable is stored in the index as a single entry (because the blocks of one variable cannot have multiple types), and that entry is derived at write time from the type of the first block. In your case the first block is a zero block, so the variable's type is recorded as real instead of byte (the type of a blosc-compressed block).

Consequently, at read time, each block is interpreted as a real array (4 bytes per element) without transformation. Since every transformed block defines only its local size (829), their offsets default to zero, and the entire variable is believed to have data only in the [0,828] range (many, many blocks for that range, and none for the rest, [829..147455991]).

Can you double-check that, for the data set in question, the simulation source actually turns on the transformation? I am not aware that any version of adios itself ever filtered out zero-length blocks from the transformations. The current master does not do that.

I added test/suite/programs/zerolength.c to master, which can be built with make check and which can re-create this problem if line 96 is un-commented to NOT transform the block on rank 0 (which also happens to be the zero-length block in this test). Running on 4 processes, it will aggregate into two subfiles, the first one exhibiting this mix of types and transform/no-transform info. "bpls zerolength.bp -d t" will show an error similar to yours. But if the transformation is set on all processes, the test works fine.


psychocoderHPC commented 6 years ago

@pnorbert I tested your little example. The example does not crash with the transformation disabled for rank zero, but the values read by bpls are wrong.

I checked our code again and saw that we disable the transformation on ranks that write zero data. The reason is that not all transforms can handle zero-sized data (e.g. zfp with adios_set_transform (var_t, "zfp:precision=16");). If I remember correctly, zero-sized transforms were not supported in adios in the past; that is why we disabled transformations for zero-sized data.

This means it would work to activate the transforms for zero-sized data with blosc and zlib, but not with other transforms, so it would not be generic.

Is it possible for the adios reader to check each subgroup for its properties (transformation, ...) instead of assuming that all subgroups use the same transform?

psychocoderHPC commented 6 years ago

@pnorbert Is it possible to fix this issue by reading the variable definitions in adios for each subgroup, or is there something conceptual that prevents it, like streaming, ...?

pnorbert commented 6 years ago

Sorry, I don't see any way to fix reading your existing data. I thought of filtering out the zero blocks at read time, but it would make a big mess inside the (spaghetti) code.

In general, we need all transformations to work with zero-sized blocks, so that users don't need to worry about setting a transformation per block. That was never a supported step anyway; transformations can be turned on/off for each variable, but not per block for one variable.


ax3l commented 6 years ago

Ok, so we will compress all blocks on our side, including zero-sized ones, and we need to add a test for each transform in ADIOS to check whether zero-sized writes work (and fix the failing ones, such as zfp, sz?, ...).

n01r commented 6 years ago

It's unfortunate that some of the particle data cannot be recovered, but proceeding as @psychocoderHPC and @ax3l suggested will at least make sure that future data can be read.

pnorbert commented 6 years ago

I attacked this problem by adding an option (-z) to bpmeta, which removes all non-transformed, zero-sized blocks from the index for all transformed variables. Can you please get the latest bpmeta source, copy it into your adios build, and test whether this fixes your data? It will overwrite the global .bp file, so please save that first so that we can compare the before and after in case there are issues.

https://raw.githubusercontent.com/ornladios/ADIOS/master/utils/bpmeta/bpmeta.c


ax3l commented 6 years ago

Ah, excellent, we create the .bp files in post-processing anyway! A nice solution for existing data, thank you so much!

@n01r can you confirm this works?

ax3l commented 6 years ago

Ok, so we will compress all blocks on our side, including zero-sized ones, and we need to add a test for each transform in ADIOS to check whether zero-sized writes work (and fix the failing ones, such as zfp, sz?, ...).

@pnorbert can you please still create test cases for this? I think transforms failing on zero-blocks need fixing...

n01r commented 6 years ago

Hmm, reading still doesn't seem to work. I downloaded a fresh adios-1.13.0, replaced bpmeta.c with the file from the link @pnorbert posted, and built the code as usual. I made a backup of my old simData_16600.bp and ran bpmeta -z simData_16600.bp again (once with and once without verbose output, because the output without is easier to look through).

See the logs here:

bpmeta_non-verbose_16600.log, pw: bpmeta https://owncloud.hzdr.de/index.php/s/rF7d2bMjk1jiJc9
bpmeta_verbose_16600.log, pw: bpmeta https://owncloud.hzdr.de/index.php/s/aYj0rl3zM3q95He

When I try reading the same AdiosVar as before (/data/{}/particles/H_all/weighting), I get the same error.

ERROR: Error: Variable (id=36) out of bound 1(the data in dimension 1 to read is 147455991 elements from index 0 but the actual data is [0,828])

from bpmeta_non-verbose_16600.log

Thread 0: *** Variable /data/16600/particles/H_all/weighting has 19 ZERO, NON-TRANSFORMED blocks and 0 transformed blocks among 19 blocks ***
Thread 0: *** Variable /data/16600/particles/H_all/weighting has 19 ZERO, NON-TRANSFORMED blocks and 0 transformed blocks among 19 blocks ***
Thread 0: *** Variable /data/16600/particles/H_all/weighting has 2 ZERO, NON-TRANSFORMED blocks and 17 transformed blocks among 19 blocks ***
Thread 0: *** Variable /data/16600/particles/H_all/weighting has 15 ZERO, NON-TRANSFORMED blocks and 4 transformed blocks among 19 blocks ***
Thread 0: *** Variable /data/16600/particles/H_all/weighting has 19 ZERO, NON-TRANSFORMED blocks and 0 transformed blocks among 19 blocks ***
Thread 0: *** Variable /data/16600/particles/H_all/weighting has 19 ZERO, NON-TRANSFORMED blocks and 0 transformed blocks among 19 blocks ***
Thread 0: *** Variable /data/16600/particles/H_all/weighting has 19 ZERO, NON-TRANSFORMED blocks and 0 transformed blocks among 19 blocks ***
pnorbert commented 6 years ago

I suspect I made a mistake. I only remove the zero blocks if there are transformed blocks present (i.e., if I think the variable has a transformation). But you have whole subfiles where all blocks are zero blocks. I will remove them in any case. To confirm this suspicion, can you make a bpdump again? Thanks.
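The corrected filtering can be sketched like this (illustrative Python with assumed field names; bpmeta itself is C code): zero-sized, non-transformed entries are dropped unconditionally, not only when transformed blocks coexist in the same variable.

```python
def drop_zero_blocks(blocks):
    """Drop zero-sized, non-transformed index entries unconditionally,
    regardless of whether transformed blocks are present in the variable."""
    return [b for b in blocks
            if not (b["nelems"] == 0 and b.get("transform") is None)]

# A mixed variable keeps its compressed block; an all-zero variable
# loses every entry.
mixed = [{"nelems": 0}, {"nelems": 829, "transform": "blosc"}]
all_zero = [{"nelems": 0}, {"nelems": 0}]
print(len(drop_zero_blocks(mixed)))     # 1
print(len(drop_zero_blocks(all_zero)))  # 0
```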


n01r commented 6 years ago

Okay, I made a dump :baby: (sorry, that was inappropriate :D )

bpdump_16600.log, pw: bpdump https://owncloud.hzdr.de/index.php/s/EVO0JYo9RNWXB0w

pnorbert commented 6 years ago

And I analyzed your dump and found the disease... I updated bpmeta in the master branch; now it will remove all zero blocks. After running it, please check in the bpdump output that for the variable

Var (Group) [ID]: //data/16600/particles/H_all/weighting (data) [237]
        Datatype: real
        Vars Characteristics: 2020
        Offset(482590130)       Payload Offset(482590246)       File Index(0)   Time Index(1)   Dims (l:g:o): (0:147455991:0)

the type becomes "byte" and all entries have transformation info.


n01r commented 6 years ago

It seems to work now! No more error messages, and no values in data/16600/particles/H_all/weighting are read as zero anymore! :)

Also the bpdump log now shows

Var (Group) [ID]: //data/16600/particles/H_all/weighting (data) [237]
        Datatype: byte
        Vars Characteristics: 800
        Offset(1634343392)      Payload Offset(1634343561)      File Index(2)   Time Index(1)   Dims (l:g:o): (829)     Transform type: blosc compression (ID = 11)     Pre-transform datatype: real    Pre-transform dims (l:g:o): (184320:147455991:0)
        Offset(2273766762)      Payload Offset(2273766931)      File Index(2)   Time Index(1)   Dims (l:g:o): (829)     Transform type: blosc compression (ID = 11)     Pre-transform datatype: real    Pre-transform dims (l:g:o): (184320:147455991:184320)
....

bpdump_2_16600.log, pw: bpdump https://owncloud.hzdr.de/index.php/s/CMaOUtXpX9S8A3z

Thanks a bunch @pnorbert!

pnorbert commented 6 years ago

Finally! Let us know if there is more trouble. I am closing this issue and will open new ones for each compression that has a problem with zero blocks.

pnorbert commented 6 years ago

Note: Zero blocks now should work with all transformations.

ax3l commented 6 years ago

thank you so much!