ornladios / ADIOS

The old ADIOS 1.x code repository. See ADIOS2 for the new repository.
https://csmd.ornl.gov/adios

Segmentation fault (core dumped) #124

Closed physixfan closed 7 years ago

physixfan commented 7 years ago

Hi,

I am really frustrated by the Segmentation fault (core dumped) error. It shows up when I try to read the .bp file produced by some large runs. The file itself is not very large, just ~4GB, and I can successfully read some larger files, so I don't think it is a memory limitation. This is the code that I use (in Python):

[xfan@gr-fe1 target_shape_46]$ python
Python 2.7.12 |Anaconda 4.1.1 (64-bit)| (default, Jul  2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import adios as ad
>>> binary_file = ad.file("record.bp")
Segmentation fault (core dumped)
pnorbert commented 7 years ago

Hi, can you bpls the content of the file? How about adios 1.11?

jychoi-hpc commented 7 years ago

Hi. If "bpls" works OK with your file, can you try to use gdb and share the results?

Save your command as a script (e.g., test_adios.py) and run the following command:

$ gdb --args python test_adios.py

In gdb, type "run":

(gdb) run

If you can share the backtrace, it will help identify the location of the crash.
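For example, once the segfault is reproduced under gdb, the bt command prints the backtrace; a minimal session would look roughly like this (the actual frames will of course differ):

(gdb) run
[... the script runs until it receives SIGSEGV ...]
(gdb) bt
[... stack frames to copy into this issue ...]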

Also, which version of Adios and Adios python version did you use? The following command will give the Adios python version:

import adios as ad
ad.version

Thanks, Jong

physixfan commented 7 years ago

bpls:

[xfan@gr-fe1 target_shape_46]$ bpls -latv record.bp
Segmentation fault (core dumped)

The bp file is produced by a code which uses adios 1.3.1, and my python code is using version 1.11.0. Do you think the version difference could have an effect on this as well? It's difficult to recompile my advisor's code with adios 1.11.0, while adios 1.3.1 does not support python...

pnorbert commented 7 years ago

We have some old record.bp and pixie3d.bp from 2010/11 and bpls works fine with them. Can you guys somehow upload your record.bp to ORNL so we can see what has changed?

physixfan commented 7 years ago

Well, not all the record.bp files fail, only some of them, and I still can't figure out which ones. I am working with a LANL server called Grizzly; it is a new machine, so maybe that's the reason. I don't know how to upload the file to ORNL, so I am uploading it to Dropbox. It's just 4GB, so I don't think it will take long.

pnorbert commented 7 years ago

Is that a big endian or little endian machine?

physixfan commented 7 years ago

I would say it's big. I usually use 4096 processors to run the code.

pnorbert commented 7 years ago

I meant the byte order of the machine, big endian or little endian. What is the CPU?

Anyway, if some files created on Grizzly work and some don't, then this does not matter; it would only matter if all files from Grizzly caused segfaults.
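For reference, the byte order can be checked directly from the Python prompt on that machine:

>>> import sys
>>> sys.byteorder
'little'

(x86_64 CPUs report 'little'; big-endian CPUs report 'big'.)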

physixfan commented 7 years ago

https://www.dropbox.com/s/yfbzhden7hlj98a/record.bp?dl=0 This is the Dropbox link to the record.bp file. Thanks for your help here!

pnorbert commented 7 years ago

Can you please check whether you give a big enough size in the adios_group_size() call. It should be >= the number of bytes you actually write between adios_open() ... adios_close() in a process.

If that's the case, please tell me what the total_size value is (this is returned by adios_group_size()).
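For reference, the write sequence around that call in the ADIOS 1.x Fortran API looks roughly like this. This is a minimal sketch; comm, nx/ny/nz, and v1 are made-up placeholders, and groupsize must count every byte the process passes to adios_write:

      integer*8 :: handle, groupsize, totalsize
      integer   :: comm, err

      ! payload this process will write, e.g. 8 bytes per
      ! double precision element of its local nx*ny*nz block
      groupsize = 8_8 * nx * ny * nz

      call adios_open (handle, "record", "record.bp", "a", comm, err)
      call adios_group_size (handle, groupsize, totalsize, err)
      call adios_write (handle, "v1", v1, err)   ! one call per variable
      call adios_close (handle, err)

totalsize comes back as groupsize plus the index/metadata overhead ADIOS adds for this process.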

physixfan commented 7 years ago

Thanks. Since reading the original file causes a segmentation fault, I copied and restarted this run, asking for 32 processors. The relevant output is:

 rank=           0  ADIOS group_size=               206164
 rank=           0  ADIOS total size=               211574
  ADIOS file write...
r0 offset=(  0,  0,  0)
r0 size=( 33, 65,  3)
r0 i=  0: 32,  0: 64,  0:  2
  ADIOS file close...
 rank=           1  ADIOS group_size=               199736
 rank=           1  ADIOS total size=               205146
r1 offset=( 33,  0,  0)
r1 size=( 32, 65,  3)
r1 i=  1: 32,  0: 64,  0:  2
 rank=           2  ADIOS group_size=               199736
 rank=           2  ADIOS total size=               205146
r2 offset=( 65,  0,  0)
r2 size=( 32, 65,  3)
r2 i=  1: 32,  0: 64,  0:  2

(and some repetitive outputs)

And this is the relevant code in pixie2d:

      call adios_group_size(handle,groupsize,totalsize,err)

      if (err /= 0) then
        write (*,*) 'Problem in writeRecordFile'
        write (*,*) 'rank=',my_rank,'  ERROR in "adios_group_size"'
        stop
      endif

      if (adios_debug) then
        write (*,*) 'rank=',my_rank,' ADIOS group_size=',groupsize
        write (*,*) 'rank=',my_rank,' ADIOS total size=',totalsize
      endif
pnorbert commented 7 years ago

The index in the file you gave me is corrupt. I could figure out that you had 372736 blocks of data, that's 4096 processors times 91 output steps. But the pg_index that just enumerates those blocks becomes corrupt at the 237th block already.

I don't know whether the data itself is corrupt. 1.11's bprecover utility is probably not working exactly right on this old-version BP file, so I cannot rely on its report.

So the question is why your output file becomes corrupted on this machine. Can you try using the POSIX method instead of MPI to produce one file per process? I don't even remember how that works in adios 1.3.1...
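In case it helps, in ADIOS 1.x the I/O method is selected in the XML configuration file rather than in code, so the switch should not require changing the source. A sketch, assuming your group is named "record"; note that old releases such as 1.3.1 used a <transport> element where newer ones use <method>:

<adios-config host-language="Fortran">
  <adios-group name="record">
    <!-- var declarations of the record group go here -->
  </adios-group>
  <!-- MPI aggregates into one shared file; POSIX writes one file per process -->
  <method group="record" method="POSIX"/>
  <buffer size-MB="40" allocate-time="now"/>
</adios-config>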

BTW, could you read the data from your last rerun with 32 processes?

pnorbert commented 7 years ago

Can you send me the output of bprecover run on your 32-process file? I am curious what numbers we see there.
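Assuming the 1.11 utilities are in your PATH and that bprecover takes the BP file name as its argument (run it without arguments to see the exact usage), something like:

$ bprecover record.bp

Since bprecover rewrites the file's index and truncates the file when recovering it, it is safest to run it on a copy.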

physixfan commented 7 years ago

I did not save the bp file from the 32-process run. But I made a new run with the same input file, on 4096 processors for 30 minutes. There's not much data there yet, and the file can be read correctly. Here's the link: https://www.dropbox.com/s/z21y5s9worh6t56/record2.bp?dl=0

pnorbert commented 7 years ago

Although I can still read data with bpls from this file, the 1.3.1 version of bpdump already dies on it, and the 1.11.0 version of bprecover reports the same strange sizes I saw with the big one.

So what is the adios_group_size() report on this run? Is the total size 27050 for all processes, or is it 8054 for the first process and 7386 for the rest?

physixfan commented 7 years ago

The latter:

r* offset=( 49,  0,  0)
r* size=(  4,  5,  3)
r* i=  1:  4,  0:  4,  0:  2
 rank=           0  ADIOS group_size=                 2644
 rank=           0  ADIOS total size=                 8054
  ADIOS file write...
r0 offset=(  0,  0,  0)
r0 size=(  5,  5,  3)
r0 i=  0:  4,  0:  4,  0:  2
  ADIOS file close...
 rank=           1  ADIOS group_size=                 1976
 rank=           1  ADIOS total size=                 7386
r1 offset=(  5,  0,  0)
r1 size=(  4,  5,  3)
r1 i=  1:  4,  0:  4,  0:  2
rank= 1 ihip=5 ilom=0 jhip=5 jlom=0 khip=2 klom=0
 rank=           1  ADIOS open: handle=             59689568
 rank=           3  ADIOS group_size=                 1976
 rank=           3  ADIOS total size=                 7386
r3 offset=( 13,  0,  0)
r3 size=(  4,  5,  3)
r3 i=  1:  4,  0:  4,  0:  2
 rank=           2  ADIOS group_size=                 1976
 rank=           2  ADIOS total size=                 7386
physixfan commented 7 years ago

So what are the implications of all this? How can I fix the problem?

pnorbert commented 7 years ago

I think the only implication so far is that bprecover from 1.11 is not working at all on this old file (it was not intended to). So we have no idea what goes wrong during the writes, and therefore we/you cannot fix it.

We suggested to Luis that he come to ORNL and work with us to add adios 1.11 to Pixie3D.

Do you see these segfaults depending on the size of the run (maybe the per-process variables are too small and cause some problem in the index, though of course they shouldn't), or depending on the number of timesteps?

Maybe you could do one more run using the POSIX method instead of the MPI method (see the XML sketch above) and check whether you can read the global data, or at least the individual files, or whether you get the same segfaults.

pnorbert commented 7 years ago

BTW, I could not even compile 1.3.1 with recent compilers on my Mac without fixing a line here and there. As far as I know, Luis has been using a modified version of 1.3.1 in which he fixed some bugs himself, but I don't know the details of those bugs. I wonder how that version works on new machines with new compilers.

physixfan commented 7 years ago

The segmentation fault depends on the number of time steps, as far as I have observed. But since you mentioned that the bp file is broken even for the small run, I am not sure anymore... I'll contact Luis about this issue and see whether he can update the version of ADIOS.

physixfan commented 7 years ago

Hi!

Luis has re-compiled pixie2d with ADIOS version 1.10. However, I still see the segmentation fault error. The following link is a record.bp file produced by the new code; please check whether you can find what causes these errors. Thanks!

https://www.dropbox.com/s/qpk669c1lkh4f6j/record_Mar6.bp?dl=0

physixfan commented 7 years ago

Hi, do you have time to take a look? Thanks!

pnorbert commented 7 years ago

I was traveling, but I have now looked at it. bprecover took a long time to process this file, but it found that in the last step the blocks written by the processes (PGs, or Process Groups) were corrupt in the file. So something went wrong at that output step. Did you get any adios or system error messages in your application logs?

116 steps could be recovered, although it took a very long time to do that. The metadata is 460MB at that point, which is far from being a problem for adios, unless some compute nodes ran out of memory. Did your application die at this output step? What was the reason?

Look for a Process Group (PG) at offset 1386173555
PG reported size: 2892
Check if it looks like a PG...
  Fortran flag char should be 'y' or 'n': 'y'
  Group name length expected to be less than 64 characters: 6
  Group name: "record"
PG 475280 found at offset 1386173555
  Group Name: record
  Host Language Fortran?: Y
  Coordination Var Member ID: 0
  Time Name: tidx
  Time: 117
  Methods used in output: 1
    Method ID: 0
    Method Parameters:
  Vars Count: 14
    Var Name (ID): nvar (5)
    Var Name (ID): nxd+2 (7)
    Var Name (ID): nyd+2 (9)
    Var Name (ID): nzd+2 (11)
    Var Name (ID): xoffset (13)
    Var Name (ID): yoffset (14)
    Var Name (ID): zoffset (15)
    Var Name (ID): xsize (16)
    Var Name (ID): ysize (17)
    Var Name (ID): zsize (18)
    Var Name (ID): v1 (21)
    Var Name (ID): v2 (22)
    Var Name (ID): v3 (23)
    Var Name (ID): v4 (24)
  Attributes Count: 0
  Actual size of group by processing: 2892 bytes

Look for a Process Group (PG) at offset 1386176447
PG reported size: 1258272568115204
=== Offset + PG reported size >> file size. This is not a (good) PG.

Found 475280 PGs to be processable
Index metadata size is 460579051 bytes
Ready to write index to file offset 1386176447. Size is 460579051 bytes
Wrote index to file offset 1386176447. Size is 460579051 bytes
Truncate file to size 1846755498

physixfan commented 7 years ago

Thank you very much! Luis has solved this issue.