VDS + SWMR, Reading bad values

davidslac commented 7 years ago

Right now, I have three daq_writers, each contributes to a round robin of a 'fiducials' dataset. There is one master creating a VDS from those There are two readers, each reading from that master.

I'm getting a garbage read, it is always for a event like 301 or 501 or 701. That only seemed to correlate with the flush cycle for the writers - they were flushing their datasets on an interval of 100, but then I tried 71 and got the check violation around 501. The chuck size for the datasets varies, it was 600, but I've tried 10.

If I put a H5Drefresh in right before all the reads, it runs a lot slower but I don't see the corruption.

My debugging output looks like

1492471235827 ana_reader_master-s0001:0 starting to process 901
1492471235827 ana_reader_master-s0001:1 dset=/cspad/00000/fiducials &m_dims=0x14c2b00 m_dims=1000, space_dims=1000,filespace:  start=901, count=1, stride=1, block=1,
1492471235827 ana_reader_master-s0001:2 memspace:  dims=1,
1492471235827 ana_reader_master-s0001:3 data (before): 0x0
1492471235827 ana_reader_master-s0001:4 data (after): 0x7fffd5c453a8
1492471235827 ana_reader_master-s0001:5 check_event_number failure: cspad/0/fiducials value=140736779801512 != event_number=901

the number is milliseconds since the epoch.

Here's another one where the flush interval is 71, chunk size is 10. I see the problem in both 1.10.1 pre1 debug build, and 1.10.0 production build (below is 1.10.0 prod)

1492471352240 ana_reader_master-s0001:0 starting to process 501
1492471352240 ana_reader_master-s0001:1 dset=/cspad/00000/fiducials &m_dims=0x1155b10 m_dims=639, space_dims=639,filespace:  start=501, count=1, stride=1, block=1,
1492471352240 ana_reader_master-s0001:2 memspace:  dims=1,
1492471352240 ana_reader_master-s0001:3 data (before): 0x0
1492471352240 ana_reader_master-s0001:4 data (after): 0x7ffee863b5d8
1492471352240 ana_reader_master-s0001:5 check_event_number failure: cspad/0/fiducials value=140732797269464 != event_number=501

For that one, we can see previous flushes and the write of this fiducial 501, about 200 milli seconds earlier:

1492471351883 daq_writer-s0000:7 cspad fiducial=501
1492471352043 daq_writer-s0001:0 flush_data: fiducial=639 last_cspad_written:637
1492471352092 daq_writer-s0000:0 flush_data: fiducial=568 last_cspad_written:567

I'm going to have to try to make a small example to reproduce? Maybe a bug in my code? Or maybe Hdf5?

davidslac commented 7 years ago

I changed the code to just print an error when the check is bad, rather than throw an exception. I'm being more careful with the memory I read into, I even added a guard check when calling the high level H5LDget_dset_dims. I get about 15-26% of these errors when going through the fiducials. I think I will have to make a small example, see if it is a Hdf5 bug?

davidslac commented 7 years ago

I did find a bug in a small example, here is the bug report:

Hi David,

I can reproduce the error, and entered bug HDFFV-10176 for the issue.

Thanks!
-Barbara

========================================================
Barbara Jones, The HDF Group Helpdesk, help@hdfgroup.org
Support Services:  https://www.hdfgroup.org/support/
========================================================

On Wed, 19 Apr 2017, David A. Schneider wrote:

> VERSION:
>   HDF5-1.10.0 patch1
>   HDF5-1.10.1-pre1
>
> USER:
>   David Schneider davidsch@slac.stanford.edu
>
> SYNOPSIS:
>   Using H5Pset_virtual_view with the H5D_VDS_FIRST_MISSING flag does not work through SWMR
>
> MACHINE / OPERATING SYSTEM:
> Linux psanaphi104 3.19.8-1.el7.x86_64 #1 SMP Wed Aug 3 18:10:48 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux
>
> COMPILER:
> g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)
>
>
> DESCRIPTION:
>   I have created a small example with three writers, the first writes 0,3,6,...
>   the second 1,4,7,... and the third 2,5,8,...
>   than a master process makes a VDS dataset that is a round robin, 0,1,2,3,4,...
>   then two readers read from this using SWMR, the first reads even entries, the second odd.
>
>   The readers use the H5P_set_virtual_view to specify first missing, meaning they should
>   never see a missing value, however they do.
>
>
> REPEAT BUG BY:
>
> tar xfv vds_swmr_missing.tar.gz
> cd vds_swmr_check_bug
> # edit driver so the LOC points to a directory on a filesystem that supports SWMR
> ./driver
>
> (the tar is also in the tag: https://github.com/slaclab/lc2-hdf5-110/tree/vds_first_missing_not_working_v2, in the questions/vds_swmr_check_bug directory)
> if you get messages like
>
> ERROR reader:0 read 0xaabbccddeeff != 0x450 for entry 1104 of vds
>
> then you are reading missing values.
>  Other details, I build hdf5 1.10.0 like so, against openmpi 1.10.6:
>
> ./configure --prefix=$PREFIX \
>    --enable-build-mode=production \
>    --with-szlib=$PREFIX \
>    --enable-threadsafe \
>    --enable-unsupported \
>    --enable-cxx \
>    --with-default-api-version=v18 \
>    --enable-parallel
>
> while the 1.10.1 pre1 is currently a debug build, also against openmpi 1.10.6
>
>
> ./configure --prefix=$PREFIX \
>    --with-szlib=$PREFIX \
>    --enable-threadsafe \
>    --enable-unsupported \
>    --enable-cxx \
>    --enable-build-mode=debug \
>    --enable-trace \
>    --enable-parallel
>
>
> Here is some output when I do
>
> ./driver 2>&1 | less
>
> + LOC=/reg/d/ana01/temp/davidsch/lc2/runA
> + '[' '!' -d /reg/d/ana01/temp/davidsch/lc2/runA ']'
> + rm /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_master.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_0.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_1.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_2.h5
> + h5c++ -Wall writer.cpp -o writer
> + h5c++ -Wall master.cpp -o master
> + h5c++ -Wall reader.cpp -o reader
> + ./writer 0 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_0.h5
> + ./writer 1 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_1.h5
> + ./master /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_master.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_0.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_1.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_2.h5
> + ./writer 2 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_2.h5
> master done
> + ./reader 0 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_master.h5
> + ./reader 1 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_master.h5
> writer 0 done
> writer 1 done
> writer 2 done
> ERROR reader:0 read 0xaabbccddeeff != 0x38a for entry 906 of vds
> ERROR reader:0 read 0xaabbccddeeff != 0x390 for entry 912 of vds
> ERROR reader:0 read 0xaabbccddeeff != 0x396 for entry 918 of vds
> ERROR reader:0 read 0xaabbccddeeff != 0x39c for entry 924 of vds
> ERROR reader:0 read 0xaabbccddeeff != 0x3a2 for entry 930 of vds
>
> I've also tried making the writers sleep before they exit, but I still get the errors.
>

then I found that if I keep that master in that small example alive (this is in this tag and subdirectory:)

https://github.com/slaclab/lc2-hdf5-110/tree/vds_first_missing_not_working_v2/questions/vds_swmr_check_bug

that it worked. So I have just checked in code for the big program where the master does something similar - stays around while the writers are writing. However I still get the error. To see this, with this version of the big code, one does

./launch_local

that runs each piece, recording their output in logfiles, and then the python script sort_logs goes through and reads the milli since the epoch and sorts them all into log.out. The log.out I ran is checked in, and we see we have a fiducial error, search for ERROR in

https://raw.githubusercontent.com/slaclab/lc2-hdf5-110/7b69d5aa51992b0800d7043cfe3ea91ae400d6e2/log.out

but this happens before the daq_master is done.

I should remove that 18MB file from the github.

slaclab / lc2-hdf5-110

VDS + SWMR, Reading bad values #13