davidslac opened this issue 7 years ago
I changed the code to just print an error when the check fails, rather than throw an exception. I'm being more careful with the memory I read into; I even added a guard check when calling the high-level H5LDget_dset_dims. I still get these errors on about 15-26% of the fiducials when going through them. I think I will have to make a small example and see if it is an HDF5 bug.
I did find a bug in a small example; here is the bug report:
Hi David,
I can reproduce the error, and entered bug HDFFV-10176 for the issue.
Thanks!
-Barbara
========================================================
Barbara Jones, The HDF Group Helpdesk, help@hdfgroup.org
Support Services: https://www.hdfgroup.org/support/
========================================================
On Wed, 19 Apr 2017, David A. Schneider wrote:
> VERSION:
> HDF5-1.10.0 patch1
> HDF5-1.10.1-pre1
>
> USER:
> David Schneider davidsch@slac.stanford.edu
>
> SYNOPSIS:
> Using H5Pset_virtual_view with the H5D_VDS_FIRST_MISSING flag does not work through SWMR
>
> MACHINE / OPERATING SYSTEM:
> Linux psanaphi104 3.19.8-1.el7.x86_64 #1 SMP Wed Aug 3 18:10:48 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux
>
> COMPILER:
> g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)
>
>
> DESCRIPTION:
> I have created a small example with three writers: the first writes 0,3,6,...,
> the second 1,4,7,..., and the third 2,5,8,...
> Then a master process makes a VDS dataset that is a round robin, 0,1,2,3,4,...,
> and two readers read from this using SWMR; the first reads even entries, the second odd.
>
> The readers use H5Pset_virtual_view to specify first missing, meaning they should
> never see a missing value; however, they do.
>
>
> REPEAT BUG BY:
>
> tar xfv vds_swmr_missing.tar.gz
> cd vds_swmr_check_bug
> # edit driver so the LOC points to a directory on a filesystem that supports SWMR
> ./driver
>
> (the tar is also in the tag: https://github.com/slaclab/lc2-hdf5-110/tree/vds_first_missing_not_working_v2, in the questions/vds_swmr_check_bug directory)
> if you get messages like
>
> ERROR reader:0 read 0xaabbccddeeff != 0x450 for entry 1104 of vds
>
> then you are reading missing values.
> Other details: I build HDF5 1.10.0 like so, against openmpi 1.10.6:
>
> ./configure --prefix=$PREFIX \
> --enable-build-mode=production \
> --with-szlib=$PREFIX \
> --enable-threadsafe \
> --enable-unsupported \
> --enable-cxx \
> --with-default-api-version=v18 \
> --enable-parallel
>
> while 1.10.1-pre1 is currently a debug build, also against openmpi 1.10.6:
>
>
> ./configure --prefix=$PREFIX \
> --with-szlib=$PREFIX \
> --enable-threadsafe \
> --enable-unsupported \
> --enable-cxx \
> --enable-build-mode=debug \
> --enable-trace \
> --enable-parallel
>
>
> Here is some output when I do
>
> ./driver 2>&1 | less
>
> + LOC=/reg/d/ana01/temp/davidsch/lc2/runA
> + '[' '!' -d /reg/d/ana01/temp/davidsch/lc2/runA ']'
> + rm /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_master.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_0.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_1.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_2.h5
> + h5c++ -Wall writer.cpp -o writer
> + h5c++ -Wall master.cpp -o master
> + h5c++ -Wall reader.cpp -o reader
> + ./writer 0 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_0.h5
> + ./writer 1 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_1.h5
> + ./master /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_master.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_0.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_1.h5 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_2.h5
> + ./writer 2 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_writer_2.h5
> master done
> + ./reader 0 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_master.h5
> + ./reader 1 /reg/d/ana01/temp/davidsch/lc2/runA/vds_swmr_master.h5
> writer 0 done
> writer 1 done
> writer 2 done
> ERROR reader:0 read 0xaabbccddeeff != 0x38a for entry 906 of vds
> ERROR reader:0 read 0xaabbccddeeff != 0x390 for entry 912 of vds
> ERROR reader:0 read 0xaabbccddeeff != 0x396 for entry 918 of vds
> ERROR reader:0 read 0xaabbccddeeff != 0x39c for entry 924 of vds
> ERROR reader:0 read 0xaabbccddeeff != 0x3a2 for entry 930 of vds
>
> I've also tried making the writers sleep before they exit, but I still get the errors.
>
Then I found that if I keep the master in that small example alive (this is in this tag and subdirectory:)
it worked. So I have just checked in code for the big program where the master does something similar: it stays around while the writers are writing. However, I still get the error. To see this, with this version of the big code, one does
./launch_local
That runs each piece, recording each one's output in a logfile; then the Python script sort_logs goes through, reads the milliseconds since the epoch, and sorts them all into log.out. The log.out I ran is checked in, and we can see we have a fiducial error; search for ERROR in
but this happens before the daq_master is done.
I should remove that 18MB file from GitHub.
Right now, I have three daq_writers; each contributes to a round robin of a 'fiducials' dataset. There is one master creating a VDS from those, and there are two readers, each reading from that master.
I'm getting a garbage read, and it is always for an event like 301 or 501 or 701. That seemed to correlate only with the flush cycle for the writers: they were flushing their datasets on an interval of 100, but then I tried 71 and got the check violation around 501. The chunk size for the datasets varies; it was 600, but I've tried 10.
If I put an H5Drefresh in right before all the reads, it runs a lot slower, but I don't see the corruption.
My debugging output looks like
the number is milliseconds since the epoch.
Here's another one where the flush interval is 71 and the chunk size is 10. I see the problem in both the 1.10.1-pre1 debug build and the 1.10.0 production build (below is 1.10.0 prod).
For that one, we can see the previous flushes and the write of this fiducial, 501, about 200 milliseconds earlier:
I'm going to have to try to make a small example to reproduce it. Maybe it's a bug in my code? Or maybe in HDF5?