project8 / psyllid

Data acquisition package for the ROACH2 system

Psyllid crash on Zeppelin: "pure virtual method called" in HDF5 somewhere #35

Open nsoblath opened 7 years ago

nsoblath commented 7 years ago
pure virtual method called
terminate called without an active exception
HDF5-DIAG: Error detected in HDF5 (1.8.16) thread 139886732498688:
  #000: ../../../src/H5D.c line 993 in H5Dset_extent(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
19:08:19 [ERROR] /core/diptera.cc(292): non-node exception thrown: HDF5 error while writing a record:
    H5Dset_extent failed (function: DataSet::extend)
19:08:19 [ERROR] _receiver_fpa.cc(244): Exiting due to stream error
19:08:19 [ERROR] l/daq_control.cc(182): An unknown exception was thrown from midge: HDF5 error while writing a record:
    H5Dset_extent failed (function: DataSet::extend)
19:08:19 [ERROR] l/daq_control.cc(200): Canceling due to midge error
HDF5: infinite loop closing library
      D,G,A,S,T,F,FD,P,FD,P,FD,P,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,
E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E
buzinsky commented 7 years ago

psyllid_ch_a-stderr---supervisor-INAD9P.txt

buzinsky commented 7 years ago

psyllid_interface-stderr---supervisor-LUb12D.txt

r2_ch_a-stderr---supervisor-nIcGyZ.txt

psyllid_ch_a-stdout---supervisor-hfyasA.txt

nsoblath commented 7 years ago

I saw it suggested that one potential cause of the "pure virtual method called" error is calling a virtual method from a destructor: by the time the call is made, the part of the object that's supposed to provide the function may already have been destroyed.

I checked through the classes in the control and daq libraries, in midge/core, and in monarch3, and I didn't find any suspicious destructors.
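
For reference, here is the suggested mechanism in a form with well-defined behavior. This is a minimal sketch, not code from psyllid, midge, or monarch; `Base`, `Derived`, and `destructor_call_log()` are hypothetical names. It uses a non-pure virtual function so the result is defined: while `~Base()` runs, the `Derived` part of the object is already gone, so the call resolves to `Base::name()`. If `name()` were pure virtual in `Base` and reached from the destructor, there would be no implementation to dispatch to, and GCC's runtime typically aborts with "pure virtual method called".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical classes for illustration only.
struct Base
{
    explicit Base( std::vector< std::string >& log ) : fLog( log ) {}
    virtual ~Base()
    {
        // During ~Base() the dynamic type of the object is Base again,
        // so this call resolves to Base::name(), never Derived::name().
        fLog.push_back( "in ~Base: " + name() );
    }
    virtual std::string name() const { return "Base"; }

    std::vector< std::string >& fLog;
};

struct Derived : Base
{
    using Base::Base;
    ~Derived() override
    {
        // The Derived part is still alive here, so Derived::name() is used.
        fLog.push_back( "in ~Derived: " + name() );
    }
    std::string name() const override { return "Derived"; }
};

// Creates and destroys a Derived object and returns what each destructor saw.
std::vector< std::string > destructor_call_log()
{
    std::vector< std::string > log;
    {
        Derived d( log );
        assert( d.name() == "Derived" ); // normal dispatch while the object is alive
    }
    return log;
}
```

Calling `destructor_call_log()` returns two entries: the `~Derived` entry reports `Derived`, and the `~Base` entry reports `Base`. A destructor (or anything it calls, possibly from another thread) that reaches a virtual function in this way is the pattern to look for.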

ldeviveiros commented 7 years ago

It happened again on Monday at 1:53 pm EST. I recorded the error message in the elog: https://maxwell.npl.washington.edu/elog/project8/Project+8/1683 (not copying it here because it's the same as above).

nsoblath commented 7 years ago

Here's what I know so far:

The "infinite loop closing library" message is probably the result of HDF5 functions being called after the global HDF5 cleanup has already run. This is a secondary problem, caused by something else going wrong.

The root cause, I believe, is described in this section:

HDF5-DIAG: Error detected in HDF5 (1.8.16) thread 139886732498688:
  #000: ../../../src/H5D.c line 993 in H5Dset_extent(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
19:08:19 [ERROR] /core/diptera.cc(292): non-node exception thrown: HDF5 error while writing a record:
    H5Dset_extent failed (function: DataSet::extend)

In the C++ library, DataSet::extend() calls the function H5Dset_extent() in the C library. The latter function raises the error in this bit of code:

    if(NULL == (dset = (H5D_t *)H5I_object_verify(dset_id, H5I_DATASET)))
    HGOTO_ERROR(H5E_ARGS, H5E_BADTYPE, FAIL, "not a dataset")

This is checking whether one of the arguments, dset_id, which should be the ID number of the dataset, is in fact a dataset. The dataset ID comes from member variable id of the DataSet C++ object.

The only place that DataSet::extend() is called in Monarch is in M3Stream.cc, at line 453, in function M3Stream::WriteRecord(). There, extend() is called on fH5CurrentAcqDataSet, which is a pointer to a DataSet object. I assume the pointer itself is valid, because otherwise I'd expect a segfault rather than the crash that we have. Perhaps the DataSet object isn't initialized correctly: in the default constructor, its id is initialized to 0.
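
To make the possibilities concrete, here is a toy model of the ID check. It is not HDF5's implementation; `MockIdRegistry`, `IdType`, `mock_set_extent()`, and `check_scenarios()` are hypothetical names, and the three failing cases are guesses at how a non-null pointer could still carry an ID that fails the check: an ID of 0, an ID that has already been closed, or an ID that belongs to a different kind of object.

```cpp
#include <cassert>
#include <map>

// Simplified stand-in for HDF5's ID bookkeeping; all names are hypothetical.
enum class IdType { File, Group, Dataset };

class MockIdRegistry
{
public:
    // Register an object of the given type and return its new ID.
    int open( IdType type )
    {
        int id = fNextId++;
        fTypes[ id ] = type;
        return id;
    }

    // Forget an ID; any later verify() on it fails.
    void close( int id ) { fTypes.erase( id ); }

    // Analogue of the "is this ID a dataset?" check: true only if the ID
    // is currently registered and has the expected type.
    bool verify( int id, IdType expected ) const
    {
        auto it = fTypes.find( id );
        return it != fTypes.end() && it->second == expected;
    }

private:
    int fNextId = 1; // 0 is never handed out, so a zero ID never verifies
    std::map< int, IdType > fTypes;
};

// Mimics the failing call: it succeeds only when the ID verifies as a dataset.
bool mock_set_extent( const MockIdRegistry& reg, int dset_id )
{
    if( ! reg.verify( dset_id, IdType::Dataset ) ) return false; // "not a dataset"
    return true;
}

// Walks through one passing case and the three hypothesized failing cases.
void check_scenarios()
{
    MockIdRegistry reg;
    int dset = reg.open( IdType::Dataset );
    int group = reg.open( IdType::Group );

    assert( mock_set_extent( reg, dset ) );    // live dataset ID: accepted
    assert( ! mock_set_extent( reg, 0 ) );     // zero ID: rejected
    assert( ! mock_set_extent( reg, group ) ); // wrong object type: rejected

    reg.close( dset );
    assert( ! mock_set_extent( reg, dset ) );  // ID used after close: rejected
}
```

In this model all three failing cases produce the same "not a dataset" outcome, so the error message alone can't distinguish them; the actual ID value is needed.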

I've added some diagnostic printing to the exception catching in M3Stream::WriteRecord() (starting at line 468):

            LWARN( mlog, "DIAGNOSTIC: id of fH5CurrentAcqDataSet: " << fH5CurrentAcqDataSet->getId() );
            LWARN( mlog, "DIAGNOSTIC: class name: " << fH5CurrentAcqDataSet->fromClass() );
            H5D_space_status_t t_status;
            fH5CurrentAcqDataSet->getSpaceStatus( t_status );
            LWARN( mlog, "DIAGNOSTIC: offset: " << fH5CurrentAcqDataSet->getOffset() << "  space status: " << t_status << "  storage size: " << fH5CurrentAcqDataSet->getStorageSize() << "  in mem data size: " << fH5CurrentAcqDataSet->getInMemDataSize() );

These should tell us how the DataSet object is configured, to some extent.
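
One note for reading the output: streaming `t_status` prints the enum as an integer. The sketch below maps those integers to names. It assumes the values of `H5D_space_status_t` are -1, 0, 1, and 2 as I understand them from the HDF5 headers; please check against the installed `H5Dpublic.h`. `MirrorSpaceStatus`, `space_status_name()`, and `check_space_status_names()` are hypothetical helpers that use a local mirror of the enum so the example doesn't need HDF5 to compile.

```cpp
#include <cassert>
#include <string>

// Local mirror of H5D_space_status_t (assumed values; verify against H5Dpublic.h).
enum MirrorSpaceStatus
{
    kSpaceStatusError = -1,
    kSpaceStatusNotAllocated = 0,
    kSpaceStatusPartAllocated = 1,
    kSpaceStatusAllocated = 2
};

// Translate the integer printed in the DIAGNOSTIC line into a readable name.
std::string space_status_name( int status )
{
    switch( status )
    {
        case kSpaceStatusError:         return "H5D_SPACE_STATUS_ERROR";
        case kSpaceStatusNotAllocated:  return "H5D_SPACE_STATUS_NOT_ALLOCATED";
        case kSpaceStatusPartAllocated: return "H5D_SPACE_STATUS_PART_ALLOCATED";
        case kSpaceStatusAllocated:     return "H5D_SPACE_STATUS_ALLOCATED";
        default: return "unrecognized status " + std::to_string( status );
    }
}

// Small self-check of the mapping, including an out-of-range value.
void check_space_status_names()
{
    assert( space_status_name( 2 ) == "H5D_SPACE_STATUS_ALLOCATED" );
    assert( space_status_name( 7 ) == "unrecognized status 7" );
}
```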

For the record, during writing, fH5CurrentAcqDataSet is initialized on line 447:

                fH5CurrentAcqDataSet = new H5::DataSet( fH5AcqLoc->createDataSet( fAcqNameBuffer, fDataTypeInFile, H5::DataSpace( N_DATA_DIMS, fStrDataDims, fStrMaxDataDims ), tPropList ) );