gwdetchar / hveto

A Python implementation of the Hierarchical Veto (hveto) algorithm
GNU General Public License v3.0

find_max_significance function is truncating channel names #53

Closed tjma12 closed 8 years ago

tjma12 commented 8 years ago

Auxiliary channel names are still being truncated in my hveto run, which leads to a KeyError when the truncated name is used as a key into the auxiliary channel dictionary.

In the find_max_significance function: https://github.com/hveto/hveto/blob/master/hveto/core.py#L163

The line

rec = stack_arrays([primary] + auxiliary.values(), usemask=False, asrecarray=True, autoconvert=True)

seems to be truncating the name of each auxiliary channel to the same number of characters as the primary channel name.

Example output:

In [15]: rec
Out[15]: 
rec.array([ (1159777938.4375, 64.82205200195312, 6.263309955596924, u'L1:GDS-CALIB_STRAIN'),
 (1159778139.128906, 503.4388427734375, 6.23298978805542, u'L1:GDS-CALIB_STRAIN'),
 (1159778141.105468, 38.02714920043945, 6.714660167694092, u'L1:GDS-CALIB_STRAIN'),
 ..., (1159799329.0439453, 100.0, 10.0, u'L1:SUS-PR2_M3_MASTE'),
 (1159799333.4711914, 100.0, 10.0, u'L1:SUS-PR2_M3_MASTE'),
 (1159799333.6376953, 100.0, 10.0, u'L1:SUS-PR2_M3_MASTE')], 
          dtype=[('time', '<f8'), ('frequency', '<f4'), ('snr', '<f4'), ('channel', '<U19')])

If you flip the order, the truncation doesn't happen:

In [51]: rec2 = stack_arrays(auxiliary.values() + [primary], usemask=False, asrecarray=True, autoconvert=True)

In [52]: rec2
Out[52]: 
rec.array([ (1159777920.553711, 100.0, 10.0, 'L1:SUS-ETMY_L1_MASTER_OUT_UR_DQ_0_DAC'),
 (1159777924.4033203, 100.0, 10.0, 'L1:SUS-ETMY_L1_MASTER_OUT_UR_DQ_0_DAC'),
 (1159777927.3798828, 100.0, 10.0, 'L1:SUS-ETMY_L1_MASTER_OUT_UR_DQ_0_DAC'),
 ...,
 (1159799317.269531, 38.02714920043945, 6.262800216674805, 'L1:GDS-CALIB_STRAIN'),
 (1159799319.789062, 504.7231750488281, 6.5116801261901855, 'L1:GDS-CALIB_STRAIN'),
 (1159799324.429687, 504.7231750488281, 12.190750122070312, 'L1:GDS-CALIB_STRAIN')], 
          dtype=[('time', '<f8'), ('frequency', '<f4'), ('snr', '<f4'), ('channel', 'S42')])

This still seems like a failure mode even with the order flipped: a short auxiliary channel name would truncate the name of the primary channel.
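The truncation and the resulting KeyError can be reproduced without hveto. The following is a minimal sketch, assuming only NumPy: it casts a long channel name into a fixed-width column sized for the primary channel name (19 characters) and then uses the result as a dictionary key. The `auxiliary` dictionary here is a hypothetical stand-in for the real trigger dictionary, and the explicit `astype` call stands in for whatever cast `stack_arrays` performs internally, which may differ between NumPy versions.

```python
import numpy as np

primary_name = u'L1:GDS-CALIB_STRAIN'                 # 19 characters
aux_name = u'L1:SUS-ETMY_L1_MASTER_OUT_UR_DQ_0_DAC'   # 37 characters

# Hypothetical stand-in for the auxiliary trigger dictionary,
# keyed by the full channel name.
auxiliary = {aux_name: np.zeros(3)}

# Casting to a fixed-width unicode dtype silently drops any
# characters beyond the declared width.
width = len(primary_name)
truncated = np.array([aux_name]).astype('U%d' % width)[0]

# The truncated name is no longer a valid key.
try:
    auxiliary[str(truncated)]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(repr(str(truncated)), lookup_failed)
```

NumPy raises no warning on the cast, so the first visible symptom is the failed dictionary lookup.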

duncanmmacleod commented 8 years ago

@tjma12, please post the original traceback from the call to hveto.

tjma12 commented 8 years ago

[hveto 1160152516]     INFO: All aux events loaded
[hveto 1160152516]     INFO: Recorded list of valid auxiliary channels in L1-HVETO_CHANNEL_LIST-1159777561-21774.txt
[hveto 1160152516]     INFO: -- Processing round 1 --
Traceback (most recent call last):
  File "/home/detchar/opt/gwpysoft-2.7/bin/hveto", line 405, in <module>
    primary, auxiliary, pchannel, snrs, windows, round.livetime)
  File "/home/detchar/opt/gwpysoft-2.7/lib/python2.7/site-packages/hveto/core.py", line 172, in find_max_significance
    dt / livetime)
KeyError: u'L1:SUS-SR2_M2_MASTE'

duncanmmacleod commented 8 years ago

@tjma12, also, please post the full command line used, so I can try and reproduce the error.

tjma12 commented 8 years ago

Oops, sorry. Here's the command line, files on LDAS LLO:

hveto -f /home/tjmassin/public_html/hveto/DAC/L1-DAC-1159777561-21774-hoft/DACconfig.ini -i L1 -a /home/tjmassin/public_html/hveto/DAC/L1-DAC-1159777561-21774-hoft/cache.lcf 1159777561 1159799335

tjma12 commented 8 years ago

I've looked into this a little bit more. The fix is to make sure that auxiliary channel names are stored as unicode strings, so that they are compatible with the primary channel name when the arrays are combined.

When the aux channel triggers are read from a cache file, they are cast into a recarray using the gwpy.table.lsctables.to_recarray method. If you don't provide a format, the format is inferred from the input content, in this case a channel name. The auxiliary channel names are being stored in the recarrays as fixed-length byte strings (dtype 'S41' below):

'L1:SUS-TMSY_M1_MASTER_OUT_SD_DQ_65536_DAC': rec.array([], 
           dtype=[('time', '<f8'), ('frequency', '<f4'), ('snr', '<f4'), ('channel', 'S41')])

The primary channel name is stored as a unicode string (dtype '<U19'):

rec.array([ (1159777938.4375, 64.82205200195312, 6.263309955596924, u'L1:GDS-CALIB_STRAIN'),
 (1159778139.128906, 503.4388427734375, 6.23298978805542, u'L1:GDS-CALIB_STRAIN'),
 ...,
 (1159799324.429687, 504.7231750488281, 12.190750122070312, u'L1:GDS-CALIB_STRAIN')], 
          dtype=[('time', '<f8'), ('frequency', '<f4'), ('snr', '<f4'), ('channel', '<U19')])

When you combine these to calculate significance using numpy.lib.recfunctions.stack_arrays, the channel columns have incompatible data types, so the input arrays are cast to a single common dtype, which truncates the names as shown in my earlier comment. If you cast them both to the same dtype first, in this case a unicode string, the stacking works as intended:

In [50]: rec
Out[50]: 
rec.array([ (1159777938.4375, 64.82205200195312, 6.263309955596924, u'L1:GDS-CALIB_STRAIN'),
 (1159778139.128906, 503.4388427734375, 6.23298978805542, u'L1:GDS-CALIB_STRAIN'),
 ...,
 (1159798968.0429688, 100.0, 10.0, u'L1:SUS-TMSX_M1_MASTER_OUT_F2_DQ_0_DAC')], 
          dtype=[('time', '<f8'), ('frequency', '<f4'), ('snr', '<f4'), ('channel', '<U41')])
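One way to apply the cast described above is sketched below. This is not the change that went into hveto: `channel_width` and `stack_with_common_width` are hypothetical helpers written for illustration. The sketch assumes every input array has the same field names and that channel names are ASCII, so that byte strings convert cleanly to unicode. It widens the channel column to the longest declared width before copying, so no name is cut regardless of input order.

```python
import numpy as np

def channel_width(arr, field='channel'):
    """Return how many characters the fixed-width string field can hold."""
    dt = arr.dtype[field]
    if dt.kind == 'U':
        # unicode fields use a fixed number of bytes per character
        return dt.itemsize // np.dtype('U1').itemsize
    return dt.itemsize  # byte strings ('S'): one byte per character

def stack_with_common_width(arrays, field='channel'):
    """Stack structured arrays after widening `field` to one unicode dtype."""
    width = max(channel_width(a, field) for a in arrays)
    first = arrays[0].dtype
    # Same fields as the first array, with the string field widened.
    common = np.dtype([
        (name, 'U%d' % width if name == field else first[name])
        for name in first.names
    ])
    out = np.empty(sum(len(a) for a in arrays), dtype=common)
    start = 0
    for a in arrays:
        stop = start + len(a)
        for name in common.names:
            out[name][start:stop] = a[name]  # field-by-field copy with cast
        start = stop
    return out

base = [('time', '<f8'), ('frequency', '<f4'), ('snr', '<f4')]
primary = np.array(
    [(1159777938.4375, 64.82205200195312, 6.263309955596924,
      u'L1:GDS-CALIB_STRAIN')],
    dtype=base + [('channel', 'U19')])
aux = np.array(
    [(1159777920.553711, 100.0, 10.0,
      b'L1:SUS-ETMY_L1_MASTER_OUT_UR_DQ_0_DAC')],
    dtype=base + [('channel', 'S42')])

stacked = stack_with_common_width([primary, aux])
print(stacked.dtype['channel'], list(stacked['channel']))
```

Reading the width from the dtype rather than from the data keeps the helper usable for empty trigger arrays, such as the empty auxiliary recarray shown earlier.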