Ascan macro freeze when 0D returns a non valid value

izadworny commented 7 years ago

There was a problem with ascan macro on out beamline. It froze unexpectedly. This is ascan output. Ascan was started at 22:49:34, should have 50 measurements and end after about 10 min but "Ctrl-C" was pressed after few hours.

[    43] --> Preparing macro: ascan bds2 75 96 50 10
[    43] --> Running macro: ascan bds2 75 96 50 10

Operation will be saved in /mnt/control/data/bl_170530.dat (Spec)
Scan #108 started at Tue May 30 22:49:34 2017. It will take at least 0:09:25.391224
Moving to start positions...
#Pt No     BDS2      ct02     IMONV1    IMONV2    IMONV3    IMONV4   grpitch-enc  plmpitch-enc   M1TEMP      SR        dt   
   0         75        10     2124.88   1.59506e+07  -0.277471  0.263792  -1.52789e+06  -1.52789e+06    24.3    7.62856   13.5334 
   1       75.42       10     2124.14   1.76417e+07  -0.372828  0.165721  -1.52789e+06  -1.52789e+06    24.2    7.62812   24.6685 
   2       75.84       10     2119.13   1.77969e+07  -0.303234  0.260945  -1.52789e+06  -1.52789e+06    24.2     7.629    35.7365 
   3       76.26       10     2124.75   1.78848e+07  -0.273362  0.185585  -1.52789e+06  -1.52789e+06  24.2396   7.62866   47.0307 
   4       76.68       10     2125.04   1.8e+07   -0.345471  0.20531   -1.52789e+06  -1.52789e+06  24.2594   7.62778   58.1312 
   5        77.1       10     2125.06   1.81269e+07  -0.526012  0.158448  -1.52789e+06  -1.52789e+06  24.2579   7.62779    69.237 
   6       77.52       10     2122.48   1.82874e+07  -0.218547  0.289402  -1.52789e+06  -1.52789e+06    24.2    7.62709   80.3417 
   7       77.94       10     2124.22   1.83788e+07  -0.198406  0.269439  -1.52789e+06  -1.52789e+06  24.2296   7.62696   91.4397 
   8       78.36       10     2123.63   1.84869e+07  -0.233241  0.179774  -1.52789e+06  -1.52789e+06  24.2307   7.62563   102.531 
   9       78.78       10     2119.56   1.85516e+07  -0.378964  0.178865  -1.52789e+06  -1.52789e+06  24.2265   7.62482    113.62 
   10       79.2       10     2124.42   1.86319e+07  -0.306193  0.150186  -1.52789e+06  -1.52789e+06   24.201   7.62561   124.688 
   11      79.62       10     2119.43   1.86729e+07  -0.381318  0.24647   -1.52789e+06  -1.52789e+06  24.2194   7.62505   135.742 
   12      80.04       10     2118.82   1.87215e+07  -0.239907  0.21581   -1.52789e+06  -1.52789e+06  24.2306   7.62395   146.897 
   13      80.46       10     2121.63   1.87222e+07  -0.310567  0.205887  -1.52789e+06  -1.52789e+06  24.2596   7.62498   158.005 
   14      80.88       10     2122.36   1.87432e+07  -0.348965  0.0529502  -1.52789e+06  -1.52789e+06    24.2    7.62436    169.09 
   15       81.3       10     2128.03   1.87739e+07  -0.380928  0.0616244  -1.52789e+06  -1.52789e+06  24.2309   7.62389   180.179 
   16      81.72       10     2124.48   1.8848e+07  -0.28046  0.0990995  -1.52789e+06  -1.52789e+06    24.2    7.62337   191.253 
   17      82.14       10     2128.27   1.88964e+07  -0.367267  0.135994  -1.52789e+06  -1.52789e+06   24.261   7.62272   202.349 
   18      82.56       10     2119.62   1.89473e+07  -0.321498  0.169024  -1.52789e+06  -1.52789e+06  24.2046   7.62168   213.551 
   19      82.98       10     2123.92   1.89906e+07  -0.36326  0.164495  -1.52789e+06  -1.52789e+06  24.2133   7.62255   224.679 
   20       83.4       10     2125.82   1.90338e+07  -0.355728  0.203936  -1.52789e+06  -1.52789e+06    24.2    7.62249    235.75 
   21      83.82       10     2124.23   1.90312e+07  -0.26281  0.164816  -1.52789e+06  -1.52789e+06  24.2322   7.62212    246.81 
   22      84.24       10     2121.43   1.90102e+07  -0.411703  0.143533  -1.52789e+06  -1.52789e+06  24.2212   7.62218   257.975 
   23      84.66       10     2120.14   1.90316e+07  -0.329345  0.119595  -1.52789e+06  -1.52789e+06    24.2    7.62189    269.05 
   24      85.08       10     2118.95   1.90093e+07  -0.364902  0.0692155  -1.52789e+06  -1.52789e+06  24.2287   7.62062   280.142 
   25       85.5       10     2119.75   1.89626e+07  -0.43119  0.0840356  -1.52789e+06  -1.52789e+06  24.2205   7.61948   291.232 
   26      85.92       10     2117.06   1.8904e+07  -0.335888  0.137084  -1.52789e+06  -1.52789e+06  24.2318   7.62007   302.352 
   27      86.34       10     2122.66   1.88777e+07  -0.360147  0.304598  -1.52789e+06  -1.52789e+06  24.2614   7.61976   313.537 
   28      86.76       10       2122    1.87959e+07  -0.416127  0.158113  -1.52789e+06  -1.52789e+06  24.2626   7.61948   324.672 
   29      87.18       10     2120.53   1.87129e+07  -0.468897  0.065675  -1.52789e+06  -1.52789e+06  24.2338   7.61755   335.746 
   30       87.6       10     2123.69   1.86612e+07  -0.515321  0.0544245  -1.52789e+06  -1.52789e+06  24.2396   7.61756   346.811 
   31      88.02       10     2121.56   1.86265e+07  -0.354043  0.116921  -1.52789e+06  -1.52789e+06    24.2    7.61568   357.953 
   32      88.44       10     2119.59   1.8572e+07  -0.250709  0.242385  -1.52788e+06  -1.52789e+06  24.2599   7.61646   369.047 
   33      88.86       10     2122.09   1.85486e+07  -0.32834  0.238271  -1.52788e+06  -1.52789e+06    24.2    7.61704   380.143 
   34      89.28       10     2119.32   1.85107e+07  -0.163756  0.250877  -1.52788e+06  -1.52789e+06  24.2078   7.61672   391.229 
   35       89.7       10     2115.24   1.84676e+07  -0.382242  0.192605  -1.52788e+06  -1.52789e+06  24.2674   7.61542   402.309 
   36      90.12       10     2122.78   1.84358e+07  -0.358116  0.167207  -1.52788e+06  -1.52789e+06    24.2    7.61528   413.412 
   37      90.538      10     2122.83   1.84568e+07  -0.215879  0.181431  -1.52789e+06  -1.52789e+06  24.2663   7.61529   424.653 
   38      90.96       10      2123.4   1.85022e+07  -0.359784  0.0948528  -1.52789e+06  -1.52789e+06  24.2405   7.61412   435.721 
   39      91.38       10     2123.99   1.85598e+07  -0.394449  0.112648  -1.52789e+06  -1.52789e+06  24.2313   7.61517   446.811 
   40       91.8       10     2122.59   1.86455e+07  -0.352623  0.346769  -1.52789e+06  -1.52789e+06  24.2305   7.61449   457.938 
^C
Ctrl-C received: Stopping... Done!

I have checked:

/var/log/messages - nothing
MacroServer error log - nothing
Pool error log - there is a piece of this log:

Dummy-23       WARNING  2017-05-30 22:57:26,177 BDS7: Error firing event <EventType(name=Position, priority=2), Position(value=<Error>)>
Dummy-24       WARNING  2017-05-30 22:57:28,148 PREPZ: Error firing event <EventType(name=Position, priority=2), Position(value=<Error>)>
Dummy-27       WARNING  2017-05-30 22:57:32,426 PREPR: Error firing event <EventType(name=Position, priority=2), Position(value=<Error>)>
SardanaTP.W004 ERROR    2017-05-30 22:57:33,979 SardanaTP.SardanaTP.W004: Uncaught exception running job '_asynch_action_loop' called from thread Dummy-82:
  File "/usr/lib/python2.7/site-packages/sardana/tango/pool/MeasurementGroup.py", line 202, in Start
    self.measurement_group.start_acquisition()
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolmeasurementgroup.py", line 508, in start_acquisition
    self.acquisition.run(**kwargs)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolacquisition.py", line 73, in run
    return self._run_single(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolacquisition.py", line 95, in _run_single
    zd_acq.run(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolaction.py", line 339, in run
    get_thread_pool().add(self._asynch_action_loop, None, context)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/taurus/core/util/threadpool.py", line 135, in run
    cmd(*args, **kw)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolaction.py", line 411, in _asynch_action_loop
    self.action_loop()
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolacquisition.py", line 469, in action_loop
    acquirable.put_value(value)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolzerodexpchannel.py", line 299, in put_value
    return self.put_current_value(value, propagate=propagate)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolzerodexpchannel.py", line 282, in put_current_value
    acc_val_attr.append_value(value, propagate=propagate)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolzerodexpchannel.py", line 192, in append_value
    self.accumulation.append_value(value.value, value.timestamp)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolzerodexpchannel.py", line 66, in append_value
    self.buffer[0][idx] = value
ValueError: could not convert string to float: ERROR System Error
EventTH.W001   ERROR    2017-05-30 22:57:34,246 EventTH.EventTH.W001: Uncaught exception running job 'set_attribute_push' called from thread SardanaTP.W004:
  File "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/site-packages/taurus/core/util/threadpool.py", line 135, in run
    cmd(*args, **kw)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolaction.py", line 411, in _asynch_action_loop
    self.action_loop()
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolacquisition.py", line 469, in action_loop
    acquirable.put_value(value)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolzerodexpchannel.py", line 299, in put_value
    return self.put_current_value(value, propagate=propagate)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolzerodexpchannel.py", line 279, in put_current_value
    curr_val_attr.set_value(value, propagate=propagate)
  File "/usr/lib/python2.7/site-packages/sardana/sardanaattribute.py", line 125, in set_value
    propagate=propagate)
  File "/usr/lib/python2.7/site-packages/sardana/sardanaattribute.py", line 133, in _set_value
    self.fire_read_event(propagate=propagate)
  File "/usr/lib/python2.7/site-packages/sardana/sardanaattribute.py", line 280, in fire_read_event
    self.fire_event(evt_type, self)
  File "/usr/lib/python2.7/site-packages/sardana/sardanaevent.py", line 110, in fire_event
    self._fire_event(event_type, event_value, listeners=listeners)
  File "/usr/lib/python2.7/site-packages/sardana/sardanaevent.py", line 132, in _fire_event
    real_listener(self, event_type, event_value)
  File "/usr/lib/python2.7/site-packages/sardana/pool/poolbasechannel.py", line 75, in on_change
    self.fire_event(evt_type, evt_value)
  File "/usr/lib/python2.7/site-packages/sardana/sardanabase.py", line 131, in fire_event
    listeners=listeners)
  File "/usr/lib/python2.7/site-packages/sardana/sardanaevent.py", line 110, in fire_event
    self._fire_event(event_type, event_value, listeners=listeners)
  File "/usr/lib/python2.7/site-packages/sardana/sardanaevent.py", line 132, in _fire_event
    real_listener(self, event_type, event_value)
  File "/usr/lib/python2.7/site-packages/sardana/tango/pool/ZeroDExpChannel.py", line 120, in on_zerod_changed
    synch=False)
  File "/usr/lib/python2.7/site-packages/sardana/tango/core/SardanaDevice.py", line 296, in set_attribute
    priority=priority, synch=synch)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/taurus/core/util/threadpool.py", line 135, in run
    cmd(*args, **kw)
  File "/usr/lib/python2.7/site-packages/sardana/tango/core/SardanaDevice.py", line 308, in set_attribute_push
    error=error, priority=priority)
  File "/usr/lib/python2.7/site-packages/sardana/tango/core/SardanaDevice.py", line 404, in _set_attribute_push
    self.push_change_event(*args)
TypeError: Expecting a numeric type, but it is not. If you use a numpy type instead of python core types, then it must exactly match (ex: numpy.int32 for PyTango.DevLong)
Dummy-23       WARNING  2017-05-30 22:57:36,173 BDS7: Error firing event <EventType(name=Position, priority=2), Position(value=<Error>)>
Dummy-24       WARNING  2017-05-30 22:57:38,150 PREPZ: Error firing event <EventType(name=Position, priority=2), Position(value=<Error>)>
Dummy-27       WARNING  2017-05-30 22:57:42,425 PREPR: Error firing event <EventType(name=Position, priority=2), Position(value=<Error>)>

Has anyone have similar issue? What could be the reason of this issue?

reszelaz commented 7 years ago

It looks to me like a problem with the value returned by the 0D channel. I suspect that the ReadOne method returned a string instead of a float.

Maybe answers to these questions will make us closer to finding the reason of this issue.

Which 0D controller do you use?
How many 0D channels do you have in the measurement group? More than one?
What do they read?
Is it easily reproducible? Does it also happen without aborting?

izadworny commented 7 years ago

In active measurement group there are 8 0D channels:

           Name              Type                   Controller   Axis
 -------------- ----------------- ---------------------------- ------
         IMONV1   ZeroDExpChannel          BL-04BM-DIA-AMPMCU1      1
         IMONV2   ZeroDExpChannel          BL-04BM-DIA-AMPMCU1      2
         IMONV3   ZeroDExpChannel          BL-04BM-DIA-AMPMCU1      3
         IMONV4   ZeroDExpChannel          BL-04BM-DIA-AMPMCU1      4
         M1TEMP   ZeroDExpChannel                  bl-04bm-plc      1
             SR   ZeroDExpChannel                     TANGO-R1      1
    grpitch-enc   ZeroDExpChannel   bl-04bmcab04-ctl-ipapenc01      4
   plmpitch-enc   ZeroDExpChannel   bl-04bmcab04-ctl-ipapenc01      2

                         Name              Type                                 Class                                Module
 ---------------------------- ----------------- ------------------------------------- -------------------------------------
          BL-04BM-DIA-AMPMCU1   ZeroDExpChannel              TangoAttrZeroDController                    TangoAttrZeroDCtrl
                  bl-04bm-plc   ZeroDExpChannel              TangoAttrZeroDController                    TangoAttrZeroDCtrl
                     TANGO-R1   ZeroDExpChannel              TangoAttrZeroDController                    TangoAttrZeroDCtrl
   bl-04bmcab04-ctl-ipapenc01   ZeroDExpChannel          IcepapEncoderZeroDController                IcepapEncoderZeroDCtrl

IMONV1, IMONV2, IMONV3, IMONV4 - those are Keithley6482 picoammeters' channels. They read current from diagnostics elements, M1TEMP - mirror 1 temperature from PLC, SR - storage ring current from calculating device server, grpitch-enc, plmpitch-enc - monochromator grating and mirror encoders (IcePAP). First 6 0D channels read from Tango devices, grpitch-enc, plmpitch-enc read directly from IcePAP.

This issue has happened once. As you can see there are no errors or warnings on ascan output, it just froze. Ascan started at 22:49:34, the errors from Pool were reported at around 22:56 and the aborting took place next day in the morning at about 8am when we noticed that ascan stopped on 40th point.

reszelaz commented 7 years ago

Good, that means that it is is not related to the abort. It was aborted afterwards.

With so many channels it will be hard to track which one gave the wrong readout.

At some convenient moment to you could you test the following: Modify one of the controllers, so that it returns a string that can not be casted to a float e.g. "error". Make it return this string in the middle of the scan. Then, if you observe the same behavior i. e. frozen scan and the same exception traceback in the pool it would confirm my theory.

izadworny commented 7 years ago

Yes, you are right. We can reproduce exact the same behavior when the controller returns a string instead of float. Now we can modify controller to avoid this error. Thank you for your help.

reszelaz commented 7 years ago

For the moment a fix on the controller level is a good workaround. As a general solution I think that we should protect that in the Sardana core. So at least the scan does not freeze. Let's keep this ticket opened until a proper solution is implement.

Thanks to you for the report!

sardana-org / sardana

Ascan macro freeze when 0D returns a non valid value #475