taurus-org / taurus

Moved to https://gitlab.com/taurus-org/taurus
http://taurus-scada.org
43 stars 46 forks source link

Reconsider Tango events and Taurus polling design #1092

Open reszelaz opened 4 years ago

reszelaz commented 4 years ago

This issue is to summarize the design discussion with @cpascual about the Tango change events and Taurus polling relation.

We have a feeling that the current relation between these two provokes situations extreamly difficult to control in the application and in some cases of waste of resources. Most probably this originates from old Tango systems where the events worked thanks to the notifd.

What we would like to achieve is that:

  1. TangoAttribute subscribing to events would start polling only if the corresponing DS is running and if the events subscription fails cause these are not configured. This translates into the following DevFailed reasons: API_AttributePollingNotStarted, API_EventPropertiesNotSet. Note that API_DSFailedRegisteringEvent and API_PollObjNotFound could be also evaluated but would need to be re-checked if these may really occur (I could not find them in theis docs). This scenario should not fallback to stateless subscription. If it happens that you configure events after the Taurus application start-up, then you should restart it. If there is a real necessity for this use case we could consider supporting it.
  2. In all the other cases it should simply fallback to subsribe in stateless mode - events will start working whenever available.
  3. None of the error events should enable polling, especially the one of the API_EventTimeout reason. This error is raised either due to the DS crash - no need to poll cause no one will reply or due to the heartbeat event not processed on time. This second case could happen cause you have the event consumer thread too busy - the events will be processed but with delay so we believe there is no need to stress the system more with Taurus polling. Or you don't receive hearbeat events at all - then you have a bigger problem and it is better to fix it instead of sweeping under the rug with Taurus polling.

All the above reasoning was done based on the Tango systems 8 or higher. We were thinking in maintaining the backwards compatibility for the older versions.

Some ideas for the implementation:

  1. The interrogation of the Tango version on the server side can be done with DeviceProxy.get_tango_lib_version() (3ms penalty for first attribute of a device and 30 us for the next ones from the same device) and this would need to be done in the first event callback and filled in the cache per attribute. We called it REASONS_TO_POLL. So to fullfill with the backwards compatibility for Tango < 8 this will become basically the current EVENT_TO_POLLING_EXCEPTIONS list and for the newer versions this become an empty list.

  2. If the above mentioned extra time penalty is not acceptable for pure Tango > 7 systems or one would like to sacrify the notifd failures fallback to Taurus polling we could add a tauruscustomsetting for always using the new behavior.

It would be nice if some Tango experts could review this reasoning. Please, @tiagocoutinho, @sergirubio, ... could you review this, correct, complete or comment. Many thanks in advance!

tiagocoutinho commented 4 years ago

@reszelaz sorry for editing your message. I just replace bullets with numbers to be easier to mention your points. I hope you don't mind

tiagocoutinho commented 4 years ago

Overall I agree that this is overly complex and I am all in favour of simplifying the algorithm.

Considering your point 1:

  1. agree that if the server event configuration is changed then the gui should have to be restarted to reflect the changes
  2. The policy of subscribing first would be great if tango was not serializing event subscriptions at the process level. I understand that what you propose would not make things worse than they are but I think it is important that the solution should fix the typical long startup of GUIs caused by this limitation (see my proposal below)

Point 3: agree

Proposal:

  1. Why not simply start by polling and at the same time register subscriptions in a queue handled by a separate thread (a single subscription thread would be enough). When a subscription succeeds we stop polling and turn to events.
  2. Would you consider not subscribing to configuration events at all? I would simply read once the attribute configuration and use it during the taurus attribute lifetime. We loose the ability to to be notified if configuration changes but consider that this is very rare (as rare as changing a server attribute to start polling internally and sending events. maybe even more rare)
  3. A side note related to above: I would ask tango c++ if they consider adding the unit to the fields sent when we read an attribute. It is the only field we are missing to be able to say that the value is completely specified by a single atomic response (maybe we would need to wait for tango 10 since this would change the IDL)
  4. What about avoiding to create a state attribute when the device is created?

Sorry I raise more questions than solutions :-P

sergirubio commented 4 years ago

Hi Tiago,

Your points 1,2 and 3 is what we did in the control room a year ago to solve the event-related problems in RF gui's.

It is our intention to apply the same solution to PC gui's in the next weeks.

A patch and a recipe already exists in taurus to implement this solution, using an option at startup to not register events and then a WorkerThread class (I think it is still part of taurus) to subscribe to events in the background.

So, I agree with your proposals 1-3, as they are already implemented. But, I do not understand the last point ... why to avoid state attribute!?!?

Sergi

Overall I agree that this is overly complex and I am all in favour of simplifying the algorithm.

Considering your point 1:

  1. agree that if the server event configuration is changed then the gui should have to be restarted to reflect the changes
  2. The policy of subscribing first would be great if tango was not serializing event subscriptions at the process level. I understand that what you propose would not make things worse than they are but I think it is important that the solution should fix the typical long startup of GUIs caused by this limitation (see my proposal below)

Point 3: agree

Proposal:

  1. Why not simply start by polling and at the same time register subscriptions in a queue handled by a separate thread (a single subscription thread would be enough). When a subscription succeeds we stop polling and turn to events.
  2. Would you consider not subscribing to configuration events at all? I would simply read once the attribute configuration and use it during the taurus attribute lifetime. We loose the ability to to be notified if configuration changes but consider that this is very rare (as rare as changing a server attribute to start polling internally and sending events. maybe even more rare)
  3. A side note related to above: I would ask tango c++ if they consider adding the unit to the fields sent when we read an attribute. It is the only field we are missing to be able to say that the value is completely specified by a single atomic response (maybe we would need to wait for tango 10 since this would change the IDL)
  4. What about avoiding to create a state attribute when the device is created?

Sorry I raise more questions than solutions :-P

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/taurus-org/taurus/issues/1092#issuecomment-601610702

cpascual commented 4 years ago

Replying to both @sergirubio and @tiagocoutinho :

regarding initialization

Why not simply start by polling and at the same time register subscriptions in a queue handled by a separate thread

A patch and a recipe already exists in taurus to implement this solution,

Good point @sergirubio . I guess the API you are referring to is the one implemented in #605 @tiagocoutinho would this be what you had in mind?

In such case, it would really be nice to have some snippet that we can use for testing

regarding state attribute reference

What about avoiding to create a state attribute when the device is created?

@tiagocoutinho : are you proposing not to keep the state object attribute reference in the device obbject at all? or just to delay creation until someone accesses it?

Delaying might be feasible, but not having it may break the existing API unless we accept re-creating the object every time it is needed...

On the other hand, not maintaing the reference would avoid one circular reference problem that we currently have (but this could also be dealt with a weakref)

regarding (not) switching to poll on errors

This refers to points 1 and 3 from @reszelaz comment: I understand that @tiagocoutinho agrees to change the current behaviour and not switching to polling e.g. on API_EventTimeout (at least for Tango>7 on the server).

More opinions on this? @sergirubio ?

reszelaz commented 4 years ago

regarding initialization

I agree that starting with polling and then switching to events will probably give more benefits.

regarding state attribute reference:

The creation of the state attribute is already delayed.

regarding not subscribing to configuration events

+1 for suggesting Tango to include unit in the attribute read value.

Now quickly thinking, in Sardana we use the attribute config (range) for software limits of moveables. The PMTV (GUI) is relying on the limits configuration with buttons to move to the limit. So not having the events would mean reading the configuration everytime we use these buttons. I think these buttons are so seldomly used so it won't be even noticable so of course we could change it. But I wonder if there may be other applications relying on the configuration events? So maybe we could make it optional to subscribe to them?

regarding (not) switching to poll on errors

This is actually the reason of the whole thread. At least for Sardana it will fix a lot of issues when starting/stopping/restarting servers for operation and for testing. I think @cpascual and @tiagocoutinho already agreed on this. @sergirubio what about you? Any other opinions? If you agree I could start working on a PR for this precise issue.

sergirubio commented 4 years ago

I agree, but I would like if the behaviour can be controlled using arguments or taurus options

(i mean, there are too many particular cases to say if it's always good or bad to switch or not, I would like to have the control over it)

Sergi

On 5/18/20 3:43 PM, reszelaz wrote:

regarding initialization

I agree that starting with polling and then switching to events will probably give more benefits.

regarding state attribute reference:

The creation of the state attribute is already delayed https://github.com/taurus-org/taurus/blob/9c31f0d3655987e7ef0963d2952595791a944d43/lib/taurus/core/tango/tangodevice.py#L418.

regarding not subscribing to configuration events

+1 for suggesting Tango to include unit in the attribute read value.

Now quickly thinking, in Sardana we use the attribute config (range) for software limits of moveables. The PMTV (GUI) is relying on the limits configuration with buttons to move to the limit. So not having the events would mean reading the configuration everytime we use these buttons. I think these buttons are so seldomly used so it won't be even noticable so of course we could change it. But I wonder if there may be other applications relying on the configuration events? So maybe we could make it optional to subscribe to them?

regarding (not) switching to poll on errors

This is actually the reason of the whole thread. At least for Sardana it will fix a lot of issues when starting/stopping/restarting servers for operation and for testing. I think @cpascual https://github.com/cpascual and @tiagocoutinho https://github.com/tiagocoutinho already agreed on this. @sergirubio https://github.com/sergirubio what about you? Any other opinions? If you agree I could start working on a PR for this precise issue.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/taurus-org/taurus/issues/1092#issuecomment-630191948, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEMQVKNANSCZFKURBIY5HCTRSE3QHANCNFSM4LPZTFAQ.

-- श्री