Firmware specification - Githubissues

sinara-hw / Booster

Modular 8-channel RF power amplifier

Other

16 stars 3 forks source link

Firmware specification #360

Closed hartytp closed 4 years ago

hartytp commented 4 years ago

In no particular order..

Ethernet interface for configuration, status read back and diagnostics. Not fussy about what kind of interface we have here so long as it's easy for users to interact with it e.g. from python scripts
Read MAC address from EEPROM
For each RF channel
- Detect channel presence
- Readout module unique ID from EEPROM
- Read calibration data from channel EEPROM
- Control of the RF switches
- Program FET bias voltage DAC
- Program interlock threshold DACs
- Monitor interlock status
- Reset interlock status (FP button or via ethernet)
- Enable/disable channels via FP switch or ethernet
- Monitor temperature
- Program LEDs
- Power sequencing of the various rails and RF sw on turn on/turn off
- Reverse power interlock (done in software)
- Sanity check currents in sensible range and raise ERROR if not (e.g. if a FET dies so the bias current goes to 0)
- Monitor input/output/reverse power levels
- Monitor currents via ADC
Fan control (low gain PI loop for temperature stabilization)
Temperature interlock
Readout of MAC address and basic diagnostics via UART
Flashing (firmware upgrade)
Watchdog timer (boot into error state if it trips)
2-point calibration of power detectors (IIRC separate calibration of interlock threshold not required since the DACs and compactor are sufficiently accurate, TBC), using the UART/ethernet. Read back of calibration data either via UART or ethernet
Bias current calibration: user states how much current they want and the calibration routine varies the FET bias voltage to achieve this, while taking care not to trip the LDO foldback limiting. This could (should) be done using a python script and the ethernet interface.
Programming static IP address via UART
Read back of:
- Channel/chassis temperature
- Input/output/reverse power
- Fan speed
- Channel IDs
- Channel enabled/disable
- Interlock set points
- Channel temperature, chassis temperature
- Channel currents
- Interlock status and error status
- Monitor for and report any I2C errors (we've had issues here so need to be vigilant in case there is a SI issue etc)
Misc commands:
- device restart via ethernet

jordens commented 4 years ago

That's what I had on my list:

support the devices on the I2C tree (mux, io expander, dacs, adc, eeproms, temperature sensors) monitor the alert and overdrive pins
set LEDs in response to alters/trip enable/disable
regularly (fixed schedule) sample the various ADC channels
set the DAC channels
control fan PWM, simple temperature control loop with proportional control (i2c temperature * sensors to pwm duty cycle)
control the overdrive/alert clear/enable/disable pins
store calibration settings in flash/eeproms
support the SPI ethernet integrated MAC+PHY chip (wiznet W5500, e.g. https://github.com/hazelutf8/w5500)
expose that functionality via mqtt
(maybe a simple text-based control functionality via usb-cdcacm)

A couple proposals:

I'd like to reduce surface a bit. Can we focus on a good network interface that scales and interoperates well with many (a dozen) devices across the different projects (Stabilizer, Thermostat, Humpback and ARTIQ+InfluxDB) and use the UART/USB port for the minimum required (read MAC, set IP, debug logging, firmware upgrade)? Among other reasons: multiplexing both a clean request-response protocol and e.g. asynchronous logging messages over the same UART is a bad idea as you have seen.
And I would also like to ditch SCPI or ad-hoc protocols in favor of MQTT. Is that OK? You'll need to run a broker and tell the devices (static, mdns, dhcp) where that is. While we develop this, there might still be an intermediate interface that passes the same MQTT payloads+topics as request-response pairs over bare TCP. But the goal should be to retire that.
Isn't the temperature control just P currently?

hartytp commented 4 years ago

I'd like to reduce surface a bit. Can we focus on a good network interface that scales and interoperates well with many (a dozen) devices across the different projects (Stabilizer, Thermostat, Humpback and ARTIQ+InfluxDB) and use the UART/USB port for the minimum required (read MAC, set IP, debug logging, firmware upgrade)

Yes, I agree. That is my aim here. Clearly the current situation of having a poorly documented USB interface that kind of duplicates a sub-set of the SCIP functionality isn't beneficial. I'm happy to keep the USB functionality to the absolute minimum.

NB it would be good to have a quick chat with TS/CTI during the firmware development to ensure that the calibration process and diagnostics we implement works well for them.

And I would also like to ditch SCPI or ad-hoc protocols in favor of MQTT. Is that OK? You'll need to run a broker and tell the devices (static, mdns, dhcp) where that is. While we develop this, there might still be an intermediate interface that passes the same MQTT payloads+topics as request-response pairs over bare TCP. But the goal should be to retire that.

I don't have a strong opinion about the interface as I'll exclusively access through the python driver. Anything that does the job without costing excessive development time is fine by me.

Isn't the temperature control just P currently?

Maybe. I've lost track and don't trust the documentation.

On second thoughts, asking for PI was over-specifying things. The level of thermal management in Booster is way beyond what's necessary IME. Even with the fans running around 10%, nothing gets particularly hots. When I stuck a hot air gun at the air inlet the change in gain was pretty minimal.

I don't want to get into stabilizing control loops to keep the temperature stability tight for no reason. The real item here is more "do something sane with the fans".

That's what I had on my list as well.

I'm sure I've missed a few small things, but I think that's the bulk of it.

jordens commented 4 years ago

Consolidated implementation plan:

Support for the devices on the I2C tree (mux, IO expander, DACs, ADCs, EEPROMs, temperature sensors). Includes HAL driver layers where necessary.
Get EUI48 from channel EEPROMs, thereby detect channel presence
Regularly (fixed schedule) sample the various ADC and temperature channels and alert/overdrive/enable/clear signals and buttons.
Get/clear Derive reverse power and temperature software interlock flags from measurements.
Get and set calibration values (ADC conversion, DAC settings, thresholds), store in flash/EEPROMs and load on boot.
Sequencing of the various power rails, RF switch, trip clear, enable/disable pins on turn on/turn off
Get/set the DAC channels (FET bias and interlock threshold)
Get/set boot-up channel state (default being disabled/tripped state)
Couple LEDs to trip and enable/disable status flags.
Control fan PWM, proportional control, (i2c temperature sensors to pwm duty cycle), get PWM ratio and temperature
Support the SPI Ethernet integrated MAC+PHY chip (wiznet W5500). MAC address from EEPROM.
Expose get/set functionality above via network interface (JSON-over-TCP or MQTT), including Python implementation.
Minimal text console via USB-CDC for get MAC address, set/get IP address, logging messages, no particular attempt to be machine-parsable and -interfacable
Perform basic self-tests (channel presence, consistency of bias currents/voltages w.r.t. enable/alert signals, values in range)
Bias current calibration: user states how much current they want and the calibration routine varies the FET bias voltage DAC to achieve this, while taking care not to trip the LDO foldback limiting. To be done using a python script and the ethernet interface.
CPU watchdog with reset
Support for "boot-into-DFU" via console (unattended remote firmware update without ethernet and jtag)
Reset via ethernet and console
Flashing/DFU procedures and scripts

hartytp commented 4 years ago

Always boot into disabled/tripped state

I generally prefer to boot into an enabled state. The interlocks should provide enough robustness to ensure that we don't need to boot into a disabled state and it causes confusion when someone restarts the device and forgets to enable all channels.

jordens commented 4 years ago

That's very untypical and risky for a reset behavior. If you boot into enabled state then, if the watchdog resets frequently (as it typically does when there is a bug), you will be enabled most of the time independent of the hardware danger state and might not be reacting to the power and thermal interlocks. All software interlocks would be leaky and non-latching. This is why after a reset you want to get into a very safe state very quickly. If resetting the device implicitly enables all channels (also if they were disabled before for whatever reason), the watchdog does not lower the risk but increases it. Problems would be hidden.

From experience there are lots of sources of the type of confusion you mention where lack of knowledge about combined with lack of automated alterting/verification of the state of the setup send the user into a long and undefined bug hunt. Always better to automate that verification. And then this issue resolves itself and the interlocks can be fully latching.

jordens commented 4 years ago

But it's not hard to make the reset state a configurable option with the default being off.

hartytp commented 4 years ago

I agree that after a watchdog we should boot into a state with all channels off.

But it's not hard to make the reset state a configurable option with the default being off.

That would be a fine solution for me.

jordens commented 4 years ago

I agree that after a watchdog we should boot into a state with all channels off.

The same arguments would for a power outage/brown out reset or accidental/unintended restart of the device. We can't distinguish those.

gkasprow commented 4 years ago

the CPU knows that power outage occurs - the power supply of the CPU disappears as the last one. I added a large 4.7mF capacitor to make sure the P5V0 and P3V3 rail will last for a few seconds after a power outage. So just observe the PGOOD line, once it is low, you have a time to store the status.

gkasprow commented 4 years ago

There is yet another PGOOD_N7V5 which will disappear once P12V0 drops below a few V.

jordens commented 4 years ago

You seem to be suggesting to save those power supply observations in flash during each power off so they can be used at boot to decide about channel state. That's tricky for a couple reasons. And it doesn't distinguish between power outages and power switches either.

gkasprow commented 4 years ago

We have EEPROM as well :) I'm just showing possibilities.

dnadlinger commented 4 years ago

Why would you need to distinguish between power outages and the user toggling the power switch? You just want to make sure the hardware is always in a well-defined state when the power comes back on.

From a user's perspective, I would be really keen to make this as close to a bunch of MiniCircuits amps on a DC power supply as possible. Sure, the user-defined interlocks will always introduce some state by definition, but as long as the hardware isn't broken, I really don't want to care about how the amplifier works internally (I do know much more about it than I'd like), and power-cycling should get rid of any state as long as the hardware isn't known to be bad.

dnadlinger commented 4 years ago

(I do realise that there is a spectrum of failure modes from clearly benign to clearly critical, but the point I'm trying to make is that an RF amplifier really ought to be a conceptually simple device. Sure, there can be MQTT log streaming and all sorts of other fancy stuff for when you need to manage many dozens of channels – and indeed, I'd probably use those functions in my own lab –, but if a new user can't just treat it is as an RF amp with a power button and input/output connectors, the system design went wrong somewhere.)

jordens commented 4 years ago

It's not just about well-defined states. It's about safe states. If the interlock trips, you need to ensure that it does not clear unless it is explicitly acknowledged. Something is wrong and needs to be fixed before it can be attempted again. Otherwise it's not an interlock. You wouldn't want an RCD or fuse (both conceptually very simple devices) trip in your house to be cleared by a power outage. A watchdog reset, a power outage, or an unrelated restart of the device should never clear an interlock because these events do not remove the cause of the trip. If you say that's not required and a distraction, then the interlock is not required either and it can be replaced with a simple timer/limiter that turns of/reduces the channel for a minute until the power/temperature are ok again. You'd end up with a flapping channel but you would never have to do anything but wait. That's a very different level of danger, safety, and protection.

dnadlinger commented 4 years ago

Agreed, but the backup capacitor should be sufficient that you can always persist those error states, and in the absence of such flags, re-enable channels that were previously on. In fact, your fuse box analogy is spot-on – as a user, I'd like the device to behave exactly like a series of mechanical circuit breakers, one per channel. I can use them to manually switch off a channel; it then never switches on automatically. If it breaker trips, it also remains switched off until manually flipped back. But conversely, channels that were switched on should also remain on.

jordens commented 4 years ago

Right. And unless we have this emergency saving of the interlock/channel states (you may not want to save on each channel enable/disable due to endurance limits, or need some kind of rate limiting or a file system, you may have to deal with slow page/sector erases during the power off, you'll need to test it for brown outs and bouncy power switches, it'll need to be bullet-proof w.r.t. the watchdog and other software bugs interfering) the remaining and simple option is to boot into safe state. Let's do the simple and safe thing first. Why should it be the other way around? Take your lasers, bigger power supplies, or a lot of workshop machinery as another example. If you power cycle them, they will be safe. For a good reason. And it doesn't confuse people.

dnadlinger commented 4 years ago

Why should it be the other way around?

How is conditioning the user to habitually press the "enable" button after power-on different from doing so automatically? If one is a safety issue, then so is the other, and you need to solve the state persistence issue anyway. Not including the channel-enable state among the bits saved just degrades the user experience for no good reason.

Take your lasers, bigger power supplies, or a lot of workshop machinery as another example. If you power cycle them, they will be safe. For a good reason. And it doesn't confuse people.

As mentioned above, this is besides the point, but still: Not a good comparison. Lathes maim people. A low-power RF amplifier doesn't, and even then, Booster has hardware current limits to prevent catastrophic channel failure.

hartytp commented 4 years ago

It's not just about well-defined states. It's about safe states. If the interlock trips, you need to ensure that it does not clear unless it is explicitly acknowledged. Something is wrong and needs to be fixed before it can be attempted again. Otherwise it's not an interlock...

If you say that's not required and a distraction, then the interlock is not required either and it can be replaced with a simple timer/limiter that turns of/reduces the channel for a minute until the power/temperature are ok again. You'd end up with a flapping channel but you would never have to do anything but wait. That's a very different level of danger, safety, and protection.

In concrete terms, what situation are we worried about here? An interlock tripping and then a brown-out of the building power occurring before anyone realizes the interlock had tripped? Presumably this isn't an issue with watchdogs since we already agreed (I think) that after a watchdog reset the system should boot into an error state that needs clearing e.g. by resetting the system (as the current firmware does).

Interlock trip rarely and, when they do, they generally get spotted quickly -- either by an automated system (whether a logger that polls Boosters or, say, an SU-servo starts clipping) or just the fact that a beam-line having no power causes obvious symptoms on the experiment -- so the window for this is small. Since brown-outs are also rare, missing an interlock due to a brown-out is a very rare (second-order) event.

Since Booster nominally guarantees no damage to itself or the load once the interlocks are set, interlocks tripping are generally not a dangerous fault, but rather a small issue somewhere (overly aggressive interlock combined with frequency-dependent gain/VSWR; mistake in device_db/Urukul attenuator; etc). So, it's something you want to know about, but if you miss it once in a blue moon it's not the end of the world; you'll get another chance soon enough.

hartytp commented 4 years ago

Take your lasers, bigger power supplies, or a lot of workshop machinery as another example. If you power cycle them, they will be safe. For a good reason. And it doesn't confuse people.

I.e. the critical difference here is that if a milling machine turns on unexpectedly it could kill someone. I don't see a situation where Booster booting with all channels on could hurt someone or cause damage. The only exception I can think of being if someone: set the interlock to a high value (e.g. from a previous experiment); manually disabled all channels; connected a fragile load that can't take the set max power; connected an active RF source to the input with a power set beyond the load damage threshold; suddenly there was a brown-out which caused the switch to open and damage the load (without also disabling the RF source). But, that seems like a pretty contrived set of circumstances and if people are doing that kind of thing they're probably breaking plenty anyway...

hartytp commented 4 years ago

Booster has hardware current limits to prevent catastrophic channel failure.

Other than a bug in the calibration code, I can't think of any situations where the FET currents could limit without there already having been an at least moderately catastrophic failure.

dnadlinger commented 4 years ago

Other than a bug in the calibration code, I can't think of any situations where the FET currents could limit without there already having been an at least moderately catastrophic failure.

I meant the smoke-and-flames kind of catastrophic, since Robert was bringing safety into the discussion.

hartytp commented 4 years ago

Aah, yes! AFAICT either we can trust the guarantee that (assuming interlocks have been appropriately set) Booster cannot damage itself or the load there isn't much to worry about here. If we can't trust that then presumably it's due to a catastrophic failure of the hardware and then all bets are off...

jordens commented 4 years ago

How is conditioning the user to habitually press the "enable" button after power-on different from doing so automatically? If one is a safety issue, then so is the other, and you need to solve the state persistence issue anyway. Not including the channel-enable state among the bits saved just degrades the user experience for no good reason.

That line of argument would invalidate any design that starts in a safe state. The crucial thing about acknowledging a tripped interlock is that it can't be accidental and implicit by design. It needs to be a hurdle. If the user automates the trip acknowledgement explicitly (thereby taking responsibility) that's fine. Whether they do it via ethernet, by shorting interlocks on lasers and power supplies or by going through the list of lasers to blindly punch "on" each morning doesn't matter from a safety perspective.

Other than the calibration parameters there is no persistent state in the proposed implementation.

As mentioned above, this is besides the point, but still: Not a good comparison. Lathes maim people. A low-power RF amplifier doesn't, and even then, Booster has hardware current limits to prevent catastrophic channel failure.

Why should only devices that can maim people start in an off state? I don't see the logic in saying that if a device can only damage itself or other devices then prioritize user experience over safety.

Presumably this isn't an issue with watchdogs since we already agreed (I think) that after a watchdog reset the system should boot into an error state that needs clearing e.g. by resetting the system (as the current firmware does).

No. We didn't agree that these features would be implemented. As I described explicitly above, the current list does not automatically save channel state. For the reasons see above.

Interlock trip rarely and, when they do, they generally get spotted quickly

I suspect this assumption is already proven wrong right after construction when testing the device. A proper PTS should test that the interlock operates repeatedly and in rapid succession. Didn't your stress testing scripts do that as well?

I don't see a situation where Booster booting with all channels on could hurt someone or cause damage.

That is not the situation I described. Channel state is not saved, There is a bug somewhere or hardware issue that repeatedly trips the watchdog. And there is a channel with an unconnected output but an input signal. In that case the unsafe channel would be mostly on. If that can't cause damage then the entire reverse power interlock appears pointless. This is not a contrived situation. It's the feature set as described, watchdog loops are guaranteed to occur at some point, and that channel configuration is precisely the main use case of the interlock.

Booster has hardware current limits to prevent catastrophic channel failure.

If the software interlocks don't protect against failure or damage, what do they protect against? If they do protect against failure or damage, shouldn't they do it all the time including in the situation described above?

dnadlinger commented 4 years ago

The crucial thing about acknowledging a tripped interlock is that it can't be accidental and implicit by design.

Other than the calibration parameters there is no persistent state in the proposed implementation.

These two statements are not compatible. If a tripped interlock can be reset by a power cut, then it can be reset accidentally.

Whether channels are enabled automatically or not after reboot is irrelevant for this (i.e., as long as the interlock tripped state is not persisted).

To put it differently:

If the user automates the trip acknowledgement explicitly (thereby taking responsibility) that's fine.

Unless you persist interlock state across interruptions, that's not a thing. You can't acknowledge something if you don't know it happened.

Why should only devices that can maim people start in an off state? […] prioritize user experience over safety.

You appeared to be referring to safety as in workplace Health and Safety when mentioning machine shop tools and lasers in the above message, but quite apparently that is not the definition you were going for. What is it?

dnadlinger commented 4 years ago

That is not the situation I described. Channel state is not saved, There is a bug somewhere or hardware issue that repeatedly trips the watchdog. And there is a channel with an unconnected output but an input signal. […]

Note that the interlock state is latched in hardware until ON_OFF is toggled. If you read that back on boot (possibly caused by watchdog reset), you should be able to detect that scenario. Conversely, if you assume the CPU+firmware is messed up to the point where ON_OFF is toggling randomly, then you really need a hardware rate limit to avoid resetting interlocks too often; no piece of source code is going to be able to save you then.

that channel configuration [|S_22| = 1] is precisely the main use case of the interlock

Somewhat tangential to this discussion, the main use case of the interlock subsystem for us is to limit the output power. Not having to care about reverse power is certainly convenient during setup, but can easily be avoided, whereas Urukul attenuator setting bugs/… that would overload the device connected to the Booster output can't. (I don't think I've ever seen a reverse power interlock trip in my lab.)

jordens commented 4 years ago

These two statements are not compatible. If a tripped interlock can be reset by a power cut, then it can be reset accidentally. Whether channels are enabled automatically or not after reboot is irrelevant for this (i.e., as long as the interlock tripped state is not persisted).

They are compatible. Just do the sane thing and start with the channels disabled by default. That's the same state as if there had been an interlock condition. In any case it's the state that's compatible with the no-accidental-clear and no-persistent-state conditions.

Unless you persist interlock state across interruptions, that's not a thing. You can't acknowledge something if you don't know it happened.

Sure you can acknowledge it. You'd do so by clearing the error and enabling the channel. There was never the idea of acknowledging anything specific or individually. There is no finer granularity than acknowledging all interlock reasons at once. Maybe it helps you to consider the boot an interlock reason.

but quite apparently that is not the definition you were going for. What is it?

If you need one, take "the condition of being protected from harm or other non-desirable outcomes". It's not limited to maiming people. An interlock is a safety mechanism.

If you read that back on boot (possibly caused by watchdog reset), you should be able to detect that scenario.

How would that ever work? You'd have to leave ON_OFF high during reset to not clear the FFs. But it's pulled low.

And the reverse power interlock or the overtemp interlocks are not latched in hardware.

Conversely, if you assume the CPU+firmware is messed up to the point where ON_OFF is toggling randomly, then you really need a hardware rate limit to avoid resetting interlocks too often; no piece of source code is going to be able to save you then.

I'm not looking for a solution that covers all cases. But one that covers those that I've seen myself and expect to see again is important. And it's preferable over a solution that prioritizes user experience over any safety.

the main use case of the interlock subsystem for us is to limit the output power.

I was explicitly referring to the reverse power interlock.

Not having to care about reverse power is certainly convenient during setup, but can easily be avoided, whereas Urukul attenuator setting bugs/… that would overload the device connected to the Booster output can't.

There are a couple of implications in there that bother me. It looks like Booster is way to powerful for your use cases. For AOMs and EOMs you usually have plenty of room to comfortably place the maximum output power between the efficiency roll-over and the damage threshold. You say unconnected outputs are easily avoided. But so is too much input power or output power. Connect an attenuator. If you dislike that, so do others dislike having to ensure all outputs are always terminated.

(I don't think I've ever seen a reverse power interlock trip in my lab.)

I've seen enough live power amps becoming disconnected at the output. Are you distinguishing and recording the interlock cause? What about your stress testing or testing during production? Is the reverse power interlock never tested?

dnadlinger commented 4 years ago

[…] start with the channels disabled by default. That's the same state as if there had been an interlock condition.

No, it isn't. If the interlocks trip in normal operation, the respective alert LEDs/remote interface flags go on, alerting the user that something goes wrong. If that does not persist across restarts, the states aren't the same.

Again, without persistence, where is the difference between enabling the (previously enabled) channels some few seconds after bootup, and having the user press a button to do so?

To put it differently:

Maybe it helps you to consider the boot an interlock reason.

If we do that, interlocks asking to be reset is normal behaviour. Consequently, we'll need to make sure that there can't be negative consequences to the user trying to reset the interlocks, as there is no way for them to distinguish normal power-on from the abnormal cases. But then, it is also safe to just do that automatically.

How would that ever work? You'd have to leave ON_OFF high during reset to not clear the FFs. But it's pulled low.

You may be right; I didn't check very carefully whether it would be possible to implement this in hardware only as-is, as I'm not very interested in that approach.

Stepping back a bit from quibbling over semantics, what precisely is the scenario you are worried about here?

Earlier, you mention a watchdog loop. This requires is either a software bug or a severe hardware issue. This is significant, as we have already left the land of clear-cut logic and are discussing probabilities and mitigation of certain failure scenarios – attempts at the former would necessarily fall short at that point; ex falso quodlibet.

For the loop to be relevant to the discussion about channel enabling behaviour, the reset would need to occur at some point after Booster has booted again and the channels have been enabled again – and thus the unrelated interlock-trip condition has occurred. However, in a model where interlock state is persisted, it also needs to occur before the firmware had a chance to save the affected channel's interlock status (I think many STM32F4xxxs have backup registers, if worried about storage durability).

Now, IIRC, detecting watchdog resets is possible by reading an RCC register on boot. Thus, there needs to be an additional bug in that code, such that Booster continues to boot normally despite the watchdog failure.

Furthermore, the nature of the error must be such that repeated attempts at enabling the channel after all the above actually cause permanent damage. Let's assume Booster runs unsupervised for a weekend and the cycle of reboots and the reverse power interlock tripping on a channel repeats every 5 s. Is it plausible for the resulting ~30 k cycles to cause hardware damage in a scenario where just a few interlock trips don't? Possibly, but it further diminishes the likelihood of this being an issue.

All in all, is this scenario really that much more of a concern than any other number of theoretically possible, but practically rare scenarios – I don't know, maybe the output FET and coupling capacitor shorting, and Booster consequently dumping Vdd into an unsuspecting 50 Ω load? (Edit: Okay, there are actually two AC coupling capacitors between output stage and output connector, so that's not the best example, but you get the point – there are a large number of possible but extremely unlikely failure modes.)

I was explicitly referring to the reverse power interlock. […] There are a couple of implications in there that bother me. [snip]

None of this seems relevant. I just wanted to make sure you are aware of how usage patterns in a (the) real-world deployment look like. Yes, it is just one use case that isn't intrinsically guaranteed to be representative of the majority of them (although it probably isn't far off from most ion trap university research), but with gathering requirements often being the hardest part of software design, this should be pertinent information.

The only reason I can even be bothered to type all this out is that I'd really like to avoid fragility and unforced complexity in the systems design. A design that fails silently and then also forgets its configuration (that is, just boots into all channels disabled on error), is one instance of that, as it just pushes the responsibility for managing that state onto the user.

Our experiment consists of hundreds of different gadgets, and things like unplanned power cuts do sometimes happen. I'd really like to avoid adding more devices that react to this by coming back into a state with silently changed configuration; the ones we are stuck with are already annoying enough.

jordens commented 4 years ago

as I'm not very interested in that approach.

But you proposed it. If the proposal doesn't hold water, was it just to distract?

Stepping back a bit from quibbling over semantics, what precisely is the scenario you are worried about here?

We'll start with safe design that minimizes the chances of destroying channels already during development and testing. I know with comfortable certainty that the firmware development will bring the hardware into dangerous states. I want to reduce those as much as possible. Any suggestion of what could or might be implemented to unload the user and automate further (backup registers, emergency saving of state during power supply cuts, etc) is not something I'm inclined to look at right now because it needlessly increases complexity and risk at the wrong time. Please request a quote to implement that later.

This requires is either a software bug or a severe hardware issue. This is significant, as we have already left the land of clear-cut logic and are discussing probabilities and mitigation of certain failure scenarios – attempts at the former would necessarily fall short at that point; ex falso quodlibet.

I'm unsure what "land of clear-cut logic" we have left. Do you feel the scenario of destroying hardware during development and testing is irrational and speculative? I have trouble parsing your statement. Are you saying that it is not worth trying to limit the effect of potential software bugs by conscious design decisions about software behavior? Sure, it's never going to be a complete solution (as it is for hardware issues). But if you have a choice between exactly two behaviors, then why not start with the safer one? "Because it means more load on the user" is surely not the guiding metric. I'm lost about the contradiction that you claim to have identified. Please explain.

Is it plausible for the resulting ~30 k cycles to cause hardware damage in a scenario where just a few interlock trips don't? Possibly, but it further diminishes the likelihood of this being an issue.

I seem to remember that such a "possibly plausible" issue with "diminished likelihood" came up promptly during your stress testing already. From that past data it seems prudent to expect similar issues again.

dumping Vdd into an unsuspecting 50 Ω load?

Note that the coupler is "DC short to GND". Probably another red herring. You might say you're not interested in this scenario. Yet you brought it up and took double care to proof it.

Sure there are lots of other risks. It's necessary to go through them and evaluate/address them all. That's a fact of system design. How does that matter?

None of this seems relevant.

It is relevant to me. You can belittle this all you want. In the end it's not you who is responsible. Though if you prefer, we can change that.

I just wanted to make sure you are aware of how usage patterns in a (the) real-world deployment look like.

Having spent more than a decade with such real-world deployments in various different labs permits me to develop a view. But I'm always keen on seeing a wider spectrum of use cases and in the couple years in the lab you have probably developped some. If you know your requirements as well as you claim, and if you think that nobody else shares that knowledge, you need to bring it in much earlier.

The only reason I can even be bothered to type all this out [..]

Nobody forces you. You seem to just have missed your chance to engage and spell out what you want early enough. This is a problem.

Why do you think I am here and still discussing this with you? I'm happy to just shelve it now and meet again at the other end to see whether the firmware meets the agreed requirements. Certainly less trouble for me.

A design that fails silently and then also forgets its configuration

I don't get that. It doesn't fail silently but with multiple indications: the LEDs are on, the channels off. And it doesn't forget its configuration: channel on/off is not "configuration" (i.e. persistent). With that agreed feature set, we have to choose between all-on or all-off at boot. Given what I have seen in the past and with the information I have currently, the safe decision is all-off. If you know that there is no added risk in booting into all-on and willing to take the responsibility, then we can remove that fail-safe.

unplanned power cuts

If you take that seriously and prepare for them with any level of thoroughness, you'll have a written procedure in place. Then checking things and pushing buttons on devices to acknowledge that is what you are already doing. Sure you might be able to automate things to reduce downtime, but compared to the other items in the recovery procedure (in "real-world deployments" with lasers, computers, and ions in traps), this will have a comparably low impact.

silently changed configuration

That's inaccurate. Channel on/off state is not part of the persistent configuration. Note that the proposal to boot into all-on does not persist the channel state either and by that argument is at least as undesirable.

dnadlinger commented 4 years ago

This is clearly getting nowhere. I seem to have made the wrong assumptions on the state of the project/purpose of this issue last month: If your comment above was merely to share an implementation plan that is already set in stone, then yes, of course any expectations that you ought to be interested in user feedback would be misplaced. (For context, I e.g. wouldn't have been aware of contracts already signed at that point, or what the "agreed feature set" you mentioned is.)

For my part, I'm done here, and will likely continue to use – and encourage others to use (and improve) – the existing, open-source firmware. That being said, if someone else with a substantial deployment has time to evaluate the rewrite and it ends up being more useful overall, I'll certainly be happy to use it once it's available. (I'm sure it is going to end up much cleaner implementation-wise, which I'm certainly looking forward to!)

Just to clear up a few things, though:

It is relevant to me. You can belittle this all you want.

I wasn't belittling you or your priorities in any way. What irritated me is that you responded to information about a target use case (forward power interlocks trip more often here in our lab – reasons for which I'd be happy to provide) by pointing out a different requirement (reverse power interlocks should protect the hardware reliably). The latter statement is true, but not relevant for the truthfulness of the former. That's all.

I seem to remember that such a "possibly plausible" issue with "diminished likelihood" came up promptly during your stress testing already.

FYI, what came up so far were clear hardware issues (e.g. switch transients exceeding voltage specs) that didn't need many cycles to manifest.

That's inaccurate. Channel on/off state is not part of the persistent configuration. […]

I'm really at a loss as to how you would think this to be a helpful comment, at least if the desired outcome is a productive design discussion. Clearly, I was arguing that the on/off state should persist.

hartytp commented 4 years ago

being developed at quartiq/booster