sinara-hw / Booster

Modular 8-channel RF power amplifier

NACKs observed while communicating with fan controllers #383

Open ryan-summers opened 3 years ago

ryan-summers commented 3 years ago

Booster has been observed to sporadically indicate NACKs from the fan controllers. Reference https://github.com/quartiq/booster/issues/140 and https://github.com/sinara-hw/Booster/issues/379

hartytp commented 3 years ago

@gkasprow this seems to be the last reliability issue with Booster.

The issue seems to be (using the current quartiq firmware as a reference implementation) that the fan controllers give infrequent NACK errors (typically only about once every few days) when operating at a 100 kHz bus frequency.

I don't see anything obviously wrong in the design, but could you quickly stick a scope on them and verify the timing/voltage characteristics please? It might be that something is a little marginal...

image

gkasprow commented 3 years ago

I observe similar issues in another project. I have two Texas Instruments chips on the same bus: one is a battery charger, the other is a BMS. I decided to separate the buses because I observed errors on the bus from time to time. That could be the issue here as well. It seems that SMBus is not fully compatible with I2C.

gkasprow commented 3 years ago

I looked at the signals with the scope and see nothing suspicious. What if we move the fan control SMBus to a dedicated I2C controller? We have a free I2C controller available on pins PA8 and PC9. PC9 is used by an onboard LED, but we don't have to use that LED.

gkasprow commented 3 years ago

@jordens how do you use the I2C peripherals in STM32? Are you toggling the I2C controller mode between I2C and SMBus before talking to the ICs?

hartytp commented 3 years ago

@gkasprow are you happy that none of the timings / voltage levels look marginal?

gkasprow commented 3 years ago

I tried with the open-source firmware. I tried to convert the bin file from Quartiq's release to DFU, but for some reason it doesn't work. Do you have a working DFU file?

jordens commented 3 years ago

We don't switch to SMBus since we don't use or need any of the SMBus features in host mode that would be available at the peripheral level. And the rest is a subset of I2C. That's the same as for the old firmware. The command to do a DFU upload is dfu-util -a 0 -s 0x08000000:leave --download booster.bin. I don't know how to generate other DFU file formats. The only information required other than the bin is where to write it: 0x08000000.

gkasprow commented 3 years ago

> The only information required other than the bin is where to write it: 0x08000000.

Thanks, that was what I did wrong :) Now it works.

gkasprow commented 3 years ago

The signals look OK. The edges are fast and the setup/hold times are met. (scope capture: tek00021)

ryan-summers commented 3 years ago

It looks like there are some spurious pulses present, although they appear right at the falling edge, so I suspect this is just transition delay. However, I'm still somewhat surprised to see them.

jordens commented 3 years ago

Those are ACKs

gkasprow commented 2 years ago

Michal spent a lot of time with an I2C debugger and a scope trying to catch these NACKs. It looks like they appear only in the STM32 core but not on the I2C bus. Maybe we should consider replacing the fan controllers with Microchip ones?

gkasprow commented 2 years ago

What we can do is move the fan control to the dedicated I2C bus that is used only by the EEPROM. What do you think, @jordens? We can change HWREV so the CPU knows about it.

gkasprow commented 2 years ago

Since neither I nor Michal managed to catch any of these NACKs using the regular firmware, I'd need a small modification to the firmware that would trigger the scope once it detects a NACK. I have a scope with a very long memory and will be able to decode all the transfers that appeared before it. I'd like to look at it before we release the new HW.

jordens commented 2 years ago

@ryan-summers Maybe you can help and spin that special firmware.

jordens commented 2 years ago

The NACKs are rare. It's possible that you don't see them for many days. It's best to provoke them explicitly by continuously running transfers on that chip, like in https://github.com/quartiq/booster/issues/140#issuecomment-916770336 etc. And I don't think the old firmware does a lot of traffic. Changing the hardware should be done once we know reasonably well what the problem is. But yes, if it's I2C SI (signal integrity) then reorganize the buses.

gkasprow commented 2 years ago

So can you prepare firmware that generates a lot of traffic on these chips and triggers the scope when a NACK occurs?

ryan-summers commented 2 years ago

@gkasprow I've developed a version of the 0.3 release candidate that will toggle the mainboard LEDs (LD9-LD11) on for a very brief moment after the NACK is observed. The LEDs will be de-asserted as soon as the transaction succeeds on the re-attempt.

Let me know directly on that PR if you need different behavior for your capture and I'll get it set up.

Edit: The PR is https://github.com/quartiq/booster/pull/202 and the branch is rs/led-toggle-on-bus-nack
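
For anyone following along, the behaviour described above (assert a trigger when a NACK is seen, de-assert it once the retry succeeds) can be sketched roughly as below. This is a host-side Python illustration only, not the firmware from the PR. `Nack`, `ScriptedBus`, `read_with_trigger`, and the address 0x2C are hypothetical names and values chosen for the example, and the `set_trigger` callback stands in for driving the LEDs.

```python
class Nack(Exception):
    """Raised by the (hypothetical) bus when a device does not acknowledge."""


class ScriptedBus:
    """Stand-in for a real I2C driver: NACKs on the listed transaction numbers."""

    def __init__(self, nack_on):
        self.nack_on = set(nack_on)
        self.count = 0  # number of transactions attempted so far

    def read(self, address, register):
        self.count += 1
        if self.count in self.nack_on:
            raise Nack(f"transaction {self.count}")
        return 0x00  # dummy register value


def read_with_trigger(bus, set_trigger, address, register, max_attempts=3):
    """Read a register, asserting the trigger on a NACK until a retry succeeds."""
    asserted = False
    for _ in range(max_attempts):
        try:
            value = bus.read(address, register)
        except Nack:
            if not asserted:
                set_trigger(True)  # scope trigger / LED on
                asserted = True
            continue
        if asserted:
            set_trigger(False)  # retry succeeded: trigger off again
        return value
    raise Nack(f"no ACK after {max_attempts} attempts")


# Keep traffic running continuously; the third transaction is scripted to NACK.
events = []
bus = ScriptedBus(nack_on={3})
for _ in range(5):
    read_with_trigger(bus, events.append, address=0x2C, register=0x00)
print(events, bus.count)
```

In this sketch the trigger stays asserted only for the duration of the failed attempt plus the retry, which is what makes it usable as a scope trigger with a long pre-trigger memory.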

gkasprow commented 2 years ago

Thanks, I'll look at that once I get my Booster back. How do I compile that release? Or can you generate the DFU file for me?

jordens commented 2 years ago

The actions run should have binaries and there are instructions on how to use them.

jordens commented 2 years ago

Ah. It doesn't. But see that PR for them.

ryan-summers commented 2 years ago

I've just updated that branch to upload the DFU file. Should be available at https://github.com/quartiq/booster/actions/runs/1812387122 now.

gkasprow commented 2 years ago

@jordens I'm preparing a new HW revision. What about adding the option of routing a dedicated I2C bus to the fan controller? I will add 0R resistors so one can always revert to the original solution. I can also add Microchip fan controllers as an assembly option to play with them and possibly mitigate chip shortages.

jordens commented 2 years ago

OK. But populate the old I2C bus connectivity by default, not the new one. Then modify a device and test the modified firmware.

gkasprow commented 2 years ago

ACK. @jordens Since we are using only 5 fans and an external temperature sensor, what about adding an EMC2305 or a MAX31785? I'd add it as an assembly option to test.

jordens commented 2 years ago

Both sound OK to me on a quick glance.

gkasprow commented 2 years ago

Let's go for the EMC2305. It's simpler, cheaper, and from a different vendor, so it has a different I2C IP core in it :)

gkasprow commented 2 years ago

We could theoretically solder both fan controllers and enable one of them in software :)

jordens commented 2 years ago

@gkasprow any results from testing that firmware and debugging the NAKs?

gkasprow commented 2 years ago

@michalgaska was testing it with two channels installed. Any conclusion?

gkasprow commented 2 years ago

Michal says he is able to catch the issues. He will post diagrams.

michalgaska commented 2 years ago

I2C before and after NAK: (image: IMG20220314112442)

gkasprow commented 2 years ago

Please dump the raw data here (10M points). The image doesn't tell us anything.

michalgaska commented 2 years ago

Here is the link to the raw data in .CSV: https://drive.google.com/file/d/1cZY_hTal67UYRY69qNyjCySIduJB1WnA/view?usp=sharing

ryan-summers commented 2 years ago

I imported the data into Sigrok's PulseView for analysis:

sigrok-cli -I csv:header=yes:start_line=15:column_formats=t,a,a,a,*- -i .\tek0002ALL.csv -o all.sr
// Then, open all.sr in PulseView

It looks like there's a ton of noise on the I2C clock line in the raw CSV data: (image)

Closer view on the SCL line without overlays: (image)

Do these artifacts also appear on the actual oscilloscope display? If so, that definitely is a cause for concern.
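
One way to look for this kind of artifact independently of PulseView is to scan the sampled SCL voltage for pulses that are far shorter than a clock half-period. The sketch below is only an illustration under stated assumptions: `find_short_pulses`, the hysteresis thresholds, and the minimum width are hypothetical choices, and the data is a synthetic 100 kHz clock rather than the capture linked above.

```python
def find_short_pulses(times, volts, v_low=1.0, v_high=2.3, min_width=1e-6):
    """Return (start_time, width) for each high or low pulse shorter than min_width.

    The thresholds and min_width are illustrative assumptions, not datasheet values.
    """
    # Digitise with hysteresis so slow edges do not count as several transitions.
    state = volts[0] > (v_low + v_high) / 2
    edges = []
    for t, v in zip(times, volts):
        if not state and v >= v_high:
            state = True
            edges.append(t)
        elif state and v <= v_low:
            state = False
            edges.append(t)
    # Two consecutive edges closer together than min_width bound a short pulse.
    return [(t0, t1 - t0) for t0, t1 in zip(edges, edges[1:]) if t1 - t0 < min_width]


# Synthetic 100 kHz clock sampled every 100 ns, with a 200 ns dropout injected
# into one high phase. This is made-up data, not the capture from this thread.
dt = 100e-9
times = [i * dt for i in range(1000)]
volts = [3.3 if (i % 100) < 50 else 0.0 for i in range(1000)]
volts[520] = volts[521] = 0.0

glitches = find_short_pulses(times, volts)
for start, width in glitches:
    print(f"short pulse at {start * 1e6:.2f} us, width {width * 1e9:.0f} ns")
```

Running the same function over the time and SCL columns of the real CSV would show whether the noise is present in the sample data itself or only appears after import.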

ryan-summers commented 2 years ago

image image

An additional comment here is that the MAX6639 appears to require a minimum clock high period of 4.7 µs, but the current capture appears to indicate that we may be violating that with a high period of only 4.5 µs.

We may benefit by slightly decreasing the I2C bus speed in firmware as a result.
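
As a rough back-of-the-envelope check of how much slower the bus would need to be, the sketch below assumes that the SCL high time scales in proportion to the bus period. That proportionality is an assumption about how the I2C timing is configured, not something verified here, and `max_bus_frequency_hz` is a hypothetical helper. The 4.5 µs, 4.7 µs, and 100 kHz figures are the ones quoted above.

```python
def max_bus_frequency_hz(measured_high_s, bus_frequency_hz, required_high_s):
    """Highest bus frequency that still meets required_high_s, assuming the
    SCL high time scales in proportion to the bus period."""
    # Share of each clock period that SCL spends high at the measured setting.
    high_fraction = measured_high_s * bus_frequency_hz
    return high_fraction / required_high_s


limit = max_bus_frequency_hz(
    measured_high_s=4.5e-6,   # high period seen in the capture
    bus_frequency_hz=100e3,   # current bus frequency
    required_high_s=4.7e-6,   # minimum high period quoted above
)
print(f"{limit / 1e3:.1f} kHz")
```

Under that assumption the limit comes out at roughly 95.7 kHz, so a modest reduction with some margin below that would be enough to satisfy the quoted requirement.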

gkasprow commented 2 years ago

It must be some artifact. The signals look clean on the scope.