oxidecomputer / humility

Debugger for Hubris
Mozilla Public License 2.0

Add ability to enable/disable individual phases on Renesas power controllers #367

Open ericaasen opened 1 year ago

ericaasen commented 1 year ago

This issue tracks logic analyzer dumps of what PowerNavigator sends to the controller when I enable or disable phases through the GUI.

A DSLogic U3Pro16 was used to capture all of these. I included a csv decode of the transactions and the raw capture in each of the dumps below. You can get pre-compiled versions of the Dream Source Lab GUI for Windows, Linux, and OS X here

If you are so inclined, the source code for the GUI is theoretically here

All captures done on Gimlet: 0XV1:9130000019:006:BRM44220021

ericaasen commented 1 year ago

Folder containing all dumps is here

  1. Starting with a default config, disable Phase 19 of U350 (VDD_VCORE controller): 0x5A_VDD_VCORE_controller_phase19_disable
  2. Disable Phase 18: 0x5A_VDD_VCORE_controller_phase18_disable
  3. Disable Phase 17: 0x5A_VDD_VCORE_controller_phase17_disable
  4. Disable Phase 16: 0x5A_VDD_VCORE_controller_phase16_disable
  5. Disable Phase 15: 0x5A_VDD_VCORE_controller_phase15_disable
  6. Disable Phase 14: 0x5A_VDD_VCORE_controller_phase14_disable
  7. Disable Phase 13: 0x5A_VDD_VCORE_controller_phase13_disable

Now that everything but Phase 12 is disabled, enable the VDD_VCORE regulator in Fixed PWM Mode and then disable it: 0x5A_VDD_VCORE_fixed_pwm_mode_enable 0x5A_VDD_VCORE_fixed_pwm_mode_disable

Here I disable phase 12, which hopefully doesn't do anything different since there aren't any phases enabled after this is done: 0x5A_VDD_VCORE_controller_phase12_disable

And here are the phase enables for phases 12-19: 0x5A_VDD_VCORE_controller_phase12_enable 0x5A_VDD_VCORE_controller_phase13_enable 0x5A_VDD_VCORE_controller_phase14_enable 0x5A_VDD_VCORE_controller_phase15_enable 0x5A_VDD_VCORE_controller_phase16_enable 0x5A_VDD_VCORE_controller_phase17_enable 0x5A_VDD_VCORE_controller_phase18_enable 0x5A_VDD_VCORE_controller_phase19_enable

ericaasen commented 1 year ago

Here is the same thing for the other rail on this controller (VDD_MEM_ABCD). 0x5A_VDD_MEM_ABCD_controller_phase3_disable 0x5A_VDD_MEM_ABCD_controller_phase2_disable 0x5A_VDD_MEM_ABCD_controller_phase1_disable 0x5A_VDD_MEM_ABCD_controller_phase0_disable

0x5A_VDD_MEM_ABCD_controller_phase3_enable 0x5A_VDD_MEM_ABCD_controller_phase2_enable 0x5A_VDD_MEM_ABCD_controller_phase1_enable 0x5A_VDD_MEM_ABCD_controller_phase0_enable

0x5A_VDD_MEM_ABCD_fixed_pwm_mode_enable 0x5A_VDD_MEM_ABCD_fixed_pwm_mode_disable

ericaasen commented 1 year ago

Since there is only one phase per rail on the ISL68224, only the transactions for Enable and Disable of Fixed PWM Mode are captured.

0x5C_VPP_ABCD_fixed_pwm_mode_enable 0x5C_VPP_ABCD_fixed_pwm_mode_disable

0x5C_VPP_EFGH_fixed_pwm_mode_enable 0x5C_VPP_EFGH_fixed_pwm_mode_disable

0x5C_V1P8_SP3_fixed_pwm_mode_enable 0x5C_V1P8_SP3_fixed_pwm_mode_disable

mkeeter commented 1 year ago

I did some poking at the RAA CSVs, with mixed results.

Here's a Python script that goes from CSV to DMA writes:

Here's a selection of the most plausible registers:

disable18
 4:E9C0 <= 000C0FF0
12:E9C1 <= 000C0FFF
21:E9C2 <= 0003F00F
25:E905 <= 00000F0A

disable17
12:E905 <= 00000F02
17:E9C2 <= 0001F00F
20:E9C1 <= 000E0FFF
22:E9C0 <= 000E0FF0

disable16
10:E9C1 <= 000F0FFF
16:E905 <= 00000F00
17:E9C2 <= 0000F00F
25:E9C0 <= 000F0FF0

disable15
 8:E9C2 <= 0000700F
21:E9C0 <= 000F8FF0
23:E904 <= 2A0000AA
24:E9C1 <= 000F8FFF
25:E905 <= 00000F00

E904-E905 appear to have a 2-bit field for each phase; I see one bit being cleared from each two-bit region when a phase is disabled, extending into E904 for phase 15.

E9C0 appears to have 1 bit set per phase disabled.

E9C1 is the same as E9C0, except there's another F in the lowest nibble?

E9C2 has one bit cleared each time we disable a phase.
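
The bit-count observations above can be checked mechanically. This is a minimal Python sketch that uses only the values quoted in this comment; the `WRITES` layout and the `popcount` helper are made up for illustration and are not part of any existing tool.

```python
# Values written to each register in the disable18, disable17, disable16
# and disable15 captures, in that order (copied from the listing above).
WRITES = {
    "E9C0": [0x000C0FF0, 0x000E0FF0, 0x000F0FF0, 0x000F8FF0],
    "E9C1": [0x000C0FFF, 0x000E0FFF, 0x000F0FFF, 0x000F8FFF],
    "E9C2": [0x0003F00F, 0x0001F00F, 0x0000F00F, 0x0000700F],
}


def popcount(value: int) -> int:
    """Count the bits set in value."""
    return bin(value).count("1")


# Number of set bits per register, per capture.
counts = {reg: [popcount(v) for v in vals] for reg, vals in WRITES.items()}

# Bits that differ between E9C1 and E9C0 in each capture.
c1_vs_c0 = [a ^ b for a, b in zip(WRITES["E9C1"], WRITES["E9C0"])]

for reg, per_capture in counts.items():
    print(reg, per_capture)
print("E9C1 ^ E9C0:", [f"{x:08X}" for x in c1_vs_c0])
```

If the reading above is right, the E9C0 and E9C1 counts go up by one per capture, the E9C2 count goes down by one, and the XOR is always the lowest nibble.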

There's a bunch of stuff going on with E9D* that I don't understand:

disable18
2:E9D4 <= 0001B015
3:E9D2 <= 0000C006
5:E9DA <= 00000000
7:E9DB <= 00000000
9:E9D6 <= 0002A024
10:E9D9 <= 0001B015
11:E9DC <= 00000000
13:E9D3 <= 0001400D
15:E9D7 <= 0000C006
16:E9DD <= 00000000
17:E9DF <= 00000000
19:E9D5 <= 0002301C
20:E9DE <= 00000000
24:E9D8 <= 0001400D

disable17
1:E9D4 <= 0001B015
3:E9DD <= 00000000
5:E9D2 <= 0000C006
6:E9D9 <= 00000000
7:E9DB <= 00000000
9:E9D3 <= 0001400D
11:E9DF <= 00000000
13:E9DC <= 00000000
15:E9DA <= 00000000
16:E9DE <= 00000000
19:E9D8 <= 0001B015
21:E9D7 <= 0001400D
24:E9D6 <= 0000C006
26:E9D5 <= 0002301C

disable16
1:E9D3 <= 0001400D
3:E9D4 <= 0001B015
5:E9DA <= 00000000
7:E9D5 <= 0000C006
8:E9D6 <= 0001400D
11:E9DD <= 00000000
12:E9DC <= 00000000
13:E9D2 <= 0000C006
14:E9DB <= 00000000
15:E9DF <= 00000000
20:E9D9 <= 00000000
21:E9D7 <= 0001B015
22:E9DE <= 00000000
26:E9D8 <= 00000000

disable15
2:E9DC <= 00000000
4:E9DE <= 00000000
5:E9D9 <= 00000000
6:E9D4 <= 0000C006
11:E9D7 <= 00000000
12:E9D6 <= 0001B015
13:E9D3 <= 0001400D
14:E9DD <= 00000000
16:E9DB <= 00000000
18:E9DF <= 00000000
19:E9DA <= 00000000
20:E9D2 <= 0000C006
22:E9D5 <= 0001400D
27:E9D8 <= 00000000

This almost looks like it's using those registers as scratch memory (?), e.g. 0001B015 is written to a bunch of them.

disable18
2:E9D4 <= 0001B015
10:E9D9 <= 0001B015

disable17
1:E9D4 <= 0001B015
19:E9D8 <= 0001B015

disable16
3:E9D4 <= 0001B015
21:E9D7 <= 0001B015

disable15
12:E9D6 <= 0001B015

In general, there seems to be no rhyme or reason to the order in which registers are written.

ericaasen commented 1 year ago

W.r.t. addresses E904 and E905: those are known to be associated with open pin detection, so I'd bet they are open pin detection fault mask registers rather than the detection bits themselves. Or maybe it's both: a bit in one place tells the controller to pay attention to open pin detection, and a bit in the other shows the actual detection? But given that the open-pin detection register address they gave us for Gen2 did not change when the open pin state changed, I lean towards it being a fault mask.

ericaasen commented 1 year ago

NOTE

One thing to note when thinking about how we test this in production: if there is an open pin (at least on the phase pins), no rail within the ISL68224 will even attempt to turn on. We will want to run the open pin detection before we try turning on individual phases; otherwise all 3 rails on the ISL68224 will fail because there will be no output voltage.

mkeeter commented 1 year ago

An annotated look at the enabling / disabling of fixed PWM mode, from 0x5A_VDD_MEM_ABCD_fixed_pwm_mode_enable 0x5A_VDD_MEM_ABCD_fixed_pwm_mode_disable

Fixed PWM enable:

02 10
    ON_OFF_CONFIG <= undocumented option

EA0D <= 00296230
    DMA operation

F0 40212010
    LOOPCFG 10202140 (byte swap)
        Bit 6: diode emulation enable
        Bit 8: minimum phase count = 1
        Bit 13: Reserved
        Bit 21: reserved
        Bit 28: Enable diode emulation for PS0/1

F0 00212010
    LOOPCFG 10202100
        Bit 8: minimum phase count = 1
        Bit 13: Reserved
        Bit 21: reserved
        Bit 28: Enable diode emulation for PS0/1

F0 00212000
    LOOPCFG 00202100
            Bit 8: minimum phase count = 1
            Bit 13: Reserved
            Bit 21: reserved

E9 0600
    PEAK_OCUC_COUNT <= 0006
        Number of consecutive switch cycles exceeding peak OC limit before fault = 6
        Number of consecutive switch cycles exceeding peak UC limit before fault = 0

F0 00212000
    LOOPCFG, same as above

EA5B <= 000007FE
    DMA register set

36 0080
    VIN_OFF <= 8000
        Sets V_IN OFF = -327680 mV

E932 <= 0038C5E0
    DMA register set

35 0080
    VIN_ON <= 8000
        Sets V_IN ON = -327680 mV

EA0D <= 00296231
    DMA register set

02 00
    ON_OFF_CONFIG <= 0
        force enables output

EA0D <= 00296231
    DMA register set

Fixed PWM disable:

02 10
    ON_OFF_CONFIG <= undocumented option

EA0D <= 00296235
    DMA operation

F0 00212000
    LOOPCFG <= 00202100
        Bit 8: minimum phase count = 1
        Bit 13: Reserved
        Bit 21: reserved

F0 40212000
    LOOPCFG <= 00202140
        Bit 6: diode emulation enable
        Bit 8: minimum phase count = 1
        Bit 13: Reserved
        Bit 21: reserved

F0 40212010
    LOOPCFG <= 10202140
        Bit 6: diode emulation enable
        Bit 8: minimum phase count = 1
        Bit 13: Reserved
        Bit 21: reserved
        Bit 28: Enable diode emulation for PS0/1

E9 0606
    PEAK_OCUC_COUNT <= 0606
        Number of consecutive switch cycles exceeding peak OC limit before fault = 6
        Number of consecutive switch cycles exceeding peak UC limit before fault = 6
        (This is the default value)

F0 40212010
    LOOPCFG <= 10202140, same as above

EA5B <= 000007FE
    DMA operation

E932 <= 003EC5EF
    DMA operation

35 BC02
    VIN_ON <= 02BC
        Sets V_IN ON to 700 mV

36 F401
    VIN_OFF <= 01F4
        Sets V_IN OFF to 500 mV

02 1E
    ON_OFF_CONFIG <= 1E
        Use configured TOFF_DELAY and TOFF_FALL settings
        Active high enable pin
        Enable requires enable pin AND OPERATION command

EA0D <= 00296234
    DMA operation

The known PMBus operations all seem reasonable; I'm not sure if any of the mystery DMA operations are load-bearing here.
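
For anyone re-checking the annotations above, the byte swap and bit listing can be reproduced with a few lines of Python. This is only a sketch: `decode_le` and `set_bits` are hypothetical helper names, and the inputs are the wire bytes quoted in this comment.

```python
def decode_le(wire_hex: str) -> int:
    """Interpret hex data bytes, in wire order, as a little-endian word."""
    return int.from_bytes(bytes.fromhex(wire_hex), "little")


def set_bits(value: int) -> list[int]:
    """Return the positions of the bits set in value, lowest first."""
    return [bit for bit in range(value.bit_length()) if value >> bit & 1]


# The three LOOPCFG (command F0) writes from the fixed PWM enable capture.
for wire in ("40212010", "00212010", "00212000"):
    word = decode_le(wire)
    print(f"F0 {wire} -> LOOPCFG {word:08X}, bits {set_bits(word)}")

# The same swap applies to 16-bit data, e.g. the PEAK_OCUC_COUNT write.
print(f"E9 0600 -> {decode_le('0600'):04X}")
```

The bit positions printed for each LOOPCFG value can be compared directly against the per-bit notes in the annotation.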

ericaasen commented 1 year ago

Given that the DMA accesses have different values between the two runs, I want a response from Renesas on those before we try it.

ericaasen commented 1 year ago

In summary, the address map and required sequence to enable a specific rail are detailed below, with notes about what the registers are doing and why we want to set them this way, as the guidance from Renesas was slightly lacking in this level of detail. Anything with a 4-digit (2-byte) address is a DMA access; anything with a 2-digit (1-byte) address is a regular PMBUS transaction.

All addresses can be assumed to be the same for either the RAA229618 or the ISL68224 controller unless specifically noted.

NOTE: for any regular PMBUS transactions, make sure to set the page to the correct rail before performing the transaction

=========================================
Enable fixed PWM mode order of operations
=========================================
PWM Pulse Width
  Set to 0x133 for 50ns
Rail 0: 0xEA31
Rail 1: 0xEAB1
Rail 2: 0xEB31

0xF0 - loop_cfg - Read-Modify-Write
--disable diode emulation mode everywhere by setting bits 6 and 28 to 0

0xE9 - phase_current_limit_count
-set to 0x00 06
--set per-phase undercurrent behavior to limit output current instead of faulting after 6 events, and leave the output overcurrent limit set to fault after 6 events
--the over/undercurrent limit values are set in registers 0xCD and 0xCE, respectively, in 0.1A/LSB in 2's complement. The defaults are 60A and -60A, respectively
--by disabling the output undercurrent fault on phases, we can see if the upper MOSFET is working. A non-working upper MOSFET shows up as a large negative current when enabled, and seems to be mainly caused by either the bootstrap power supply being damaged or one of the three 5V bias pins on the power stage not being connected correctly

Fixed Pulse Width Enable
  Read-Modify-Write 0x1 to enable
Rail 0: 0xEA0D
Rail 1: 0xEA8D
Rail 2: 0xEB0D
--On this register, bit 2 sets the ripple regulator to fixed-frequency PWM and bit 0 enables fixed PWM mode

==========================================
Disable fixed PWM mode order of operations
==========================================
Fixed Pulse Width Disable
  Read-Modify-Write 0x4 to disable
Rail 0: 0xEA0D
Rail 1: 0xEA8D
Rail 2: 0xEB0D

0xF0 - loop_cfg - Read-Modify-Write
--enable diode emulation mode everywhere by setting bits 6 and 28 to 1

0xE9 - phase_current_limit_count
-set to 0x06 06
--set per-phase undercurrent behavior to fault after 6 over/undercurrent events

One thing we might have to add, depending on testing, is copying some of the rail fault register values. I can't find where in PN to modify those, so I can't decode all of the bits in the register. The DMA register is 0xE952 for Rail 0 on ISL68224. PN disables input and output voltage faults and sets the min and max voltages for things like the VMON pin to their absolute max values, so those will never trip either.

If we see that we can't turn on a rail because it's having an output UV fault, we can try to dig into that more.
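
The read-modify-write steps in the summary above can be sketched as pure bit manipulation. This is not the humility implementation: the names are made up, the bus I/O is left out entirely, and reading "Read-Modify-Write 0x1 / 0x4" as "bits 0 and 2 end up as 0b001 / 0b100" is my interpretation, chosen because it matches the EA0D values in the captures (00296231 after enable, 00296234 after disable).

```python
# DMA addresses of the fixed pulse width enable register, per rail
# (from the summary above).
FIXED_PWM_ENABLE_REG = {0: 0xEA0D, 1: 0xEA8D, 2: 0xEB0D}

# loop_cfg (0xF0) diode emulation bits 6 and 28.
DIODE_EMULATION_BITS = (1 << 6) | (1 << 28)

FIXED_PWM_BIT = 1 << 0          # enables fixed PWM mode
RIPPLE_FIXED_FREQ_BIT = 1 << 2  # ripple regulator fixed-frequency PWM


def loop_cfg_for_fixed_pwm(loop_cfg: int, enable: bool) -> int:
    """Clear the diode emulation bits when enabling, set them when disabling."""
    if enable:
        return loop_cfg & ~DIODE_EMULATION_BITS
    return loop_cfg | DIODE_EMULATION_BITS


def fixed_pwm_reg(value: int, enable: bool) -> int:
    """Assumed update of bits 0 and 2, leaving all other bits untouched."""
    cleared = value & ~(FIXED_PWM_BIT | RIPPLE_FIXED_FREQ_BIT)
    return cleared | (FIXED_PWM_BIT if enable else RIPPLE_FIXED_FREQ_BIT)


# Replay the values seen in the VDD_MEM_ABCD captures.
print(f"{loop_cfg_for_fixed_pwm(0x10202140, enable=True):08X}")
print(f"{fixed_pwm_reg(0x00296235, enable=True):08X}")
print(f"{fixed_pwm_reg(0x00296231, enable=False):08X}")
```

Keeping the bit logic separate from the bus access makes it easy to compare against the PN captures before anything is written to real hardware.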

ericaasen commented 1 year ago

When trying this on a Gimlet, the rail does not seem to be happy. @mkeeter set VDD_MEM_EFGH so that phase 0 was the only phase enabled; I saw there was a blackbox event and pulled the info:

eric@niles ~ $ pfexec humility -t gimlet-b-matt rendmp --blackbox --device 0x5B
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3
rail0 uptime: 31.6 sec
rail1 uptime: 31.6 sec
controller fault: 00000000000000000000000100000000 ()
rail0 fault: 00000000000000000000000000000000 ()
rail1 fault: 00000000000000000000000000000000 ()
phase fault uc: 00000000000000000000000000000000 ()
phase fault oc: 00000000000000000000000000000000 ()
adc fault uc: 00000000000000000000000000000000 ()
adc fault oc: 00000000000000000000000000000000 ()
rail0 status: 0001100001000011 MFR_SPECIFIC | POWER_GOOD# | off | CML | none of the above
rail1 status: 0001100001000011 MFR_SPECIFIC | POWER_GOOD# | off | CML | none of the above
status cml: 00001000 VIN_UV_FAULT
status mfr: 00001000 BBEVENT
rail1 status vout: 00000000 ()
rail0 status vout: 00000000 ()
rail1 status iout: 00000000 ()
rail0 status iout: 00000000 ()
rail1 status temperature: 00000000 ()
rail0 status temperature: 00000000 ()
rail1 status input: 00000000 ()
rail0 status input: 00000000 ()

     | RAIL 0  | RAIL 1
-----|---------|-----------
VIN  | 12.00 V | 12.00 V
VOUT | 0.000 V | 0.000 V
IIN  | 0.00 A  | 0.00 A
IOUT | 0.0 A   | 0.0 A
TEMP | 24°C    | 24°C
controller read temperature: 18°C

 PHASE | TEMPERATURE | CURRENT
-------|-------------|----------
 0     | 0°C         | 0.0 A
 1     | 0°C         | 0.0 A
 2     | 0°C         | 0.0 A
 3     | 0°C         | 0.0 A
 4     | 0°C         | 0.0 A
 5     | 0°C         | 0.0 A
 6     | 0°C         | 0.0 A
 7     | 0°C         | 0.0 A
 8     | 0°C         | 0.0 A
 9     | 0°C         | 0.0 A
 10    | 0°C         | 0.0 A
 11    | 0°C         | 0.0 A
 12    | 0°C         | 0.0 A
 13    | 0°C         | 0.0 A
 14    | 0°C         | 0.0 A
 15    | 0°C         | 0.0 A
 16    | 0°C         | 0.0 A
 17    | 0°C         | 0.0 A
 18    | 0°C         | 0.0 A
 19    | 24°C        | 0.0 A

The VIN_UV_FAULT is interesting because it doesn't show up in the normal fault registers. What is flagged there is something called ProcessorFault:

0x7e STATUS_CML                0x88
     |
     | b7     0b1 = invalid command(s)       <= InvalidCommand
     | b6     0b0 = no invalid data          <= InvalidData
     | b5     0b0 = not failed               <= PECFailed
     | b4     0b0 = no fault                 <= MemoryFault
     | b3     0b1 = fault                    <= ProcessorFault
     | b1     0b0 = no error                 <= OtherCommunicationError
     | b0     0b0 = no error                 <= OtherMemoryLogicError
     +-----------------------------------------------------------------------

This might be related to the mystery changes to the railFltEn1_vinUnderVolt fault setting that PN was showing, which don't seem to be independently changeable in the GUI.
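
As a quick cross-check, here is a small Python sketch that decodes a STATUS_CML byte using the bit names from the table above (the helper name is made up). It shows that the blackbox "status cml" byte, 00001000, has only bit 3 set, which is the position the table calls ProcessorFault. Whether the blackbox record really uses the same bit layout is an assumption this does not settle.

```python
# Bit names as printed in the STATUS_CML table above (bit 2 is not listed).
STATUS_CML_BITS = {
    7: "InvalidCommand",
    6: "InvalidData",
    5: "PECFailed",
    4: "MemoryFault",
    3: "ProcessorFault",
    1: "OtherCommunicationError",
    0: "OtherMemoryLogicError",
}


def decode_status_cml(value: int) -> list[str]:
    """Return the names of the flags set in a STATUS_CML byte."""
    return [name for bit, name in STATUS_CML_BITS.items() if value >> bit & 1]


print(decode_status_cml(0x88))        # live register read shown above
print(decode_status_cml(0b00001000))  # "status cml" byte from the blackbox
```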

ericaasen commented 1 year ago

One of the issues was that we weren't setting ON_OFF_CONFIG to 0x00 (which ignores all control from the pin or PMBUS and always has the rail enabled). This was in the output from PN, but I did not understand the significance.
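
For reference, a sketch of how the ON_OFF_CONFIG bits break down, assuming the standard PMBus bit definitions apply to these parts unchanged (not verified against the Renesas documentation here). The function and field names are illustrative only.

```python
def decode_on_off_config(value: int) -> dict[str, bool]:
    """Split an ON_OFF_CONFIG byte into its standard PMBus fields."""
    return {
        # bit 4: 0 = power up whenever input power is present,
        #        1 = wait for the CONTROL pin / OPERATION command
        "waits_for_command": bool(value & 0x10),
        # bit 3: on/off portion of the OPERATION command is required
        "needs_operation_command": bool(value & 0x08),
        # bit 2: CONTROL (enable) pin must be asserted
        "needs_control_pin": bool(value & 0x04),
        # bit 1: CONTROL pin polarity, 1 = active high
        "control_pin_active_high": bool(value & 0x02),
        # bit 0: 0 = use the programmed turn-off delay and fall time,
        #        1 = turn off immediately
        "immediate_off": bool(value & 0x01),
    }


print(decode_on_off_config(0x00))  # value PN writes for fixed PWM mode
print(decode_on_off_config(0x1E))  # value PN restores afterwards
```

With 0x00 every field is False, i.e. the rail runs whenever input power is present; 0x1E requires both the active-high enable pin and the OPERATION command.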