oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.
Mozilla Public License 2.0

Improvements to transceiver temperature handling #1081

Open mkeeter opened 1 year ago

mkeeter commented 1 year ago

quoth @kc8apf

  • reading max case temperature to set limit
  • checking Data_Not_Ready and reporting no temp available if set
  • interpreting a free side temperature of 0 °C as no temperature, and hoping that no one operates a module at precisely that temperature.
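
A minimal sketch of how the second and third bullets could fit together. Everything here is hypothetical: the `TempReading` enum, `decode_temp`, and `classify` are illustrative names, not existing Hubris code, and the caller is assumed to have already read the Data_Not_Ready flag and the two raw temperature bytes. The decoding assumes a big-endian signed 16-bit value in units of 1/256 °C. The max case temperature limit from the first bullet is not modeled.

```rust
/// Outcome of one free side temperature poll (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum TempReading {
    /// Data_Not_Ready was set, so the monitor bytes are not valid yet.
    NotReady,
    /// The module returned exactly 0 °C, which is treated as "no temperature".
    Unavailable,
    /// A usable temperature in degrees Celsius.
    Celsius(f32),
}

/// Decode two raw bytes, assuming a big-endian signed value in 1/256 °C.
pub fn decode_temp(raw: [u8; 2]) -> f32 {
    f32::from(i16::from_be_bytes(raw)) / 256.0
}

/// Apply the proposed checks in order: readiness first, then the 0 °C sentinel.
pub fn classify(data_not_ready: bool, raw: [u8; 2]) -> TempReading {
    if data_not_ready {
        TempReading::NotReady
    } else if raw == [0, 0] {
        TempReading::Unavailable
    } else {
        TempReading::Celsius(decode_temp(raw))
    }
}
```

Keeping `NotReady` and `Unavailable` as separate variants would let the caller retry the first case while skipping the module entirely in the second.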
Aaron-Hartwig commented 1 month ago

Additionally, we could simply drop modules that do not support temperature monitoring from the loop. Our current logic reads the relevant bytes from the module without checking whether they are valid.

For SFF-8636, we would need to qualify our read of the free side temperature monitor (lower page, bytes 22/23) on whether that monitoring is actually supported (upper page 0, byte 220, bit 5).

For CMIS, we would need to qualify our read of the temperature monitor (lower page, bytes 14/15) on whether that monitoring is actually supported (upper page 1, byte 159, bit 0).
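
A rough sketch of that qualification, using only the page, byte, and bit locations named above. The `Interface` enum, `SupportFlag` struct, and both functions are hypothetical names for illustration, not the existing transceiver code. The caller is assumed to have already read the support byte and the two monitor bytes, and the temperature is assumed to be a big-endian signed value in 1/256 °C.

```rust
/// Management interface spoken by a module (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Interface {
    Sff8636,
    Cmis,
}

/// Where a module advertises temperature-monitor support.
#[derive(Debug, PartialEq)]
pub struct SupportFlag {
    pub upper_page: u8,
    pub byte: u8,
    pub bit: u8,
}

impl Interface {
    /// Location of the support bit, as given in this thread.
    pub fn support_flag(self) -> SupportFlag {
        match self {
            Interface::Sff8636 => SupportFlag { upper_page: 0, byte: 220, bit: 5 },
            Interface::Cmis => SupportFlag { upper_page: 1, byte: 159, bit: 0 },
        }
    }

    /// Lower-page offset of the first of the two temperature bytes.
    pub fn temp_offset(self) -> u8 {
        match self {
            Interface::Sff8636 => 22,
            Interface::Cmis => 14,
        }
    }
}

/// True if the already-read support byte has the relevant bit set.
pub fn supports_temp(iface: Interface, support_byte: u8) -> bool {
    (support_byte & (1u8 << iface.support_flag().bit)) != 0
}

/// Return a temperature only when the module claims to support it.
pub fn qualified_temp(iface: Interface, support_byte: u8, raw: [u8; 2]) -> Option<f32> {
    if supports_temp(iface, support_byte) {
        Some(f32::from(i16::from_be_bytes(raw)) / 256.0)
    } else {
        None
    }
}
```

A module for which `qualified_temp` returns `None` could then be dropped from the loop rather than polled every cycle.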

mkeeter commented 1 month ago

If we can get some of the misbehaving transceivers into a bench Sidecar, it should be pretty easy to test this out.

Aaron-Hartwig commented 1 month ago

Looks like we have many options on niles! Anywhere xcvradm shows -- for a field, the module reports that field as unsupported.

aaron@niles ~ $ ./xcvradm -i axf7 -t present vendor-info
Port Identifier               Vendor           Part             Rev  Serial           Mfg date
   0 Qsfp28 (0x11)            FS               QSFP28-SR4-100G  04   G2130484857      20220321
   2 QsfpPlusCmis (0x1e)      Intel Corp       SPTSMP3CLCDA     03   CRFR2141020JP    21101500
   3 QsfpPlusCmis (0x1e)      Intel Corp       SPTSMP3CLCDA     03   CRFR213905JEP    21101800
   4 QsfpPlusCmis (0x1e)      FINISAR CORP.    FTCC1112E2PCL    A    X65BPQR          210901
   5 Qsfp28 (0x11)            FS               QSFP28-SR4-100G  1A   F2220590150      220615
   8 QsfpPlusCmis (0x1e)      FINISAR CORP.    FTCC1112E2PCL    A    X6QA1JC          220305
  16 Qsfp28 (0x11)            FS               QSFP28-SR4-100G  04   G2130484856      20220321
  24 Qsfp28 (0x11)            Intel Corp       AMQ28-SR4        01   IN100MC0040      221206

aaron@niles ~ $ ./xcvradm -i axf7 -t present monitors
Port 0
         Temperature (C): --
      Supply voltage (V): --
       Avg Rx power (mW): [0.6940,0.5875,0.6929,0.5592]
            Tx bias (mA): [0.0000,0.0000,0.0000,0.0000]
           Tx power (mW): [0.0001,0.0001,0.0001,0.0001]
                   Aux 1: --
                   Aux 2: --
                   Aux 3: --

Port 2
         Temperature (C): 30.992188
      Supply voltage (V): 3.3943
       Avg Rx power (mW): [0.0001,0.0001,0.0001,0.0001,0.0000,0.0000,0.0000,0.0000]
            Tx bias (mA): [0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000]
           Tx power (mW): [0.0001,0.0001,0.0001,0.0001,0.0000,0.0000,0.0000,0.0000]
                   Aux 1: --
                   Aux 2: --
                   Aux 3: --

Port 3
         Temperature (C): 29.847656
      Supply voltage (V): 3.4041
       Avg Rx power (mW): [0.0001,0.0001,0.0001,0.0001,0.0000,0.0000,0.0000,0.0000]
            Tx bias (mA): [0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000]
           Tx power (mW): [0.0001,0.0001,0.0001,0.0001,0.0000,0.0000,0.0000,0.0000]
                   Aux 1: --
                   Aux 2: --
                   Aux 3: --

Port 4
         Temperature (C): 28.601563
      Supply voltage (V): 3.3572998
       Avg Rx power (mW): [0.0001,0.0001,0.0001,0.0001,0.0000,0.0000,0.0000,0.0000]
            Tx bias (mA): [0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000]
           Tx power (mW): [0.0001,0.0001,0.0001,0.0001,0.0000,0.0000,0.0000,0.0000]
                   Aux 1: --
                   Aux 2: --
                   Aux 3: --

Port 5
         Temperature (C): --
      Supply voltage (V): --
       Avg Rx power (mW): [0.0000,0.0000,0.0000,0.0000]
            Tx bias (mA): [0.0000,0.0000,0.0000,0.0000]
           Tx power (mW): [0.0001,0.0001,0.0001,0.0001]
                   Aux 1: --
                   Aux 2: --
                   Aux 3: --

Port 8
         Temperature (C): 27.527344
      Supply voltage (V): 3.3665
       Avg Rx power (mW): [0.0001,0.0001,0.0001,0.0001,0.0000,0.0000,0.0000,0.0000]
            Tx bias (mA): [0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000]
           Tx power (mW): [0.0001,0.0001,0.0001,0.0001,0.0000,0.0000,0.0000,0.0000]
                   Aux 1: --
                   Aux 2: --
                   Aux 3: --

Port 16
         Temperature (C): --
      Supply voltage (V): --
       Avg Rx power (mW): [0.0001,0.0001,0.0001,0.0001]
            Tx bias (mA): [0.0000,0.0000,0.0000,0.0000]
           Tx power (mW): [0.0001,0.0001,0.0001,0.0001]
                   Aux 1: --
                   Aux 2: --
                   Aux 3: --

Port 24
         Temperature (C): --
      Supply voltage (V): --
       Avg Rx power (mW): [0.0001,0.0001,0.0001,0.0001]
            Tx bias (mA): [5.7040,5.6980,5.6780,5.7180]
           Tx power (mW): [0.9602,0.8259,1.0517,1.0618]
                   Aux 1: --
                   Aux 2: --
                   Aux 3: --

mkeeter commented 1 month ago

There's some weirdness going on here: notice that none of the SFF-8636 transceivers are reporting temperature, while all of the CMIS transceivers are!

For the SFF-8636 transceivers, xcvradm is looking at upper page 0, byte 220, bit 5 (per the spec):

[Screenshot: 2024-09-20, 2:48:39 PM]

All of our transceivers are reporting 0x0c:

matt@niles ~ () $ ./xcvradm -i axf7 -t0,5,16,24 read-upper --page 0 --sff 220 1
Port Data
   0 [0x0c]
   5 [0x0c]
  16 [0x0c]
  24 [0x0c]

0x0c is 0b0000_1100, so only bits 2 and 3 are set. Bit 5 is clear, which means these modules claim not to support temperature readings.

However, all of them also return perfectly plausible temperature values:

matt@niles ~ () $ ./xcvradm -i axf7 -t0,5,16,24 read-lower --sff 22 2
Port Data
   0 [0x1b,0x66] # 27.40°C
   5 [0x18,0x7b] # 24.48°C
  16 [0x1a,0x48] # 26.28°C
  24 [0x1e,0x5f] # 30.37°C
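
For reference, the annotated values can be reproduced with a short sketch. It assumes the two bytes form a big-endian signed 16-bit value in units of 1/256 °C, which is consistent with the readings here; `decode_temp` and `format_readings` are hypothetical helper names.

```rust
/// Decode two raw bytes, assuming a big-endian signed value in 1/256 °C.
pub fn decode_temp(raw: [u8; 2]) -> f32 {
    f32::from(i16::from_be_bytes(raw)) / 256.0
}

/// Render (port, raw bytes) pairs as one line per port, rounded to 0.01 °C.
pub fn format_readings(readings: &[(u8, [u8; 2])]) -> Vec<String> {
    readings
        .iter()
        .map(|(port, raw)| format!("port {}: {:.2} °C", port, decode_temp(*raw)))
        .collect()
}
```

Feeding in the four readings gives 27.40, 24.48, 26.28, and 30.37 °C for ports 0, 5, 16, and 24. For example, 0x1b66 is 7014, and 7014 / 256 ≈ 27.40.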

I'm a little mystified here. Do we have any SFF-8636 transceivers that claim to support temperature monitoring?

nathanaelhuffman commented 1 month ago

[image]

This table would make me think that temperature, at least, is required for SM-type devices, which is what all of our non-DAC, non-active-optical modules are.

Maybe they're "pre-rev 2.8"? Or maybe the monitoring bit refers to some other feature, like over-temperature alerts?

mkeeter commented 1 month ago

Maybe they're "pre-rev 2.8"? Or maybe the monitoring bit refers to some other feature, like over-temperature alerts?

I checked the version theory earlier. All but one of them return a revision value (0x08) that means rev 2.8, 2.9, or 2.10; the exception is port 24, which returns 0x07:

matt@niles ~ () $ ./xcvradm -i axf7 -t0,5,16,24 read-lower --sff 1 1
Port Data
   0 [0x08]
   5 [0x08]
  16 [0x08]
  24 [0x07]
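
A small sketch of the "pre-rev 2.8" check, assuming that revision codes increase with the spec revision, so that anything below 0x08 predates rev 2.8. The only mapping taken from this thread is 0x08 meaning rev 2.8, 2.9, or 2.10; the constant and function names are hypothetical.

```rust
/// Revision code that, per this thread, means rev 2.8, 2.9, or 2.10.
pub const REV_2_8_TO_2_10: u8 = 0x08;

/// Assumption: codes are monotonic, so a smaller code is an older revision.
pub fn is_pre_rev_2_8(code: u8) -> bool {
    code < REV_2_8_TO_2_10
}

/// Return the ports whose revision code predates rev 2.8.
pub fn pre_rev_2_8_ports(readings: &[(u8, u8)]) -> Vec<u8> {
    readings
        .iter()
        .filter(|(_, code)| is_pre_rev_2_8(*code))
        .map(|(port, _)| *port)
        .collect()
}
```

With the four values above, only port 24 is flagged, so the version theory cannot explain the missing support bit on ports 0, 5, and 16.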

[Screenshot: 2024-09-20, 2:07:55 PM]

Good catch on §6.2.4. It sure looks like temperature monitoring should be required for SM modules; it's just unfortunate that the diagnostic monitor bitfield doesn't reflect that...

Aaron-Hartwig commented 2 weeks ago

@mkeeter the sus module I ordered finally arrived and is now installed in port 13 on the niles sidecar. Not urgent, just updating the ticket for future us.