sinara-hw / booster-firmware

Other
4 stars 3 forks source link

fix temperature conversion #7

Open hartytp opened 4 years ago

hartytp commented 4 years ago

https://github.com/sinara-hw/Booster/issues/338#issuecomment-592215237

bjuszczyk commented 4 years ago

Hi Tom,

could you please clarify what is wrong with temperature conversion. I looked at the code and it seems fine for me.

i.e. with the current code 0xC0 (0.75C) gives 1.5C. So, a 0.25C change from 0.75C to 1C would cause the temperature to decrease by 0.5C. Similarly, a change from 0C to 0.25C would change the calculated temperature from 0C to 1C. That likely explains the crazy statistics I saw...

Datasheet says:

Extended Temperature Registers (05h and 06h) These registers contain the extended temperature data from channels 1 and 2. Bits D[7:5] contain the 3 LSBs of the temperature data. The bit values are 0.5°C, 0.25°C, and 0.125°C. When bit 0 is set to 1, a diode fault has been detected.

As you can notice it stands that the bit values are as follows 0.5, 0.25, 0.125. So in fact if you have i.e. 0xC0 it means that you should add 0.5 and 0.25 C which is 0.75C in total.

hartytp commented 4 years ago

As you can notice it stands that the bit values are as follows 0.5, 0.25, 0.125. So in fact if you have i.e. 0xC0 it means that you should add 0.5 and 0.25 C which is 0.75C in total.

Yes, if you have 0XC0 you should add 0.5C+0.25C=0.75C.

image

However, that's now what the current code does....

    total = (float) temp;
    if (temp_ext & 0x40) total += 0.250;
    if (temp_ext & 0x80) total += 0.500;
    if (temp_ext & 0xC0) total += 0.750;

if temp_ext is 0XC0 then total=0.25+0.5+0.75=1.5C which is not correct.

i.e. the current code is not the same as

    total = (float) temp;
    if ((temp_ext & 0x40) == 0x40) total += 0.250;
    if ((temp_ext & 0x80) == 0X80) total += 0.500;
    if ((temp_ext & 0xC0) == 0xC0) total += 0.750;

or, more succinctly,

total += (float)(temp_ext >> 6)*0.25;
hartytp commented 4 years ago

Extended Temperature Registers (05h and 06h) These registers contain the extended temperature data from channels 1 and 2. Bits D[7:5] contain the 3 LSBs of the temperature data. The bit values are 0.5°C, 0.25°C, and 0.125°C. When bit 0 is set to 1, a diode fault has been detected.

Where is that. In the datasheet I see

image

image

bjuszczyk commented 4 years ago

Where did you find the code that you posted?

Actually function looks different:

float max6639_get_temperature(uint8_t addr, uint8_t ch)
{
    float total = 0.0;
    uint8_t temp, temp_ext = 0;

    vPortEnterCritical();

    i2c_read(MAX6639_I2C, addr, MAX6639_REG_TEMP(ch), &temp);
    i2c_read(MAX6639_I2C, addr, MAX6639_REG_TEMP_EXT(ch), &temp_ext);

    vPortExitCritical();

    total = (float) temp;
    if (temp_ext & 0x80) total += 0.50;
    if (temp_ext & 0x40) total += 0.25;
    if (temp_ext & 0x20) total += 0.125;

    return total;
}
hartytp commented 4 years ago

https://github.com/sinara-hw/booster-firmware/blob/2eaa87dde084d87ccefecccb32736c0b8b1ed901/Source/devices/max6642.c#L29

hartytp commented 4 years ago

The temperature sensors on each channel are a 6642, not a 6639

image

bjuszczyk commented 4 years ago

Ok, my bad ;) Give me a minute i will look at the code.

hartytp commented 4 years ago

Searching the sources quickly, I can't see anywhere that max6639_get_temperature is called, so maybe this is dead code?

@gkasprow do we use a 6639 anywhere?

hartytp commented 4 years ago

@bjuszczyk the 6639 is used on the main board

image

hartytp commented 4 years ago

@bjuszczyk AFAICT the MAX6639.c is dead code that is no longer used. I think (from a quick look, but i might be wrong) that the temp_mgt code is actually what controls the MAX6639.

Based on this code, it looks like we never actually read out the temperature of the chassis (MAX6639), only the channels. Not sure if this is an oversight or deliberate...

While you're fixing the 6642 conversion, it might be worth tidying up/documenting (and deleting dead code) the rest of the temperature management code because right now it's pretty hard to understand how it is supposed to function (this should ultimately also be documented in the Wiki)

hartytp commented 4 years ago

Why is RF channel enable in the temp management code?? That seems like pretty unexpected behavior (functionality exceeds what one would reasonably expect from the function name)

https://github.com/sinara-hw/booster-firmware/blob/ade44e9ae6dd3dd36f435e10f69d60876c4fdddd/Source/board/temp_mgt.c#L132

hartytp commented 4 years ago

The other thing I would comment here is that the moving average should probably be removed.

https://github.com/sinara-hw/booster-firmware/blob/ade44e9ae6dd3dd36f435e10f69d60876c4fdddd/Source/board/temp_mgt.c#L21

The noise on the temperature sensor without averaging should only be +-0.25C, which is fine. I suspect that the averaging was implemented because the noise was much larger than expected because of other bugs. It's better to fix those bugs than to try to hide the symptoms with averaging...

bjuszczyk commented 4 years ago

Ok, i have spotted similar conversion at the code that was implemented for MAX6639. That's why i was a bit confused. The section that you are pointing has a bug. I can implement it i a different way and that should solve the issue:

    if      (temp_ext & 0xC0) total += 0.750;
    else if (temp_ext & 0x80) total += 0.500;
    else if (temp_ext & 0x40) total += 0.250;

I don't know the reason for enabling the channels in this section.

I think that the average value is calculated for fan speed control. It doesn't affect the readouts.

hartytp commented 4 years ago

I can implement it i a different way and that should solve the issue:

Yes, that would work. I still think this way is easier to follow though :)

total += (float)(temp_ext >> 6)*0.25;

I think that the average value is calculated for fan speed control. It doesn't affect the readouts.

Aah, I see. Yes, for the fans I can see why that would be helpful. Thanks for explaining!

hartytp commented 4 years ago

I don't know the reason for enabling the channels in this section.

I assume it's just historical. It's not a major issue, but I think that overall the code could use a restructure. I suspect that the process of tidying up/restructuring would fix quite a few bugs. Some of the current code paths are quite convoluted and hard to understand making debugging tricky..

bjuszczyk commented 4 years ago

Yes, it could fix some issues. But still its very complex and extensive code, thus it requires massive amount of work to do this.

Ok, so i will commit new changes in a minute. Please verify it it helps for the temperatures issue.

hartytp commented 4 years ago

Ok, so i will commit new changes in a minute. Please verify it it helps for the temperatures issue.

Thanks! Can you publish a firmware RC as well (in the releases section)?

bjuszczyk commented 4 years ago

Yes i can add RC. What number should i put for this version? Is there any algorithm or sth?

hartytp commented 4 years ago

Good question. I'm not sure, you'd have to ask @wizath NB we should also update the fw version string to match.

We're currently on v1.4.1. i think there has been enough work done to warrant moving to v1.5rc1 or something like that