SPDIF algorithm - Githubissues

masterxq commented 1 year ago

Hey bois*, tonight I had time to learn spdif, and I wrote an optimized implementation of the spdif conversion algorithm. It does not match all your requirements (only 4-byte samples), but I would like to post it here so you have the option to use it.

I measured the speed of several algorithms; currently this one is the winner. These were the test conditions:

25000 iterations
1024 byte input buffer (256 16bit stereo samples).
Test environment: ESP32
Optimization: -O2

Results:

Philippe44 algorithm needs: 88 us
Another algorithm I found on the net and optimized slightly, before I understood spdif: 73 us
MasterQ (this one) algorithm: 56 us

Additionally, this algorithm fixes some problems:

passes the validation
no noise on the last 4 bits
much more straightforward (handles sample by sample)
really sets the validity bit instead of hacking the LSBs
is faster

It brings two new potential problems:

Only for one instance as local static variables are used
8 byte samples not implemented

Both points should be easy to change or implement.

The Philippe44 algorithm did not pass the validation, which was most likely a mistake of mine when reimplementing it in my test application. But maybe this should be checked, so I will attach my validation code.

E (5080) TestProject: Failed dword3 or dword4, halfsample: 511 bit: 0
I (5090) TestProject: V_U_C_P_P_P_P_P_ A_A_A_A_S_S_S_S_ S_S_S_S_S_S_S_S_ S_S_S_S_S_S_S_S_ V_U_C_P_P_P_P_P_ A_A_A_A_S_S_S_S_ S_S_S_S_S_S_S_S_ S_S_S_S_S_S_S_S_
I (5110) TestProject: 1100110011100100 1100110011010101 1100110011001100 1100110011001100 1100110011100010 1100110011001100 1100110011001100 1100110011001100
                                                      ^ ^^

What was tested:

Tested to play sound on a real device (passed)
Tested the data with a very simple spdif validator (passed)
Tested the spdif validator with other algorithm (2 of 3 passed)
Chunk test with real device and validator

What was not tested:

Decode the data and compare with the original data
Long time tests

Feel free to use this code however and wherever you want. If you have a good day, then mention where it comes from :)

I will post here if I find issues or have optimizations.

Here is the code

#include <stdint.h>

#define VUCP_PREAMBLE_B_16_32 0xCCE80000
#define VUCP_PREAMBLE_M_16_32 0xCCE20000
#define VUCP_PREAMBLE_W_16_32 0xCCE40000

//To have a predictable state, the entries always start
//with the first bit set. This is ok as the phase does not matter
//in the bmc encoding!
uint16_t spdif_bmclookup_1first[256] =
{
    0xCCCC, 0xB333, 0xD333, 0xACCC, 0xCB33, 0xB4CC, 0xD4CC, 0xAB33,
    0xCD33, 0xB2CC, 0xD2CC, 0xAD33, 0xCACC, 0xB533, 0xD533, 0xAACC,
    0xCCB3, 0xB34C, 0xD34C, 0xACB3, 0xCB4C, 0xB4B3, 0xD4B3, 0xAB4C,
    0xCD4C, 0xB2B3, 0xD2B3, 0xAD4C, 0xCAB3, 0xB54C, 0xD54C, 0xAAB3,
    0xCCD3, 0xB32C, 0xD32C, 0xACD3, 0xCB2C, 0xB4D3, 0xD4D3, 0xAB2C,
    0xCD2C, 0xB2D3, 0xD2D3, 0xAD2C, 0xCAD3, 0xB52C, 0xD52C, 0xAAD3,
    0xCCAC, 0xB353, 0xD353, 0xACAC, 0xCB53, 0xB4AC, 0xD4AC, 0xAB53,
    0xCD53, 0xB2AC, 0xD2AC, 0xAD53, 0xCAAC, 0xB553, 0xD553, 0xAAAC,
    0xCCCB, 0xB334, 0xD334, 0xACCB, 0xCB34, 0xB4CB, 0xD4CB, 0xAB34,
    0xCD34, 0xB2CB, 0xD2CB, 0xAD34, 0xCACB, 0xB534, 0xD534, 0xAACB,
    0xCCB4, 0xB34B, 0xD34B, 0xACB4, 0xCB4B, 0xB4B4, 0xD4B4, 0xAB4B,
    0xCD4B, 0xB2B4, 0xD2B4, 0xAD4B, 0xCAB4, 0xB54B, 0xD54B, 0xAAB4,
    0xCCD4, 0xB32B, 0xD32B, 0xACD4, 0xCB2B, 0xB4D4, 0xD4D4, 0xAB2B,
    0xCD2B, 0xB2D4, 0xD2D4, 0xAD2B, 0xCAD4, 0xB52B, 0xD52B, 0xAAD4,
    0xCCAB, 0xB354, 0xD354, 0xACAB, 0xCB54, 0xB4AB, 0xD4AB, 0xAB54,
    0xCD54, 0xB2AB, 0xD2AB, 0xAD54, 0xCAAB, 0xB554, 0xD554, 0xAAAB,
    0xCCCD, 0xB332, 0xD332, 0xACCD, 0xCB32, 0xB4CD, 0xD4CD, 0xAB32,
    0xCD32, 0xB2CD, 0xD2CD, 0xAD32, 0xCACD, 0xB532, 0xD532, 0xAACD,
    0xCCB2, 0xB34D, 0xD34D, 0xACB2, 0xCB4D, 0xB4B2, 0xD4B2, 0xAB4D,
    0xCD4D, 0xB2B2, 0xD2B2, 0xAD4D, 0xCAB2, 0xB54D, 0xD54D, 0xAAB2,
    0xCCD2, 0xB32D, 0xD32D, 0xACD2, 0xCB2D, 0xB4D2, 0xD4D2, 0xAB2D,
    0xCD2D, 0xB2D2, 0xD2D2, 0xAD2D, 0xCAD2, 0xB52D, 0xD52D, 0xAAD2,
    0xCCAD, 0xB352, 0xD352, 0xACAD, 0xCB52, 0xB4AD, 0xD4AD, 0xAB52,
    0xCD52, 0xB2AD, 0xD2AD, 0xAD52, 0xCAAD, 0xB552, 0xD552, 0xAAAD,
    0xCCCA, 0xB335, 0xD335, 0xACCA, 0xCB35, 0xB4CA, 0xD4CA, 0xAB35,
    0xCD35, 0xB2CA, 0xD2CA, 0xAD35, 0xCACA, 0xB535, 0xD535, 0xAACA,
    0xCCB5, 0xB34A, 0xD34A, 0xACB5, 0xCB4A, 0xB4B5, 0xD4B5, 0xAB4A,
    0xCD4A, 0xB2B5, 0xD2B5, 0xAD4A, 0xCAB5, 0xB54A, 0xD54A, 0xAAB5,
    0xCCD5, 0xB32A, 0xD32A, 0xACD5, 0xCB2A, 0xB4D5, 0xD4D5, 0xAB2A,
    0xCD2A, 0xB2D5, 0xD2D5, 0xAD2A, 0xCAD5, 0xB52A, 0xD52A, 0xAAD5,
    0xCCAA, 0xB355, 0xD355, 0xACAA, 0xCB55, 0xB4AA, 0xD4AA, 0xAB55,
    0xCD55, 0xB2AA, 0xD2AA, 0xAD55, 0xCAAA, 0xB555, 0xD555, 0xAAAA
};
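The table above does not have to be typed by hand. Here is a minimal sketch of how its entries can be derived, assuming standard biphase-mark coding (a transition at the start of every bit, an extra mid-bit transition for a 1), data bits sent LSB first, and the first cell ending up in the MSB of the 16-bit word. The helper names `bmc_encode_byte` and `bmc_selfcheck` are made up for this illustration and are not part of the original code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: biphase-mark encode one byte, LSB first.
 * start_level is the line level just before the first cell. */
static uint16_t bmc_encode_byte(uint8_t value, unsigned start_level)
{
    uint16_t cells = 0;
    unsigned level = start_level & 1u;
    for (int bit = 0; bit < 8; bit++) {
        level ^= 1u;                              /* transition at every bit start */
        cells = (uint16_t)((cells << 1) | level);
        if ((value >> bit) & 1u) {
            level ^= 1u;                          /* extra mid-bit transition for a 1 */
        }
        cells = (uint16_t)((cells << 1) | level);
    }
    return cells;
}

/* Spot-check a few values against entries of the table in the post. */
static void bmc_selfcheck(void)
{
    assert(bmc_encode_byte(0x00, 0) == 0xCCCC);
    assert(bmc_encode_byte(0x01, 0) == 0xB333);
    assert(bmc_encode_byte(0x02, 0) == 0xD333);
    assert(bmc_encode_byte(0xFF, 0) == 0xAAAA);
    /* Starting from level 1 gives the bitwise inverse of the entry. */
    assert(bmc_encode_byte(0x01, 1) == (uint16_t)~0xB333u);
}
```

With a start level of 0 every result begins with a set bit, matching the "1first" naming; with a start level of 1 the result is the bitwise inverse, which is the case the XOR with 0xFFFF handles in the conversion loop.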
/*
 * The algorithm packs the data like this:
 * (...) = single write
 * |     = 16-bit words of the original spdif
 * PPPP SSSS | SSSS) (SSSS | SSSS) (SSSS | SSSS) (VUCP
 * So partly it is packed in 16 bit (input samples)
 * and partly it takes all static data for a single
 * 32-bit write.
 * To implement 24-bit samples I would recommend splitting
 * the 32-bit write up into 16-bit writes again (it would be
 * minimally slower) and following the schema of the other
 * 2-byte writes.
 * The bmc lookup table is modified: every entry
 * starts with a '1' bit.
 * If the last bit of the previous entry is a '1' we need
 * to invert the next data lookup. To be fast we do this
 * with a lookup table too: "xor_lut".
 * Also for the static data there are only 2 possibilities,
 * depending on the ending bit of the previous data.
 * Instead of calculating this, I put it into a lookup table
 * again.
 * In both options for the static data the last bit of the parity
 * should be 0 (10 or 00). If we ensure this we have already
 * found the correct parity bit value, as the logic of the
 * rest defines it. If a block ends with 0 the parity is 0
 * and if a block ends with 1 the total parity is 1!
 * Depending on this the next bmc data will be not inverted or inverted.
 * This accumulates and is always correct! So we can
 * hardcode it into the 2 static data entries.
 * I think this algorithm is much more straightforward and more
 * like it was intended by the dsp designer.
 * Additionally it is about 20% faster than the second
 * fastest algorithm I could find! 60 us (this one) against
 * 88 us (other algorithm).
 * Additionally the last random bytes are now also in the 16-bit sample
 * stream, statically set to 0!
 * Moving the 512-byte LUT from const memory to RAM gains 2 us.
 * But take care: the static variables will not work for multiple
 * instances of spdif.
 */
void spdif_masterq(char *buffer, int len, uint32_t *target_buf)
{
    static uint8_t frame_num = 0;

    uint16_t *target_buf_16 = (uint16_t *)target_buf;

    uint32_t VUCP_PREAMBLE_BIT20_24_B_32[2] =
        {(VUCP_PREAMBLE_B_16_32 | 0xCCCC), ((VUCP_PREAMBLE_B_16_32 ^ 0xFF000000) | 0xCCCC) & 0xFEFFFFFF};
    uint32_t VUCP_PREAMBLE_BIT20_24_M_32[2] =
        {(VUCP_PREAMBLE_M_16_32 | 0xCCCC), ((VUCP_PREAMBLE_M_16_32 ^ 0xFF000000) | 0xCCCC) & 0xFEFFFFFF};
    uint32_t VUCP_PREAMBLE_BIT20_24_W_32[2] =
        {(VUCP_PREAMBLE_W_16_32 | 0xCCCC), ((VUCP_PREAMBLE_W_16_32 ^ 0xFF000000) | 0xCCCC) & 0xFEFFFFFF};

    static uint8_t vucp_idx = 0;
    uint8_t vucp_idx_local = vucp_idx;
    uint16_t xor_lut[2] = {0x0000, 0xFFFF};
    uint16_t hi, lo;
    len = len/4;
    while(len--)
    {
        if (++frame_num > 191)
        {
            *(uint32_t *)target_buf_16 = VUCP_PREAMBLE_BIT20_24_B_32[vucp_idx_local];
            target_buf_16+=2;
            frame_num = 0;
        }
        else
        {
            *((uint32_t *)target_buf_16) = VUCP_PREAMBLE_BIT20_24_M_32[vucp_idx_local];
            target_buf_16+=2;
        }

        lo = spdif_bmclookup_1first[(uint8_t)*buffer++];
        hi = spdif_bmclookup_1first[(uint8_t)*buffer++] ^ xor_lut[lo & 1];
        *target_buf_16++ = hi;
        *target_buf_16++ = lo;

        //Next half_samples
        *((uint32_t *)target_buf_16) = VUCP_PREAMBLE_BIT20_24_W_32[hi & 1];
        target_buf_16+=2;

        lo = spdif_bmclookup_1first[(uint8_t)*buffer++];
        hi = spdif_bmclookup_1first[(uint8_t)*buffer++] ^ xor_lut[lo & 1];
        *target_buf_16++ = hi;
        *target_buf_16++ = lo;

        //Remember the last bit of the last data in a static variable to
        //be able to continue every time :)
        vucp_idx_local = hi & 1;
    }
    vucp_idx = vucp_idx_local;
}
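The parity argument from the comment block can be checked exhaustively on a PC. The sketch below assumes the same packing as `spdif_masterq` (8 zero bits for aux and the unused LSBs, then the 16 sample bits, LSB first) and recomputes the BMC cells bit by bit instead of using the table. All helper names are hypothetical. It checks that the last cell of the high data word (`hi & 1` in the function above) equals the parity of the set bits in the sample, so indexing the two precomputed VUCP words with that bit selects the matching P bit:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: biphase-mark encode one byte, LSB first,
 * starting from the given line level. */
static uint16_t bmc_encode_byte(uint8_t value, unsigned start_level)
{
    uint16_t cells = 0;
    unsigned level = start_level & 1u;
    for (int bit = 0; bit < 8; bit++) {
        level ^= 1u;                              /* transition at every bit start */
        cells = (uint16_t)((cells << 1) | level);
        if ((value >> bit) & 1u) {
            level ^= 1u;                          /* mid-bit transition for a 1 */
        }
        cells = (uint16_t)((cells << 1) | level);
    }
    return cells;
}

/* 1 if the 16-bit sample has an odd number of set bits. */
static unsigned sample_parity(uint16_t sample)
{
    unsigned p = 0;
    while (sample) {
        p ^= sample & 1u;
        sample >>= 1;
    }
    return p;
}

/* Last BMC cell after the zero padding and both sample bytes. */
static unsigned last_cell_of_sample(uint16_t sample)
{
    uint16_t pad = bmc_encode_byte(0x00, 0);      /* 0xCCCC; the preamble ends at level 0 */
    uint16_t lo  = bmc_encode_byte((uint8_t)(sample & 0xFF), pad & 1u);
    uint16_t hi  = bmc_encode_byte((uint8_t)(sample >> 8), lo & 1u);
    return hi & 1u;
}

/* Asserts the claim for every possible 16-bit sample. */
static void check_parity_claim(void)
{
    for (uint32_t s = 0; s <= 0xFFFFu; s++) {
        assert(last_cell_of_sample((uint16_t)s) == sample_parity((uint16_t)s));
    }
}
```

The reason it works: every bit toggles the line once and every 1 toggles it a second time, so after an even number of bits the level differs from the starting level only if the number of 1 bits is odd.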

The validator.

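#include <stdbool.h>
#include <stdint.h>
#include "esp_log.h"

static const char *TAG = "TestProject";

void print_bits(char *target_buf, const uint32_t *data, int num)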
{
    for(uint8_t byte = 0; byte < num; byte++)
    {
        uint32_t out_byte = *data;
        for(uint8_t i = 0; i < 32; )
        {
            *target_buf = (out_byte & 0x80000000) ? '1' : '0';
            out_byte <<= 1;
            target_buf++;
            i++;
            if(i % 16 == 0)
            {
                *target_buf = ' ';
                target_buf++;
            }
        }
        data++;
    }
    *target_buf = '\0';
}

void print_sample(uint32_t *data)
{
    char buffer[300];
    print_bits(buffer, data, 4);
    ESP_LOGI(TAG, "V_U_C_P_P_P_P_P_ A_A_A_A_S_S_S_S_ S_S_S_S_S_S_S_S_ S_S_S_S_S_S_S_S_ V_U_C_P_P_P_P_P_ A_A_A_A_S_S_S_S_ S_S_S_S_S_S_S_S_ S_S_S_S_S_S_S_S_");
    ESP_LOGI(TAG, "%s", buffer);
}

//Let's check some "mono" samples
bool validate_bit_toogle(uint32_t *half_samples, int num)
{
    int j = 0;
    bool expected = 1;
    bool bit_set = 0;
    uint32_t *sample_ptr = half_samples;
    int pb_rest = -1;
    int pm_rest = -1;
    int pw_rest = -1;
    while(j < num/2)
    {
        uint32_t sample = *sample_ptr << 16;
        //Check A_A_A_A_S_S_S_S_
        for(uint8_t i = 0; i < 8; i++)
        {
            bit_set = (sample & 0x80000000);
            if(bit_set != expected)
            {
                ESP_LOGE(TAG, "Failed dword2");
                return false;
            }
            sample <<= 1;
            bit_set = (sample & 0x80000000);
            expected = !bit_set;
            sample <<= 1;
        }
        sample_ptr++;

        //check 2x S_S_S_S_S_S_S_S_
        sample = *sample_ptr;
        for(uint8_t i = 0; i < 16; i++)
        {
            bit_set = (sample & 0x80000000);
            if(bit_set != expected)
            {
                ESP_LOGE(TAG, "Failed dword3 or dword4, halfsample: %d bit: %d", (int)(sample_ptr - half_samples), i);
                print_sample(half_samples + (j*2));
                return false;
            }
            sample <<= 1;
            bit_set = (sample & 0x80000000);
            expected = !bit_set;
            sample <<= 1;
        }
        sample_ptr++;

        j++;
        if(j >= num/2)
        {
            ESP_LOGI(TAG, "Check completed");
            return true;
        }

        //check vucp
        sample = *sample_ptr;
        for(uint8_t i = 0; i < 4; i++)
        {
            bit_set = (sample & 0x80000000);
            if(bit_set != expected)
            {
                ESP_LOGE(TAG, "Failed dword0 vucp sample: %d", j);
                print_sample(half_samples + ((j - 1)*2));
                print_sample(half_samples + (j*2));
                return false;
            }
            sample <<= 1;
            bit_set = (sample & 0x80000000);
            expected = !bit_set;
            sample <<= 1;
        }

        //Check preamble
        uint8_t preamble = sample >> 24;
        if(preamble == 0xE8) //B
        {
            if(pb_rest < 0)
            {
                pb_rest = j%384;
            }
            if(j%384 != pb_rest)
            {
                ESP_LOGE(TAG, "Found broken preamble B");
                return false;
            }
        }
        else if(preamble == 0xE2) //M
        {
            if(pm_rest < 0)
            {
                pm_rest = j%2;
            }
            if(j%2 != pm_rest)
            {
                ESP_LOGE(TAG, "Found broken preamble M. rest should: %d, but is: %d, sample: %d", pm_rest, j%2, j);
                return false;
            }
            if(pb_rest >= 0 && j%384 == pb_rest)
            {
                ESP_LOGE(TAG, "This should be a pb_rest");
                return false;
            }
        }
        else if(preamble == 0xE4) //W
        {
            if(pw_rest < 0)
            {
                pw_rest = j%2;
            }
            if(j%2 != pw_rest)
            {
                ESP_LOGE(TAG, "Found broken preamble W. rest should: %d, but is: %d, sample: %d", pw_rest, j%2, j);
                return false;
            }
            if(pb_rest >= 0 && j%384 == pb_rest)
            {
                ESP_LOGE(TAG, "This should be a pb_rest");
                return false;
            }
        }
        else
        {
            ESP_LOGE(TAG, "preamble broken: 0x%02X", preamble);
            return false;
        }

        if(!expected)
        {
            ESP_LOGE(TAG, "wrong expectation for next bit");
            return false;
        }

        //TODO: remove this
        expected = true;
    }
    if(pw_rest < 0 || pb_rest < 0 || pm_rest < 0)
    {
        //This does not have to be an error for small datasets!
        ESP_LOGW(TAG, "Did not see all preamble, problem is possible");
        return false;
    }
    ESP_LOGW(TAG, "Did not expect to reach this");
    return true;
}
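The validator above only checks that every bit cell starts with a level change; it does not recover the data, which is the step listed under "What was not tested". Here is a minimal sketch of that missing step, assuming the cell layout used throughout this post (two cells per bit, first cell in the MSB of the 16-bit word, data bits LSB first). `bmc_decode_word` and `decode_selfcheck` are hypothetical names, not part of the original code:

```c
#include <assert.h>
#include <stdint.h>

/* Decode 16 BMC cells back into 8 data bits.
 * A bit is 1 when its two cells differ and 0 when they are equal,
 * so the result does not depend on the polarity of the word. */
static uint8_t bmc_decode_word(uint16_t cells)
{
    uint8_t value = 0;
    for (int bit = 0; bit < 8; bit++) {
        unsigned first  = (cells >> (15 - 2 * bit)) & 1u;
        unsigned second = (cells >> (14 - 2 * bit)) & 1u;
        if (first != second) {
            value |= (uint8_t)(1u << bit);
        }
    }
    return value;
}

/* Spot-check against entries of spdif_bmclookup_1first. */
static void decode_selfcheck(void)
{
    assert(bmc_decode_word(0xCCCC) == 0x00);
    assert(bmc_decode_word(0xB333) == 0x01);
    assert(bmc_decode_word(0x4CCC) == 0x01);      /* inverted 0xB333 */
    assert(bmc_decode_word(0xCCCD) == 0x80);
    assert(bmc_decode_word(0xAAAA) == 0xFF);
}
```

Running the two 16-bit data words of each subframe through such a decoder and comparing the result with the input bytes would cover the round-trip comparison.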

Hopefully this helps somebody!

Best regards and happy hacking!

sle118 commented 1 year ago

Thank you for your contribution. Hopefully this will be something that Philippe will take the time to consider when he has a chance.

philippe44 commented 1 year ago

Thanks, seems great - have you tried it with the complete squeezelite-esp32 to measure the global CPU gain? I've not checked how much this loop weighs in the CPU load vs. the fact that we have to 4x the I2S rate

masterxq commented 1 year ago

I'm very sorry, I have no environment for testing it and have very limited knowledge about your project. I just read the i2s/spdif part, which was really helpful for understanding what I have to do... I have written my own libs and am using some ADF libs. That is maybe a bit sad, because I think there is high-quality code available in your project, and I fixed the same issues as you did. But the problem remains the same: I have to know what issues exist before they can be fixed, and after understanding the issues it is often easier to write the solution than to search for and integrate existing solutions. And it's less difficult to fix upcoming issues ^^

But what I can do is send you my complete testing app. Then you can implement your algorithm correctly, or make the test function respect the offset in your generated data and handle it correctly, which is probably the cause of the failed validation. Additionally, you can test the performance of upcoming code. The app is just quick and dirty but meets the requirements; I only wanted to do basic tests, as this is only 1% of my project ^^

If you want we can stay in contact to avoid redundant work in the future :)

My focus is hardware solutions, and I'm facing audio hardware from time to time.

masterxq commented 1 year ago

> Thanks, seems great - have you tried it with the complete squeezelite-esp32 to measure the global CPU gain? I've not checked how much this loop weighs in the CPU load vs. the fact that we have to 4x the I2S rate

Sorry, my other answer is still correct, but I reread your question, and I do have at least some information about the CPU usage: it needs about 10-12% (FreeRTOS task manager) of a single core in my application. But I'm still using i2s_write, which copies the buffer once more. That is not needed: as far as I remember from another project, it should be possible to just queue some buffers without copying them once more and hand them directly over to DMA (no CPU), which makes much more sense. I can't remember what you are doing... The last measurement was before I wrote the new algorithm; I think it should save about 1-2%, which is OK but does not solve problems in most cases. For me, it was more important to understand what's happening and have my data under control, being able to handle variable streams and have clean samples. Maybe this can solve problems for some people; at least I have a better feeling with it :) (related to my projects and my understanding of my projects). Ofc I completely see your argument that noise in the LSBs cannot be heard!

Best Regards

masterxq commented 1 year ago

I really wonder, but it seems I remembered wrong; here are my esp32 CPU stats. And I'm still using i2s_write. My modifications should not be in a visible range if the total usage is 2%... But OK, vTaskGetRunTimeStats measures the stats since boot, so depending on how long after boot I start playing a song and how long I wait before calling vTaskGetRunTimeStats, the stats will vary. However, I2S will never reach 10% or more.

(44.1 kHz 320 kBit/s mp3, streamed over https)

I was so sure it was more than 10%. But OK, it's not, at least no longer ^^

Maybe I made a mistake in my code during my last measurement.
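A rough cross-check of these figures, using only numbers from this thread: the benchmark converts a 1024-byte buffer (256 stereo frames) per run, and each 4-byte input frame becomes four 32-bit output words. The sketch assumes the benchmark times are per converted buffer and that playback runs at 44.1 kHz; the function names are invented for the illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Output bytes produced for a given number of 16-bit stereo input bytes:
 * every 4-byte frame turns into 4 x 32-bit words (2 subframes x 64 cells). */
static uint32_t spdif_output_bytes(uint32_t input_bytes)
{
    return (input_bytes / 4u) * 16u;
}

/* Share of one core (in percent) spent converting, ASSUMING the measured
 * time is per buffer and the buffer holds `frames` frames at `rate_hz`. */
static double conversion_cpu_percent(double us_per_buffer, uint32_t frames, double rate_hz)
{
    double buffer_us = (double)frames * 1e6 / rate_hz;   /* audio time per buffer */
    return 100.0 * us_per_buffer / buffer_us;
}

/* The 1024-byte test buffer grows by a factor of four. */
static void check_output_size(void)
{
    assert(spdif_output_bytes(1024u) == 4096u);
}
```

Under those assumptions `conversion_cpu_percent(56.0, 256, 44100.0)` comes out at roughly 1 % and `conversion_cpu_percent(88.0, 256, 44100.0)` at roughly 1.5 %, which would fit the observation that the difference is hard to see in the task statistics.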

sle118 commented 1 year ago

@masterxq do you feel it's something worth exploring?

masterxq commented 1 year ago

> @masterxq do you feel it's something worth exploring?

As the algorithm is complete, and it will save bus time and CPU time, give better samples (the least significant bits are not random), and be easier to maintain (less hacky) and expand (e.g. other sample widths), I would implement it, maybe not as a high-priority task, but I would do it :) And I have it in my audio toolchain. It has been working flawlessly from day one.

But at the end it is up to you :) I just want to share my code, as yours helped me to understand the problem :)

sle118 commented 1 year ago

@masterxq thank you for exploring these improvements. I'll see if/when we can get to it.

sle118 commented 1 year ago

Given that the solution is 16 bits, this isn't going to work for us. Closing

sle118 / squeezelite-esp32

SPDIF algorithm #221