Peculiar decode matrices being constructed by build_decode_matrix_into_space()

chrisstjohn commented 1 year ago

I'm interested in pre-calculating some decode matrices so I used some of the "under the hood" functions of fec.c

With a trivial example of k=2, n=3 I get the following:

static const uint8_t encoder_2_3[6] =
{
    0x01, 0x00, 
    0x00, 0x01, 
    0x03, 0x02, 
};

static const uint8_t decoder_2_3_0x3[4] =
{
    0x01, 0x00, 
    0x00, 0x01, 
};

static const uint8_t decoder_2_3_0x5[4] =
{
    0x01, 0x00, 
    0x8f, 0x8e, 
};

static const uint8_t decoder_2_3_0x6[4] =
{
    0x01, 0x00, 
    0x8f, 0x8e, 
};

there are three erase cases identified by a bitmap

0x3 = parity loss
0x5 = second data loss
0x6 = first data loss

but note that the 0x5 and 0x6 decode matrices are identical - which is wrong.

I think I traced the problem to here build_decode_matrix_into_space() which does some weird stuff whilst trying to decimate the encoding matrix:

void
build_decode_matrix_into_space(const fec_t*restrict const code, const unsigned*const restrict index, const unsigned k, gf*restrict const matrix) {
    unsigned char i;
    gf* p;
    for (i=0, p=matrix; i < k; i++, p += k) {
        if (index[i] < k) {
            memset(p, 0, k);
            p[i] = 1;
        } else {
            memcpy(p, &(code->enc_matrix[index[i] * code->k]), k);
        }
    }
    _invert_mat (matrix, k);
}

I removed the apparent "optimisation" and it all springs into life:

void
build_decode_matrix_into_space(const fec_t*restrict const code, const unsigned*const restrict index, const unsigned k, gf*restrict const matrix) {
    unsigned char i;
    gf* p;
    for (i=0, p=matrix; i < k; i++, p += k) {
        memcpy(p, &(code->enc_matrix[index[i] * code->k]), k);
    }
    _invert_mat (matrix, k);
}

Now this is really really really old code and pretty central the the whole decoder; am I missing something or is it broken?

WojciechMigda commented 1 year ago

What values did you pass with index array?

chrisstjohn commented 1 year ago

What values did you pass with index array?

0x3 -> { 0, 1 } 0x5 -> { 0, 2 } 0x6 -> { 1, 2 }

an alternative fix for me is to change p[i] = 1 to p[index[i]] = 1 since I think it's trying to make a row of identity rather than copy it from the encoding matrix

WojciechMigda commented 1 year ago

That is a good find and fix! I cannot explain where did 0x8f, 0x8e rows come from in decoder_2_3_0x5 and decoder_2_3_0x6 matrices. These rows should be copied from the encoder matrix, but instead the look like some garbage values.

chrisstjohn commented 1 year ago

Those rows originate from the encoder matrix which is then inverted so they're not garbage - decoder_2_3_0x5 at least really is the inverse of {{ 1, 0 } { 3, 2 }}

I'm still puzzled how on earth this could work for anybody else but since I'm not using zfec for its intended purpose - just borrowing some of its math - I can't comment whether or not it does work for other people.

@WojciechMigda do you want me to submit a PR? Looks like there are a ton of unit tests that would check this for us?

chrisstjohn commented 1 year ago

One possibility is that the code that normally uses this function re-orders the shards to replace data with parity.

e.g. Encode A, B -> X, Y, Z lose X Decode Z, Y -> A, B

whereas I'm trying to decode Y, Z -> A, B

WojciechMigda commented 1 year ago

Those rows originate from the encoder matrix which is then inverted so they're not garbage - decoder_2_3_0x5 at least really is the inverse of {{ 1, 0 } { 3, 2 }}

Oh, I thought these are values right after the loop.

I'm still puzzled how on earth this could work for anybody else but since I'm not using zfec for its intended purpose - just borrowing some of its math - I can't comment whether or not it does work for other people.

There are some tests and they pass. In particular the Haskell ones use QuickCheck to construct input data which is then subject to roundtrip encode/decode, and they pass too.

do you want me to submit a PR? Looks like there are a ton of unit tests that would check this for us?

I am not a maintainer of this project, I just happen to be on the subject for the past few weeks. As for the existing tests it would be worth checking if they adequately cover this scenario.

One possibility is that the code that normally uses this function re-orders the shards to replace data with parity.

I don't see anything like that in zfec --- fec_decode calls build_decode_matrix_into_space right away. On the other hand, original Luigi Rizzo's code does have an additional step: function called shuffle is called before build_decode_matrix. We'd need to confirm that it does the reordering you've mentioned.

WojciechMigda commented 1 year ago

@chrisstjohn maybe you already figured this out by yourself, but anyway I looked closely into this and these are my conclusions:

The code should stay as it is. fec_decode requires that block indices passed with index array are ordered to satisfy this assertion:

(index[i] >= k) || (index[i] == i)

(https://github.com/tahoe-lafs/zfec/blob/master/zfec/fec.c#L524)

If the above does not hold then fec_decode will abort execution. [1]

Original Rizzo's fec did reorder indices within fec_decode (index contents was mutable), but in zfec the same reordering was relocated to python API wrapper: https://github.com/tahoe-lafs/zfec/blob/master/zfec/_fecmodule.c#L454 .
With blocks that fec_decode works on being in order which satisfies that index[i] == i it becomes obvious that p[i] = 1 is equivalent to p[index[i]] = 1.

[1] Unless assertions are disabled, but this is not a default setup.

chrisstjohn commented 1 year ago

Thanks @WojciechMigda for your analysis and thoughts.

My feeling is that an ordering requirement buried deep in a 'C' library that is implemented way outside in the Python wrapper is a rather brittle design. As you say, right now there is an assertion in fec_decode but I can't see the harm of

changing the assertion to assert(index[i] < code->n)
changing p[i] to p[index[i]]

As I agree that right now, in this case, with this Python wrapper, index[i] == i

Even in terms of performance there probably wouldn't be an extra indirection as index[i] is already fetched and available to the compiler.

tahoe-lafs / zfec

Peculiar decode matrices being constructed by build_decode_matrix_into_space() #72