glenvt18 / libdvbcsa

GNU General Public License v2.0

Encrypting using multiple keys in parallel #3

Open mkaluza opened 6 years ago

mkaluza commented 6 years ago

Hi @glenvt18 Since you probably know this library very well, I hope my question won't be a problem for you. I know that presently the bitslice algorithm can process many packets at once, but only with one key. Is that inherent to the method, or did nobody ever need multiple keys, so it simply isn't implemented (although it could be)?

Our use case is that we'll be scrambling an MPTS in real time, and each service has different keys. Also, our batches are small (the entire mux output buffer is only 128 packets), so right now much of the processing power would go to waste, because we'd never have 128 packets from one service at a time.

So my question is: can the algorithm encrypt, e.g., 2x64 or 4x32 (in SSE mode) with two/four different keys at once?

mkaluza commented 6 years ago

And thank you for your work - that kind of speedup is awesome ;) We have a Pentium G4560 as a test machine and it achieves 1377 Mbps in the bitslice bench using SSSE3 :)

It's a pity your patches weren't merged into the original lib...

Have you ever thought about AVX? :) Not that it isn't fast enough already - just for the sake of doing it :)

glenvt18 commented 6 years ago

Hi @mkaluza At first glance, multi-keying should work. It will require setting individual bytes in the expanded key's bs_word using masks. It is possible to have up to BS_BATCH_BYTES keys and as few as 8 packets per batch. With the current implementation, packets would have to be interleaved. For example, a 4x32 batch:

s0[0] s1[0] s2[0] s3[0]    s0[1] s1[1] s2[1] s3[1]    ...   s0[31] s1[31] s2[31] s3[31]

where sN[i] means i-th packet of the N-th stream.
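For illustration, that interleaving could be done with a helper like this (a sketch only, hard-coded for the 4x32 case; interleave_batch is a made-up name, dvbcsa_bs_batch_s is the library's real batch struct):

#include <stddef.h>
#include <dvbcsa/dvbcsa.h>

#define N_STREAMS 4
#define PKTS_PER_STREAM 32

/* Interleave 4 streams of 32 packets each into one 128-entry batch
 * (the batch array needs 128 + 1 entries for the terminator). */
void interleave_batch(struct dvbcsa_bs_batch_s *batch,
                      struct dvbcsa_bs_batch_s s[N_STREAMS][PKTS_PER_STREAM])
{
  for (int i = 0; i < PKTS_PER_STREAM; i++)
    for (int n = 0; n < N_STREAMS; n++)
      batch[i * N_STREAMS + n] = s[n][i];   /* s0[i] s1[i] s2[i] s3[i] ... */
  batch[N_STREAMS * PKTS_PER_STREAM].data = NULL;  /* terminate the list */
}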

Keys can be set up with repeated calls to something like

void dvbcsa_bs_key_set_mx(const dvbcsa_cw_t cw, struct dvbcsa_bs_key_s *key,
                          int start, int len)

where start and len define the mask. For 4x32, the (start, len) pairs could be:

start=0, len=4 
start=4, len=4 
start=8, len=4 
start=12, len=4
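Spelled out with the proposed (not yet implemented) signature above, and hypothetical control words cw_a..cw_d, that would be:

dvbcsa_bs_key_set_mx(cw_a, key, 0, 4);   /* stream A: slots 0-3   */
dvbcsa_bs_key_set_mx(cw_b, key, 4, 4);   /* stream B: slots 4-7   */
dvbcsa_bs_key_set_mx(cw_c, key, 8, 4);   /* stream C: slots 8-11  */
dvbcsa_bs_key_set_mx(cw_d, key, 12, 4);  /* stream D: slots 12-15 */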

And you'll have to keep the length of the pcks[] array equal to the batch size (128 for sse). In other words, pcks[i].data shouldn't be NULL for any 0 <= i < 128. It won't hit performance much. Any suggestions?

AVX (256, 512) should work out-of-the-box after adding AVX intrinsics. If you can test it, I'll add them (or you can submit a PR). I just don't have HW to test.
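By analogy with the existing dvbcsa_bs_sse.h, the AVX2 word layer would look roughly like this (a sketch only - macro names follow the library's pattern, but the exact set required should be taken from the real headers):

#include <immintrin.h>

typedef __m256i dvbcsa_bs_word_t;

#define BS_BATCH_SIZE 256   /* packets per batch = word width in bits  */
#define BS_BATCH_BYTES 32   /* key slots per word = word width in bytes */

#define BS_VAL8(n)   _mm256_set1_epi8(0x##n)            /* broadcast a byte */
#define BS_AND(a, b) _mm256_and_si256((a), (b))
#define BS_OR(a, b)  _mm256_or_si256((a), (b))
#define BS_XOR(a, b) _mm256_xor_si256((a), (b))
#define BS_NOT(a)    _mm256_andnot_si256((a), BS_VAL8(ff))
#define BS_SHL(a, n) _mm256_slli_epi64((a), (n))        /* 64-bit lane shifts, */
#define BS_SHR(a, n) _mm256_srli_epi64((a), (n))        /* as in the SSE header */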

nto commented 6 years ago

If someone comes up with AVX patches (both AVX2 and 512), I can test them.

mkaluza commented 6 years ago

@glenvt18 Thanks for the reply :) I'll look into it.

As for AVX - I looked at the header files and it doesn't seem complicated. Right now I know nothing about SIMD programming, but maybe it's a good time to learn something new. I have an i5-6400 and an E3-2603 available for testing and benchmarks, and they both support AVX2, so I have something to work with. No AVX512 though :(

It seems that for some SSE instructions AVX just made them wider, so maybe it won't be difficult. I'll get back when I have something :)

Thanks for your help

mkaluza commented 6 years ago

@glenvt18 I tried to understand what dvbcsa_bs_key_set does and to create dvbcsa_bs_key_set_mx based on it, but I failed... Could you help me with that, if it's not too much trouble? I created a test case for it (I hope I got the packet interleave right) and a stub, but I have no idea how to write the function itself :/

https://github.com/mkaluza/libdvbcsa/tree/multikey

glenvt18 commented 6 years ago

@mkaluza Sure, I'll do all this work myself. You'll only have to run tests. Is the suggested API OK with you? I really don't want to bloat the BS key buffer. I want to keep working buffers as small as possible to minimize cache misses. I think it's better to extend the existing test framework (testbitslice).

mkaluza commented 6 years ago

I agree that memory structures should remain as optimal as possible.

But first I'd like to ask about key setting, to understand what is possible and what is not, so that I can figure out how to use it and what to avoid. For this example, let's assume a 64-bit word width, which means at most 8 keys and batches 8 packets wide.

From most important to least important:

1) Are there any key patterns I can or cannot set? Let's say I have 4 streams (4 keys) with significantly varied bandwidth, and I want to allocate batches proportionally to that, like so: AAAA BB C D. Is that possible? Are all widths allowed, or only powers of two? Could I set AAA BB CC D?

2) How computationally intensive is key setting (just roughly - much or little)? If it's cheap, 3) and 3a) don't matter - we can simply always redo everything.

3) Can I modify the key? Let's say I have AAAA BB C D and I want AAA BBB C D - will it be enough to change key[3], or does everything always have to be recalculated?

3a) If a key can be modified, can it be moved? If I have AAAA BB C D and want AAA BB CC D, could this be done with key[3:4] = key[4:5]; key[5] = C;, or does it have to be key[3] = B; key[5] = C? In this example it makes little difference, but there could easily be a pathological case with 32 (or 64 with AVX512) keys where you'd like to expand the last one at the expense of the first, which would mean 30 recalculations vs 1.

Points 3) and 3a) are absolutely optional - I just want to know what is possible and what is not.

The way I would like to use it, at a high level, is something like this:

struct cw_list {
  uint8_t cw[8];
  uint8_t len;      // number of bytes, not bits; sums to at most BS_BATCH_BYTES
};
dvbcsa_bs_key_set_multi(struct cw_list cw_list[], struct dvbcsa_bs_key_s *key);

where the cw_list array is up to BS_BATCH_BYTES long and is either NULL-terminated or a length parameter is required - makes no difference.

I would extend struct dvbcsa_bs_key_s with uint8_t len[BS_BATCH_BYTES]; uint8_t cw_count; to store the batch sizes passed to the dvbcsa_bs_key_set_multi function. I don't think this addition will hurt cache performance, as it would sit at the end of the (relatively large) structure and not be touched during normal operation, but correct me if I'm wrong.

If key modification is possible, I'd also add uint8_t cw[BS_BATCH_BYTES][8] to struct dvbcsa_bs_key_s and a function to update the keys (can't imagine the syntax right now; besides, it's completely optional for now).

That high level function would call the function you write in any way you see fit.

As for de/encryption, I'd like to encapsulate packet interleaving inside the API, i.e. I'd like to pass an array where each batch of packets is contiguous, and have the API (multikey versions of dvbcsa_pkt_buf_load and dvbcsa_pkt_buf_store) handle storing them in the buffer and extracting them in the right order (packet interleaving is an implementation detail - the user shouldn't need to know about it, especially if packets are copied anyway).

I'd like the API to accept NULLs as end-of-batch markers - the same way it now treats the end of the entire list. This way the method signature could stay the same (no need to pass packet counts per batch, and no need to set fake pointers and copy garbage data back and forth). The behaviour of the old single-key encryption wouldn't change either - it would just be a single NULL-terminated batch of BS_BATCH_BYTES width.

And everything below that (dvbcsa_bs_stream_cipher_batch, dvbcsa_bs_block_decrypt_batch, dvbcsa_bs_block_encrypt_batch) is up to you.

There... That's the idea; I'm open to discussion :) If you're fine with it, I'll start implementing it.

glenvt18 commented 6 years ago

@mkaluza I was just about to ask you about bandwidth allocation, but you asked first :)

Are there any key patterns I can or cannot set? Lets say I have 4 streams (4 keys), with significantly varied bandwidth and I want to allocate batches proportionally to that like so: AAAA BB C D Is it possible? Are all widths allowed, or i.e. only power of two? Could I set AAA BB CC D?

Yes. You can allocate freely within a bs_word. That is, for 64 bits you have 8 slots and can allocate any set within them. Constraints: len >= 1, start >= 0, start + len <= 8. If these are not met, I'll call abort(). For example:

AAAAABBC
dvbcsa_bs_key_set_mx(0, 5);
dvbcsa_bs_key_set_mx(5, 2);
dvbcsa_bs_key_set_mx(7, 1);

It's just like setting a bit mask. You can even allocate AAABBBAB.

How intensive computationally is key setting (just roughly - much or little)? Because if it's not, 3) and 3a) don't matter - we can simply always do everything.

Key setup is cheap. You can call it before every encrypt(). This might be helpful when you synchronize key updates: you just have to copy 8 bytes (cw). I've just called it 40 times before every encrypt and didn't notice any change. Run benchbitsliceks to get exact numbers. dvbcsa_bs_key_alloc() might not be that cheap.

Can I modify the key? let's say I have AAAA BB C D and I want to have AAA BBB C D - will it be enough to change key[3] or do we have to recalculate everything always?

Yes: dvbcsa_bs_key_set_mx(2, 1). Think of it like setting a bit mask.

I would extend struct dvbcsa_bs_key_s with uint8_t len[BS_BATCH_BYTES]; uint8_t cw_count; to

dvbcsa_bs_key_s* is an opaque pointer in dvbcsa.h. You don't have access to its members.

My suggestion:

/* stream */
struct dvbcsa_bs_mx_slots_s {
  int start;
  int len;
  dvbcsa_bs_batch_s *pcks; /* NULL terminated */
};

/* slots are a NULL terminated array */
void dvbcsa_bs_key_set_mx(const dvbcsa_cw_t cw, struct dvbcsa_bs_key_s *key,
                          struct dvbcsa_bs_mx_slots_s *slots);

void dvbcsa_bs_encrypt_mx(const struct dvbcsa_bs_key_s *key,
                       const struct dvbcsa_bs_mx_slots_s *slots,
                       unsigned int maxlen);

This ensures you use the same slots for key setup and encryption.

EDIT Or even better

/* stream */
struct dvbcsa_bs_mx_stream_s {
    dvbcsa_cw_t cw;  /* key for the stream */
    int start_slot;
    int n_slots;
    dvbcsa_bs_batch_s *pcks; /* data=NULL terminated */
};

/* streams are NULL terminated array */
void dvbcsa_bs_key_set_mx(struct dvbcsa_bs_key_s *key,
                          struct dvbcsa_bs_mx_stream_s *stream);

void dvbcsa_bs_encrypt_mx(const struct dvbcsa_bs_key_s *key,
                          const struct dvbcsa_bs_mx_stream_s *streams,
                          unsigned int maxlen);

EDIT. This is also possible (and looks better):

/* stream */
struct dvbcsa_bs_mx_stream_s {
    dvbcsa_cw_t cw;  /* key for the stream */
    int start_slot;
    int n_slots;
    dvbcsa_bs_batch_s *pcks; /* data=NULL terminated */
};

void dvbcsa_bs_key_set_mx(struct dvbcsa_bs_key_s *key,
                          struct dvbcsa_bs_mx_stream_s *stream);

void dvbcsa_bs_encrypt_mx(const struct dvbcsa_bs_key_s *key,
                          const struct dvbcsa_bs_mx_stream_s *streams,
                          int n_streams,
                          unsigned int maxlen);

You have to keep one dvbcsa_bs_key_s buffer per thread (call dvbcsa_bs_key_alloc() once). Think of it as space on the thread's stack.
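A rough usage sketch of that last variant (pcks_a/pcks_b stand for your NULL-terminated batch arrays; treat this as illustrative, not final API):

/* One key buffer per thread. */
struct dvbcsa_bs_key_s *key = dvbcsa_bs_key_alloc();

/* Two streams sharing an 8-slot (64-bit) word: A gets 6 slots, B gets 2.
 * Fill .cw with each stream's 8 control word bytes. */
struct dvbcsa_bs_mx_stream_s streams[2] = {
  { .cw = {0}, .start_slot = 0, .n_slots = 6, .pcks = pcks_a },
  { .cw = {0}, .start_slot = 6, .n_slots = 2, .pcks = pcks_b },
};

for (int i = 0; i < 2; i++)
  dvbcsa_bs_key_set_mx(key, &streams[i]);   /* expand each stream's key */

dvbcsa_bs_encrypt_mx(key, streams, 2, 184); /* one batched encrypt call */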

mkaluza commented 6 years ago

Ok, it seems fine with a few adjustments below (moved cw into the slot struct, removed the plural 's' from the name). This way it'll be easy for the user, as the key, the batch description, and the pointer will all be in one place.

Key setting is indeed cheap - my CPU can do 4.2M key setups/s, which gives 0.25 us, while the whole cycle takes around 180 us. I don't care about alloc, as I assume I only have to do it once.

/* stream */
struct dvbcsa_bs_mx_slot_s {
  int start;
  int len;
  dvbcsa_cw_t cw;
  dvbcsa_bs_batch_s *pcks; /* NULL terminated */
};

/* slots are NULL terminated array */
void dvbcsa_bs_key_set_mx(struct dvbcsa_bs_mx_slot_s *slots,
                          struct dvbcsa_bs_key_s *key);

void dvbcsa_bs_encrypt_mx(const struct dvbcsa_bs_key_s *key,
                          const struct dvbcsa_bs_mx_slot_s *slots,
                          unsigned int maxlen);

I'm only thinking about renaming len to width - it really is a width, and it would also be less confusing: len might be mistaken for the length of pcks. Or size?

glenvt18 commented 6 years ago

@mkaluza How about my latest (edited) suggestion with dvbcsa_bs_mx_stream_s, n_slots, n_streams?

mkaluza commented 6 years ago

looks very good :)

I only wonder why dvbcsa_bs_encrypt_mx has an n_streams param while dvbcsa_bs_key_set_mx doesn't. I'd move it to dvbcsa_bs_key_set_mx, which could store it inside struct dvbcsa_bs_key_s - after all, it's something that won't change independently of the key, so it can be provided once and stored there. It's part of the stream structure that you define and change only when setting the keys. On the other hand, keeping data pointers inside struct dvbcsa_bs_mx_stream_s and accessible to the user, as you suggested, is justified, as they can (and will) change; but as long as the structure remains the same, everything is ok.

And struct dvbcsa_bs_key_s would remain an opaque pointer - the number of streams would only be passed as a param to dvbcsa_bs_key_set_mx and later retrieved by the encrypt/decrypt functions, so it'd be used only internally.

glenvt18 commented 6 years ago

@mkaluza

I only wonder why dvbcsa_bs_encrypt_mx has n_streams param, while dvbcsa_bs_key_set_mx doesn't?

dvbcsa_bs_key_set_mx takes a single stream.

I'd move it to dvbcsa_bs_key_set_mx, which can store it inside struct dvbcsa_bs_key_s

dvbcsa_bs_key_s doesn't care about the number of streams. The allocated storage is always the same constant size (~batch size x bs_word size). Its purpose is storing expanded keys in internal, properly aligned buffers.

mkaluza commented 6 years ago

ok, it's clear now :)

There's one more thing. Would it be possible to add bounds checking to mx equivalent of dvbcsa_pkt_buf_load/store? Here: https://github.com/glenvt18/libdvbcsa/blob/master/src/dvbcsa_bs_algo.c#L48. So that when it is copying packets, it'll stop when encountering NULL OR when i == stream->n_slots * 8.

Why? Knowing that I can never process more than BS_BATCH_SIZE packets at a time anyway, I can allocate one pcks[BS_BATCH_SIZE] array and only update the stream->pcks pointers. But if a stream were completely filled, there would be no space for the NULL terminator (that position would be the start of another stream).

That way NULL would only be used to signal that there are fewer packets than fit the stream, but the load would never read past the end (which I think is a good thing in general...).
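Roughly what I mean (just a sketch; load_packet stands in for the copying actually done in dvbcsa_bs_algo.c):

/* Stop at a NULL data pointer OR at the stream's slot capacity. */
unsigned int max_pcks = stream->n_slots * 8;  /* 8 packets per slot */

for (i = 0; i < max_pcks && stream->pcks[i].data != NULL; i++)
  load_packet(pkt_buf, i, stream->pcks[i].data, stream->pcks[i].len);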

glenvt18 commented 6 years ago

@mkaluza In this case it's easier to add n_pcks to dvbcsa_bs_mx_stream_s. And it is more flexible as well. Do you agree?

mkaluza commented 6 years ago

I thought leaving NULL termination and adding a bounds check would be an extension of the old API, to avoid confusion, but I'm not attached to any solution as long as it gets the job done :) n_pcks is more explicit, NULL is more 'the old way', but they're equivalent - it's your lib, pick the one you like more :)

EDIT: If I were to decide, I'd leave NULL + bounds check

glenvt18 commented 6 years ago

@mkaluza Well, I have to check n_pcks anyway using min(n_pcks, max_packets_per_stream(n_slots)).

Here is how your application can use the API: https://pastebin.com/H27aeUzQ

Some notes:

encrypt() is called right after one of the stream batches is full. At this moment the fill ratio of other streams will be less than 100%. So, you'll have some performance drop.

You might consider using an array of pointers to stream structures, that is, dvbcsa_bs_mx_stream_s **streams. That can be NULL-terminated.

mkaluza commented 6 years ago

Great :) Thanks for the example. It'll certainly be useful and can become part of docs/examples later as well.

glenvt18 commented 6 years ago

@mkaluza That example just shows a simple and straightforward approach. And you'll always have some performance penalty unless you apply a smarter bandwidth allocation algorithm. The main difficulty with a multi-key application is that, at some point, all streams must be encrypted simultaneously at one encrypt() call. In terms of performance, single-key approach with some buffering is certainly better. The only situation when it can't be used is when you have a huge number of very-low-bitrate streams and can't afford a large processing delay.

What about

void dvbcsa_bs_encrypt_mx(const struct dvbcsa_bs_key_s *key,
                          const struct dvbcsa_bs_mx_stream_s * const *streams,
                          int n_streams,
                          unsigned int maxlen);

(array of pointers to a stream structure)? I think this is more flexible. Should it be NULL-terminated?

mkaluza commented 6 years ago

Yes, it is more flexible, but IMHO it's too flexible. Streams are not something you change often - when you start a transmission you usually know what you're sending, and it changes rarely, if ever, during the lifetime of the application. It's rare enough that you can afford to just realloc everything, or even restart the application (we do). Also, compared to a live stream change, merging old and new configurations in C, where you have pointers to pointers to pointers, may be tricky; it may simply be easier to drop it all and load from scratch.

So from my point of view an array of up to BS_BATCH_BYTES streams and n_streams parameter is good enough.

You seem to be the "let's do it perfectly the first time" type ;) Unfortunately I'm not, and at some point I just have to try things out and see what works and what doesn't, because otherwise I get lost deliberating countless solutions that are all very similar. And although it sometimes results in code rewrites, I find it very hard to work any other way.

So here's my suggestion for now - let's leave it as it was here: https://github.com/glenvt18/libdvbcsa/issues/3#issuecomment-359179167 and see how it works out. Even if it turns out something needs to change later, I'll do it. After all, it'll all be higher-level functions that call either set_cw(key, cw, start, width) or load/store_pkts(dst, dvbcsa_bs_batch_s *pcks, start, width), and those low-level functions won't change. Besides, you don't have to commit it to master at once, and since in the beginning we'll probably be the only ones using it (or even only me), I think we can live with API instability for some time. Do you agree?

As for bandwidth allocation (in the scrambler), in the beginning the algorithm will be even simpler due to the way our hardware works: the modulator has a relatively large buffer (128 packets), so we build this buffer periodically, at constant intervals (6-9 ms). We can run the scrambler then, each time setting some keys if needed (which is cheap). Now with AVX2 we'll increase that buffer to 256 packets and forget about it for now - the performance headroom is so huge that even a 30% loss is acceptable, especially as we have only 800 Mbps worth of output right now.

Later I'll add a small constant-delay buffer so I can pre-buffer some packets and fill the scrambler to 100%. But first the simple stuff, to make it work.

PS I added the description to the AVX2 commit, as you asked.

glenvt18 commented 6 years ago

So, the proposed API is OK with you in general. I might change some bits here and there (names etc.). It will be living in a separate git branch for a while.

About bandwidth allocation. Note that each encrypt() processes exactly 8 packets per slot regardless of the word size. With larger words (AVX2) you only have more slots available per encrypt() call (but encrypt() runs longer). A smart (but still simple) allocation algorithm can assign slots immediately before each encrypt(), depending on the current need. And you can call encrypt() as many times as needed at that moment (that's why I suggested pointers). In other words, imagine you have an encryption buffer of 32 packets. Every time you need scrambling, you put a (key, packet) pair into the buffer. When the buffer is full, encrypt() is called. Isn't that simple? The only difference is that in our case the granularity is 8 packets instead of one.
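A sketch of such a greedy assignment (n_queued_packets() is a made-up helper returning a stream's current backlog):

/* Assign slots to streams immediately before an encrypt() call. */
int assign_slots(struct dvbcsa_bs_mx_stream_s *streams, int n_streams,
                 int total_slots)
{
  int next = 0;

  for (int i = 0; i < n_streams && next < total_slots; i++) {
    int want = (n_queued_packets(i) + 7) / 8;   /* 8 packets per slot */
    int left = total_slots - next;

    streams[i].start_slot = next;
    streams[i].n_slots = want < left ? want : left;
    next += streams[i].n_slots;
  }
  return next;  /* number of slots actually used */
}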

And, in the meantime, I'll try to get it working :)

mkaluza commented 6 years ago

Yes, it's OK, as far as I can tell without trying it out.

There's another reason for me to prefer simpler over more flexible. It's quite a niche piece of software - PC TV is a niche, PC TV on Linux is a niche within that niche, writing code for that stuff is yet another niche within it, and the need for multiple keys at once brings only two possible use cases to my mind. So there's no point in making it ultra-flexible (and therefore a bit complex): anyone who ever needs it will probably prefer a simple working example that shows how the upper layer works, so that he can craft something suitable to his needs (which he should easily manage if he's got this deep 'down here'), rather than a complex API most of which he doesn't need and only has to work around. And probably neither you nor I can predict what he'll need, so the API would be missing something anyway; there's no point in spending too much time on it.

Take my use case - we have FIFO ring buffers for each stream, and each time we encrypt something we need to update the pcks.data pointers anyway, so there's no need for another layer of pointers. I know I could always assign 32 separate streams, with some of them sharing the same key, and update only one key to rebalance bandwidth, but that would probably be a very small optimization, and having another layer of pointers right now would cause me more trouble than it's worth.

Probably when I start optimizing things I'll write something that uses your low-level functions, or even something just based on them. For example: in the end I'll have no need for struct dvbcsa_bs_batch_s at all, because I'll only handle TS packets in contiguous (or piecewise contiguous, as usually one key encrypts the audio and video streams) buffers, so the load function can itself figure out where the payload of each packet starts and ends without a layer of indirection, which would save a few cycles and some cache.

As for bandwidth allocation, I'll have more to say when I try it out - there are also some timing constraints to keep, so filling up the scrambler buffers is not that important, considering the vast amount of bandwidth available, and it becomes not-that-simple once you take timing into consideration.

But having multiple keys will save me most of the trouble and waste, as the wasted scrambler bandwidth would only be as big as the free mux bandwidth, so around 30% at most. I can live with that for now if it gives me a simple algorithm.

glenvt18 commented 6 years ago

@mkaluza Could you build https://github.com/glenvt18/libdvbcsa/tree/print_r and run testbitslice for AVX2 and AVX512? Please upload the output to pastebin.com.

mkaluza commented 6 years ago

here you are https://pastebin.com/5x0dDknF

glenvt18 commented 6 years ago

Thanks.


glenvt18 commented 6 years ago

@mkaluza Could you test (make check) https://github.com/glenvt18/libdvbcsa/tree/test1 with AVX2 and AVX512 and upload test/testbitslice.log to pastebin?

mkaluza commented 6 years ago

Just dropped them here, I hope it's not a problem? Tests seem to check out :)

testbitslice-avx2.log testbitslice-avx512.log

glenvt18 commented 6 years ago

@mkaluza Nice (I just don't have that AVX512 patch, and I run AVX2 with SDE). BTW it's almost ready. I have to do a couple of tweaks and some cleanup.

mkaluza commented 6 years ago

you do now :) https://github.com/mkaluza/libdvbcsa/commits/avx512

Great that it works ;) I'll be able to start working on my stuff thanks to it :) thanks to you, actually :)

SDE is cool :) and AVX512 has some cool features that can speed up at least those transpose functions (AVX2 as well, but it has no scatter). The most bizarre idea that came to my mind is to load the entire sbox or sbox_perm LUT into zmm registers (there are 32 of them, each 64 bytes wide, so it would take up only 8) and do the LUT from there ;) But that's just a crazy idea for now. Another thing is that sbox_permute doesn't take that much time anymore (it's 2200 vs 3300 Mbps with it disabled, vs 9500 Mbps with the block cipher disabled entirely), so I'm not sure how much there is to gain.

I will also have some questions about those transpositions, but later - let's not do too many things at once :P They are only about 10% of the entire workload, so there isn't much to gain there, but still... even 5% is nice :) And that's probably it... unless someone comes up with some ingenious implementation (probably on AVX512, probably completely different), I can't see any possibilities for big gains anymore.

glenvt18 commented 6 years ago

@mkaluza You can disable the transposes (add a return at the beginning of the function) and run benchmarks.

You should implement BS_SWAPXX_LE (see dvbcsa_bs_transpose.h, dvbcsa_bs_neon.h). They are tested by testbsops.

glenvt18 commented 6 years ago

@mkaluza Here you are: https://github.com/glenvt18/libdvbcsa/tree/mx-wip

I'll update docs later. This is a working branch. I may update it any time.

glenvt18 commented 6 years ago

@mkaluza Update: added key setup benchmark.

mkaluza commented 6 years ago

You can disable transposes (add return to the beginning of the function) and run benchmarks.

I know, I did exactly that.

You should implement BS_SWAPXX_LE (see dvbcsa_bs_transpose.h, dvbcsa_bs_neon.h). They are tested by testbsops.

So far I've failed to do it efficiently. The current implementation, based on simple bitwise operations (which have higher throughput than the unpacks, shuffles and permutes that all need port 5), seems very efficient.

Instead of trying to do them faster, I was thinking about not doing some of them at all :) Right now everything is done a bit "by the book" inside dvbcsa_bs_algo: copy first, transpose, de/encrypt, transpose back, transpose again etc. I was thinking about merging a few of those to avoid some round trips to cache, namely:

- load with first transpose
- two inner transpositions between the ciphers
- final transposition and store

With AVX2 and 256 packets, the working set is 47k (256 packets x 184 payload bytes), which already exceeds the L1 cache size. It's even worse with AVX512 (94k, 3x the cache size), which I suspect is the reason for the small speed increase. By merging those operations maybe I'll be able to avoid some waits on the L2 cache. But that's just an idea for now, and I need to understand the algorithm a bit more to check if it's even possible.

AVX2 gather and AVX512 scatter might be helpful here.

But don't worry - I understand it would cause major changes in code layout so I'm not going to bother you with it any time soon :)

But first I'll put your new API to use :) thank you very much :)

glenvt18 commented 6 years ago

load with first transpose, two inner transpositions between the ciphers, final transposition and store.

Impossible for encrypt() - CBC starts from the last block of the packet (see https://en.wikipedia.org/wiki/Common_Scrambling_Algorithm).

You don't have to worry about pkt_buf, because processing is done on 8-byte blocks and data is accessed sequentially most of the time. In fact, pkt_buf is much better than accessing the original packet data (see 4d904cddd). The internal cipher buffers are more important in terms of cache usage. I made attempts to shrink some of them but only got regressions.

Working with high bit rates implies heavy memory use everywhere, not only in the scrambler or your program alone. So L1 misses may be inevitable. You can also benefit from running one scrambling thread per CPU core. A big problem here is that recreating a 'real' environment is much more difficult than running a simple benchmark program.

mkaluza commented 6 years ago

Ok, one thing to get clear about at the beginning. I'm not questioning anything you did here. In fact, the more I read into it, the more I admire it. But this also means that probably all the clever questions have already been asked, so maybe some stupid questions can provoke some clever answers... So if it seems my questions are stupid and that I haven't read the code carefully, you're right, but it's on purpose :)

impossible for encrypt() - CBC starts from the last block of the packet (see https://en.wikipedia.org/wiki/Common_Scrambling_Algorithm).

Ok, but maybe it's possible for decrypt, and that's what 99.9% of people use the library for.

You don't have to worry about pkt_buf, because processing is done on 8-byte blocks and data is accessed sequentially most of the time. In fact, pkt_buf is much better than accessing the original packet data (see 4d904cd). The internal cipher buffers are more important in terms of cache usage. I made attempts to shrink some of them but only got regressions.

Right, now I see that load/store can't be touched, as they deal with unaligned memory. But maybe it's possible to transpose the entire pkt_buf at the beginning/end? Right now the pkt_buf access pattern is by column, which is not efficient (and the wider the word, the bigger the problem - for AVX512 the stream cipher register file takes 23k, so there's little cache left for packet data). Besides, doing the transpose in one big piece might allow other SIMD instructions to do it more efficiently (and only once - right now each cipher does it). This way only the stream cipher would have to go from byte-slice to bit-slice, and that has a linear access pattern. It would also allow SIMD to be used for the XOR where a 64-bit register is used now.

Working with high bit rates implies heavy memory use everywhere, not only in the scrambler, or your program alone. So, L1 misses may be inevitable. You can also benefit from running one scrambling thread per CPU core. A big problem here is that recreating a 'real' environment is much more difficult than running a simple benchmark program.

Of course, but as you said yourself - if the benchmark runs faster, usually the program does as well.

glenvt18 commented 6 years ago

This way only the stream cipher would have to go from byte-slice to bit-slice

That's not that easy. The matrix sizes are different. The block cipher code eliminates a transpose step by combining it with load/store; you can't get away from that. And even if you managed it, I think all you'd gain is ~2%.

I've just run a couple of benchmarks on Intel with SSSE3:

And when I use --enable-alt-sbox I get another +10%. The effect of that option depends on the gcc version. So you get a +/-5-10% fluctuation after upgrading your gcc toolchain (and that can be considered OK). I use 7 toolchains for testing. With clang there is usually a 20-30% drop. In the 'real' world you could lose up to 50%! And up to 70% if your batch buffers are not handled properly (but some users are still satisfied with the remaining 30%).

I'd recommend measuring your application's performance carefully - batch buffer fill ratios, delays, etc. Also watch out for thread contention. Have 2-3 libdvbcsa.so builds (SSSE3, AVX2, AVX512) and swap them to see the difference. Now you have a version with a processing delay as low as only 8 packets. This gives you amazing possibilities. If you really need more performance from the scrambler (from benchbitslice), try to optimize the block cipher first. Only when the block cipher performs 2x-3x faster will you notice the (possible) effect of optimizing the transposes.

And, please, use the plain text option in your email client (or add '> ') in replies.

mkaluza commented 6 years ago

Ok, I've tried using the API and it's almost perfect :) I could use one more field there, but since it's not directly related to the scrambling operation I'll add it in my own fork. So far the usage is almost completely transparent :)

I'll let you know when I have the CAS-CAM chain completed and tested :)

Thanks for great work :)

glenvt18 commented 6 years ago

Nice :) If you have data not related to scrambling, I think it's better to keep it at the application level. You don't have to create a fork for that. After some testing, this work will eventually be merged into master. Do you think we still have to keep cw in struct dvbcsa_bs_mx_stream_s?

Take a look at https://github.com/glenvt18/libdvbcsa/commits/time-measure. It helps measure 'real world' performance. Adapt it for multi-stream (load_mx/store_mx) and add a mutex if you want to call encrypt from multiple threads. Make sure the 'fill' output is close to 100%.

BTW I've implemented a register-based approach to the block cipher rounds loop: https://github.com/glenvt18/libdvbcsa/commits/block-reg. It saves a lot of L1 cache. See if it helps on AVX2/AVX512.

mkaluza commented 6 years ago

Well... for me it is related to scrambling, the same way the CW is. I need an even/odd marker there. This way my packet-crunching code deals with only one structure, one pointer and one set of fields, and the management code deals with the other set. No locks needed. But I'll look deeper into it and let you know. For now I'm happy that it works :) Last night I managed to test the entire loop (my scrambler talking to the CAS, scrambling, sending ECMs to the CAM and getting a descrambled signal back :). So it's a success :)

I'll look into those time measurements. I did a quick benchmark of block-reg using AVX2 and it's slower - 1690 vs 2190. I don't have AVX512 though - @nto, can you help here?

Thanks :)