rnpgp / rnp

RNP: high performance C++ OpenPGP library used by Mozilla Thunderbird
https://www.rnpgp.org

Notes after trying on performance testing. #412

Open catap opened 7 years ago

catap commented 7 years ago

I've run some quick-and-dirty performance tests to compare the performance of rnp (latest master) and GnuPG 2.1.23.

All tests ran on my laptop with a lot of background applications running, and I didn't repeat them several times to get reliable numbers. Still, they paint the picture and point a direction for optimisation.

Short summary:

Test results (all times in seconds, same key used): rnp:

gpg:

CPU usage while encrypting a 1 GB file:

Total: 7474 samples
    5843  78.2%  78.2%     5843  78.2% __platform_memccpy
     816  10.9%  89.1%     3795  50.8% _deflateCopy
     256   3.4%  92.5%      256   3.4% Botan::SHA_160::sse2_compress_n
     229   3.1%  95.6%      231   3.1% _inflateMark
     121   1.6%  97.2%      252   3.4% _pgp_cipher_cfb_encrypt
     103   1.4%  98.6%      103   1.4% _CAST_encrypt
      53   0.7%  99.3%      836  11.2% _deflateSetDictionary
      11   0.1%  99.4%       11   0.1% _zError
       9   0.1%  99.6%      111   1.5% _CAST_ecb_encrypt
       9   0.1%  99.7%      132   1.8% _botan_block_cipher_encrypt_blocks
       8   0.1%  99.8%      122   1.6% _EVP_EncryptUpdate
       5   0.1%  99.9%        5   0.1% __kernelrpc_mach_vm_deallocate_trap
       4   0.1%  99.9%      116   1.6% _EVP_cast5_ecb
       2   0.0%  99.9%        2   0.0% _botan_block_cipher_decrypt_blocks
       1   0.0%  99.9%      133   1.8% Botan::mgf1_mask
       1   0.0% 100.0%        1   0.0% _CAST_set_key
       1   0.0% 100.0%     2588  34.6% _pgp_push_enc_crypt
       1   0.0% 100.0%        1   0.0% _pgp_push_stream_enc_se_ip
       1   0.0% 100.0%        1   0.0% _write

As you can see, 78% of the CPU time is spent in memcpy, which adds some optimism that this can be fixed easily :)

cc: @ronaldtse @dewyatt

Anyway, right now I see a few tasks for me:

catap commented 7 years ago

Well, it crashed inside the zlib decompression code.

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x14423e000)
  * frame #0: 0x00007fffb61c1fbf libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell + 287
    frame #1: 0x000000010009b21e librnp.0.dylib`zlib_compressed_data_reader(stream=0x0000000100702050, dest=0x0000000143800000, length=<unavailable>, errors=0x0000000100702158, readinfo=0x0000000100702090, cbinfo=0x00000001007020d8) at compress.c:192 [opt]
    frame #2: 0x00000001000a3bd5 librnp.0.dylib`sub_base_read(stream=0x0000000100702050, dest=0x0000000143800000, length=1073741824, errors=0x0000000100702158, readinfo=<unavailable>, cbinfo=0x00000001007020d8) at packet-parse.c:241 [opt]
    frame #3: 0x00000001000a3daf librnp.0.dylib`pgp_limited_read [inlined] full_read(stream=0x0000000100702050, dest="[...]", length=1073741824, errors=0x0000000100702158, readinfo=0x0000000100702090, cbinfo=<unavailable>) at packet-parse.c:326 [opt]
    frame #4: 0x00000001000a3d7d librnp.0.dylib`pgp_limited_read(stream=0x0000000100702050, dest="[...]", length=1073741824, region=0x00007fff5fbe9050, errors=0x0000000100702158, readinfo=0x0000000100702090, cbinfo=<unavailable>) at packet-parse.c:417 [opt]
    frame #5: 0x00000001000a621b librnp.0.dylib`parse_packet [inlined] limread(dest=<unavailable>, length=1073741824, region=<unavailable>, info=0x0000000100702050) at packet-parse.c:456 [opt]
    frame #6: 0x00000001000a61f4 librnp.0.dylib`parse_packet at packet-parse.c:2422 [opt]
    frame #7: 0x00000001000a6071 librnp.0.dylib`parse_packet(stream=0x0000000100702050, pktlen=<unavailable>) at packet-parse.c:3328 [opt]
    frame #8: 0x00000001000a533b librnp.0.dylib`pgp_parse(stream=0x0000000100702050, perrors=0) at packet-parse.c:3433 [opt]
    frame #9: 0x000000010009af16 librnp.0.dylib`pgp_decompress(region=0x00007fff5fbf0710, stream=0x0000000100702050, type=PGP_C_ZLIB) at compress.c:394 [opt]
    frame #10: 0x00000001000a6003 librnp.0.dylib`parse_packet [inlined] parse_compressed at packet-parse.c:2279 [opt]
    frame #11: 0x00000001000a5f66 librnp.0.dylib`parse_packet(stream=0x0000000100702050, pktlen=<unavailable>) at packet-parse.c:3320 [opt]
    frame #12: 0x00000001000aac8b librnp.0.dylib`decrypt_se_ip_data [inlined] pgp_parse(stream=0x0000000100702050, perrors=0) at packet-parse.c:3433 [opt]
    frame #13: 0x00000001000aac80 librnp.0.dylib`decrypt_se_ip_data(tag=PGP_PTAG_CT_SE_IP_DATA_BODY, region=<unavailable>, stream=<unavailable>) at packet-parse.c:3107 [opt]
    frame #14: 0x00000001000a64ba librnp.0.dylib`parse_packet [inlined] parse_se_ip_data at packet-parse.c:3188 [opt]
    frame #15: 0x00000001000a63d4 librnp.0.dylib`parse_packet(stream=0x0000000100702050, pktlen=<unavailable>) at packet-parse.c:3349 [opt]
    frame #16: 0x00000001000a533b librnp.0.dylib`pgp_parse(stream=0x0000000100702050, perrors=1) at packet-parse.c:3433 [opt]
    frame #17: 0x000000010009c77b librnp.0.dylib`pgp_decrypt_file(io=<unavailable>, infile=<unavailable>, outfile=<unavailable>, secring=0x0000000101200090, pubring=<unavailable>, use_armour=<unavailable>, allow_overwrite=<unavailable>, sshkeys=<unavailable>, passfp=<unavailable>, numtries=<unavailable>, getpassfunc=<unavailable>) at crypto.c:660 [opt]
    frame #18: 0x00000001000b931c librnp.0.dylib`rnp_decrypt_file(ctx=0x00007fff5fbff288, f="1g.bin.gpg", out=0x0000000000000000) at rnp.c:1184 [opt]
    frame #19: 0x0000000100002153 rnp`rnp_cmd(cfg=0x00007fff5fbff380, rnp=0x00007fff5fbff3a0, cmd=261, f="1g.bin.gpg") at rnp.c:337 [opt]
    frame #20: 0x0000000100001999 rnp`main(argc=<unavailable>, argv=<unavailable>) at rnp.c:684 [opt]
    frame #21: 0x00007fffb5faf235 libdyld.dylib`start + 1
    frame #22: 0x00007fffb5faf235 libdyld.dylib`start + 1
(lldb) up
librnp.0.dylib was compiled with optimization - stepping may behave oddly; variables may not be available.
frame #1: 0x000000010009b21e librnp.0.dylib`zlib_compressed_data_reader(stream=0x0000000100702050, dest=0x0000000143800000, length=<unavailable>, errors=0x0000000100702158, readinfo=0x0000000100702090, cbinfo=0x00000001007020d8) at compress.c:192 [opt]
   189          if (len + cc > length) {
   190              len = length - cc;
   191          }
-> 192          (void) memcpy(&cdest[cc], &z->out[z->offset], len);
   193          z->offset += len;
   194      }
   195  
(lldb)

And this bug isn't related to the encryption process, because it also happens with a file that was encrypted by GnuPG.

catap commented 7 years ago

As confirmation that the crash is our bug, I ran a very simple test: encrypt a 1 GB file with rnp and try to decrypt it with gpg.

GnuPG decrypts rnp's file.

ni4 commented 7 years ago

I doubt that memory usage can be easily decreased: the current codebase doesn't use streamed processing. Changing this will require rewriting a lot of the readers/writers.

catap commented 7 years ago

@ni4 yeah, but I think we should do this, because gpg uses a constant amount of memory (about 1 MB).

ronaldtse commented 7 years ago

Agree with @catap that we must use iterative, block-sized processing. Compression, encryption/decryption, and signing/verification all work with such processing. @ni4, is there any particular part we should be concerned about?

ni4 commented 7 years ago

@ronaldtse I also completely agree that we should process data in a streamed way using constant memory. I'm just noting that this would require rewriting a lot of the library's internals and will take a lot of time. So in my opinion it would be better to do that after the first release, not right now.

ronaldtse commented 7 years ago

That's true; I suppose we will do what we can incrementally, without requiring a huge one-time rewrite.

dewyatt commented 7 years ago

There should be some low-hanging fruit here that could be improved upon immediately, though.

This would be a good candidate: https://github.com/riboseinc/rnp/blob/5c4376c298656debf8bc03cb0c8ca0c2aa8bd611/src/lib/signature.c#L1311

So for detached signing, we're reading the entire file into memory just so we can hash it. That's obviously dumb and easily improved: a simple fread loop that calls pgp_sig_add_data is probably an improvement.
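
A minimal sketch of such a loop. The helper name, the chunk size, and the exact pgp_sig_add_data signature (context, buffer, length) are assumptions here, not rnp's actual API:

```c
#include <stdint.h>
#include <stdio.h>

#define HASH_CHUNK_SIZE (64 * 1024)

/* Hypothetical helper: feed the file to the signature hash in
 * fixed-size chunks instead of reading it all into memory first. */
static int
hash_file_chunked(pgp_create_sig_t *sig, const char *path)
{
    FILE   *fp = fopen(path, "rb");
    uint8_t buf[HASH_CHUNK_SIZE];
    size_t  n;

    if (!fp) {
        return 0;
    }
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0) {
        pgp_sig_add_data(sig, buf, n); /* hash incrementally */
    }
    int ok = !ferror(fp); /* distinguish EOF from a read error */
    fclose(fp);
    return ok;
}
```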

catap commented 7 years ago

@dewyatt well, pgp_mem_readfile is a wrapper around mmap, and I don't think the issue is there.

But I agree that we could feed pgp_sig_add_data in fixed-size chunks, and that may help.

Anyway, I think a good next step is to spend some time finding out what uses so much memory and how we can fix it with an easy workaround before the release.

And after the release, rewrite everything to streaming as @ni4 suggests.

ronaldtse commented 7 years ago

@catap echoing @dewyatt: it's not mmap that uses this memory immediately, it is the subsequent read calls that load the full file contents.

https://github.com/riboseinc/rnp/blob/b9c993a46d4d1da12d5a24ac6bbb11620107f029/src/lib/misc.c#L765

I guess we need to either rewrite pgp_mem_readfile or ensure the chunk size concept is consistent from top to bottom. This is still rather 'low-hanging'? 😉

catap commented 7 years ago

@ronaldtse it falls back to read only if mmap fails:

https://github.com/riboseinc/rnp/blob/b9c993a46d4d1da12d5a24ac6bbb11620107f029/src/lib/misc.c#L756-L773

ronaldtse commented 7 years ago

Geez, my reading skills.

dewyatt commented 7 years ago

Right, @catap is right here; I completely forgot we were using mmap. :)

ronaldtse commented 7 years ago

We could instead mmap/munmap portions of the file chunk by chunk to prevent this issue (see the sketch below).

This line mmaps the entire file size at once. https://github.com/riboseinc/rnp/blob/b9c993a46d4d1da12d5a24ac6bbb11620107f029/src/lib/misc.c#L755
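
A sketch of that windowed variant (hypothetical helper, not existing rnp code). The window size is a multiple of the page size, which keeps the mmap offsets page-aligned as POSIX requires:

```c
#include <stdint.h>
#include <sys/mman.h>

#define MAP_WINDOW (1024 * 1024) /* 1 MB, a multiple of the page size */

/* Hypothetical helper: map and process one window at a time so only
 * MAP_WINDOW bytes are mapped instead of the whole file. */
static int
process_file_windowed(int fd, off_t file_size,
                      void (*process)(const uint8_t *, size_t))
{
    for (off_t off = 0; off < file_size; off += MAP_WINDOW) {
        size_t len = (size_t) (file_size - off < MAP_WINDOW
                                   ? file_size - off
                                   : MAP_WINDOW);
        /* 'off' stays page-aligned because MAP_WINDOW is */
        uint8_t *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) {
            return 0;
        }
        process(p, len);
        munmap(p, len);
    }
    return 1;
}
```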

dewyatt commented 7 years ago

I wonder if we could get a performance boost for cases like detached signing (where we just need to read+hash the file) with posix_madvise + POSIX_MADV_SEQUENTIAL. It seems to fit the use case perfectly, so maybe worth a shot.
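
Something like this sketch, assuming we already have the file descriptor and size at hand (pgp_sig_add_data is used here only for illustration):

```c
#include <sys/mman.h>

/* Map the file, then tell the kernel we will read it sequentially so
 * it can read ahead aggressively and drop pages behind us. */
void *p = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (p != MAP_FAILED) {
    (void) posix_madvise(p, file_size, POSIX_MADV_SEQUENTIAL);
    pgp_sig_add_data(sig, p, file_size); /* hash straight from the mapping */
    munmap(p, file_size);
}
```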

ronaldtse commented 7 years ago

Having read this, I wonder if a simple fgets loop is actually faster and uses less memory than what we do now?

ni4 commented 7 years ago

I doubt the file reading operation itself is the main bottleneck. We should just read in large enough chunks (say, 64 KB or 128 KB) instead of 1-byte reads.

ronaldtse commented 7 years ago

Plenty of stuff to try, someone just needs to jump on this 👍

catap commented 7 years ago

@ronaldtse I'm on it, but right now I'm in 4 GB+ files hell :)

catap commented 7 years ago

@dewyatt, @flowher, @ni4, @ronaldtse I think I need your help.

While working on #421 I noticed some interesting behaviour in GnuPG.

It uses partial/streamed writing for any packet bigger than 8 KB.

For example:

➜  rnp-gpg-test pgpdump 100k.bin.gpg                                          
Old: Public-Key Encrypted Session Key Packet(tag 1)(268 bytes)
    New version(3)
    Key ID - 0x9EBD8590F664AD37
    Pub alg - RSA Encrypt or Sign(pub 1)
    RSA m^e mod n(2048 bits) - ...
        -> m = sym alg(1 byte) + checksum(2 bytes) + PKCS-1 block type 02
New: Symmetrically Encrypted and MDC Packet(tag 18)(8192 bytes) partial start
    Ver 1
    Encrypted data [sym alg is specified in pub-key encrypted session key]
        (plain text + MDC SHA1(20 bytes))
New:    (8192 bytes) partial continue
New:    (8192 bytes) partial continue
New:    (8192 bytes) partial continue
New:    (8192 bytes) partial continue
New:    (8192 bytes) partial continue
New:    (8192 bytes) partial continue
New:    (8192 bytes) partial continue
New:    (8192 bytes) partial continue
New:    (8192 bytes) partial continue
New:    (8192 bytes) partial continue
New:    (8192 bytes) partial continue
New:    (4096 bytes) partial continue
New:    (103 bytes) partial end
➜  rnp-gpg-test 

I did some research into how this works and realised that GnuPG creates partial chunks no larger than 8 KB; when less data remains, it tries to emit a power-of-two (2^n) sized chunk, and once fewer than 512 bytes remain it writes a final chunk of exactly that size.

RFC 4880 allows a partial chunk to be up to 1 GB, and the existing rnp code (and netpgp, I think) supports such 1 GB chunks.

I think the main reason for the 8 KB limit is some sort of legacy, but IMO switching to the GnuPG approach may help us decrease memory usage, and I haven't seen any issue with it.
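
A sketch of that chunking rule as I understand it, assuming we know how many bytes are still buffered (illustrative, not GnuPG's actual code; RFC 4880 partial body lengths must be powers of two, and a packet using them must end with a definite-length chunk):

```c
#include <stddef.h>

/* Pick the next chunk size per the behaviour described above:
 * 8 KB partial chunks while enough data remains, then the largest
 * power of two that still fits (down to 512), and finally a
 * definite-length chunk for whatever is left. */
static size_t
next_chunk_size(size_t remaining, int *is_final)
{
    if (remaining >= 8192) {
        *is_final = 0;
        return 8192;
    }
    if (remaining >= 512) {
        size_t chunk = 4096;
        while (chunk > remaining) {
            chunk >>= 1; /* 4096 -> 2048 -> 1024 -> 512 */
        }
        *is_final = 0;
        return chunk;
    }
    *is_final = 1; /* written with a regular length header */
    return remaining;
}
```

This matches the pgpdump output above: twelve 8192-byte chunks, one 4096-byte chunk, then a 103-byte definite-length tail.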

Do you see any issues?

ni4 commented 7 years ago

@catap This may help a bit. However, I've started writing streamed-processing code. It will take some time, but after that we'll have a constant memory footprint for all operations.

catap commented 7 years ago

@ni4 do you mean the decoding code? Because rnp already has the encoding code :)

ni4 commented 7 years ago

@catap Both, reader and writer.

catap commented 7 years ago

@ni4 but #421 uses the writer...

ni4 commented 7 years ago

@catap I probably described it poorly; I meant a #424-like approach, i.e. we process data in limited-size blocks from the input to the output. Right now, for instance, the writer puts the whole literal data packet in memory, then compresses it, then writes it to the output.
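
For illustration, a minimal sketch of that kind of block-by-block pipeline using zlib's streaming API directly (not rnp's reader/writer code): memory use stays bounded by the two fixed buffers regardless of input size.

```c
#include <stdio.h>
#include <zlib.h>

/* Compress one FILE* to another block by block: read a chunk,
 * deflate it, write the output immediately, repeat. */
static int
compress_stream(FILE *in, FILE *out)
{
    unsigned char inbuf[64 * 1024], outbuf[64 * 1024];
    z_stream      z = {0};
    int           flush, ret = Z_OK;

    if (deflateInit(&z, Z_DEFAULT_COMPRESSION) != Z_OK) {
        return 0;
    }
    do {
        z.avail_in = (uInt) fread(inbuf, 1, sizeof(inbuf), in);
        if (ferror(in)) {
            deflateEnd(&z);
            return 0;
        }
        z.next_in = inbuf;
        flush     = feof(in) ? Z_FINISH : Z_NO_FLUSH;
        do { /* drain all output produced for this input chunk */
            z.avail_out = sizeof(outbuf);
            z.next_out  = outbuf;
            ret = deflate(&z, flush);
            fwrite(outbuf, 1, sizeof(outbuf) - z.avail_out, out);
        } while (z.avail_out == 0);
    } while (flush != Z_FINISH);
    deflateEnd(&z);
    return ret == Z_STREAM_END;
}
```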

kriskwiatkowski commented 7 years ago

@ni4 @catap As far as I understand, partial chunks will solve the problem mentioned in #424, won't they?

"For PGP data processing it would be a bit different since we don't know the packet contents from the beginning, so should insert some 'decision makers' which parse data and create processors on the go.",

catap commented 7 years ago

@flowher yes, and you just approved a PR where I used the existing partial-chunk code. Anyway, I hadn't seen #424 before and started working on the reader part, because the writer part already existed...

Have I missed anything?

kriskwiatkowski commented 7 years ago

Yep, #421. Nothing missed. I just think that using partial chunks for reading and writing is a good idea (I thought streaming would be largely solved by using partial chunks).

ni4 commented 7 years ago

@catap @flowher Now I finally get the idea; sorry, I was a bit tired. But what happens if some implementation writes a 1 GB partial chunk? Or feeds us a z-bomb, say 1 KB of zlib data that decompresses to 10 GB?

kriskwiatkowski commented 7 years ago

The implementation needs to handle the fact that one partial block of input doesn't correspond to one partial block of output. But I would expect the implementation to process one partial block (of input) at a time. So:

the input file is read into memory in chunks (say, 64 KB blocks)

and I would say each such chunk becomes a partial chunk. How about that?

catap commented 7 years ago

@ni4 GnuPG can encrypt and decrypt (not always, to be honest) very big files. I've never tried files bigger than 1 TB, but it works for files up to 1 TB.

And yes, it creates a lot of 8 KB chunks :)

Do you mind if I finish the partial-chunk decoder as part of this task? :)

ronaldtse commented 7 years ago

GnuPG probably splits into 8KB packets for streaming purposes? I'm fine with using the GnuPG approach, but maybe the packet size could be tunable.

I do agree that #424 should be implemented in case someone gives us a 1 GB packet; I believe @ni4 is working on that.

ni4 commented 7 years ago

@ronaldtse Yep, that's what I'm working on right now. Starting with the reader; the writer will be next. I'll implement this first for symmetric encryption/decryption.

ronaldtse commented 7 years ago

@catap just posted this gist: https://gist.github.com/catap/f909b1d221f93c446edff03832fba075. rnp can decrypt a 100 KB file from GnuPG, but not yet the 4 GB file.

Hopefully we can solve this soon.