DirtYiCE opened this issue 3 years ago
Profiling this, getrn, __pthread_rwlock_rdlock and __pthread_rwlock_unlock take unusual amounts of time (18.85%, 15.79%, 15.65%) in atomic operations. Particularly strange given this test program is single threaded.
In callgrind, ossl_lib_ctx_get_data takes 20%, so this will be helped by #17116.
Yes, the OSSL_LIB_CTX locking is particularly horrific (as mentioned in #17116) and ossl_lib_ctx_get_data will be a primary path for hitting that code, so it's not a big surprise that we're running into problems as a result.
Incorporating my recent performance fixes, as well as my libctx refactor, this drops from 1.315s in 3.0 to 0.982s (1.34x speedup). 1.1 is still 0.038s, so this remains an issue.
I don't see that change any time soon :wink:
https://github.com/openssl/openssl/pull/17881 fixes the lock contention in ossl_lib_ctx_get_data.
Any chance this will get some attention, or is there any workaround? I am surprised such a massive regression made it into a release.
I have an application where I parse a large number of private keys, and it turns out it is practically unusable in openssl 3.0 and close to unusable in 3.1. This is using python cryptography, but digging down it appears the problem is within openssl itself.
I did some rough benchmarks, and with 3.0 there's a 7000% slowdown. With 3.1 it's "better" with "only" a slowdown of 1000%.
@hannob are you able to provide a synthetic benchmark test?
key.c:
#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/pem.h>

int main() {
    EVP_PKEY *key;
    FILE *f;
    int i;

    for (i = 0; i < 100000; i++) {
        f = fopen("test.key", "r");
        if (f == NULL) {
            printf("cannot open test.key\n");
            return 1;
        }
        key = PEM_read_PrivateKey(f, NULL, NULL, NULL);
        fclose(f);
        if (key == NULL) {
            printf("cannot parse test.key\n");
            return 1;
        }
        EVP_PKEY_free(key);
    }
    return 0;
}
test.key:
-----BEGIN RSA PRIVATE KEY-----
MIIBOgIBAAJBAMFcGsaxxdgiuuGmCkVImy4h99CqT7jwY3pexPGcnUFtR2Fh36Bp
oncwtkZ4cAgtvd4Qs8PkxUdp6p/DlUmObdkCAwEAAQJAUR44xX6zB3eaeyvTRzms
kHADrPCmPWnr8dxsNwiDGHzrMKLN+i/HAam+97HxIKVWNDH2ba9Mf1SA8xu9dcHZ
AQIhAOHPCLxbtQFVxlnhSyxYeb7O323c3QulPNn3bhOipElpAiEA2zZpBE8ZXVnL
74QjG4zINlDfH+EOEtjJJ3RtaYDugvECIBtsQDxXytChsRgDQ1TcXdStXPcDppie
dZhm8yhRTTBZAiAZjE/U9rsIDC0ebxIAZfn3iplWh84yGB3pgUI3J5WkoQIhAInE
HTUY5WRj5riZtkyGnbm3DvF+1eMtO2lYV+OuLcfE
-----END RSA PRIVATE KEY-----
Run:
gcc key.c -lcrypto
time ./a.out
On my laptop it's around 0.7 seconds with openssl 1.1, 40 seconds with 3.0 and 7 seconds with 3.1.
Could you also please try 3.2? Do I correctly understand that it's ARM?
With 3.2 you're referring to the current git code? I'll try. (AFAIK there is no 3.2 release or alpha/beta yet, right?) No, this is a normal Intel x86/64bit cpu.
Yes, the current master; sorry for being unclear.
Linking against current git master it's ~12 seconds (so it got worse compared to 3.1).
It's quite strange. Are you able to get any profiling data?
It is trivial to use that test case and get profiling data.
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
12.26 8.12 8.12 529000395 0.00 0.00 OPENSSL_sk_value
9.24 14.25 6.12 200000000 0.00 0.00 collect_extra_decoder
5.51 17.90 3.65 204300000 0.00 0.00 ossl_decoder_fast_is_a
4.93 21.17 3.27 50002087 0.00 0.00 ossl_lh_strcasehash
3.98 23.80 2.63 266900000 0.00 0.00 sk_OSSL_DECODER_INSTANCE_value
3.93 26.41 2.60 100000 0.00 0.00 OSSL_DECODER_CTX_add_extra
3.56 28.77 2.36 58607221 0.00 0.00 getrn
2.35 30.32 1.55 200000000 0.00 0.00 sk_OSSL_DECODER_value
2.31 31.86 1.53 208500000 0.00 0.00 OSSL_DECODER_get0_provider
2.07 33.23 1.38 204000000 0.00 0.00 OSSL_PROVIDER_get0_provider_ctx
1.89 34.48 1.25 162000000 0.00 0.00 ossl_decoder_get_number
1.81 35.68 1.20 300008 0.00 0.00 sa_doall
1.78 36.86 1.18 374800000 0.00 0.00 ossl_assert_int
1.63 37.94 1.08 48100000 0.00 0.00 resolve_name
1.52 38.95 1.01 204000000 0.00 0.00 ossl_provider_prov_ctx
1.51 39.95 1.00 50000058 0.00 0.00 ossl_namemap_name2num
1.45 40.91 0.96 58601805 0.00 0.00 OPENSSL_LH_retrieve
1.42 41.85 0.94 50400116 0.00 0.00 ossl_namemap_stored
1.42 42.79 0.94 8300040 0.00 0.00 CRYPTO_UP_REF
1.24 43.62 0.82 50400116 0.00 0.00 ossl_namemap_empty
1.17 44.39 0.78 4000000 0.00 0.00 collect_decoder
1.09 45.11 0.72 8601072 0.00 0.00 OPENSSL_LH_strhash
1.09 45.83 0.72 100000 0.00 0.00 EVP_DecodeUpdate
1.03 46.52 0.69 66832523 0.00 0.00 ossl_tolower
1.00 47.17 0.66 50000861 0.00 0.00 namemap_name2num
0.97 47.81 0.64 66800684 0.00 0.00 ossl_lib_ctx_get_data
0.92 48.42 0.61 68402450 0.00 0.00 CRYPTO_THREAD_unlock
0.83 48.98 0.55 50001129 0.00 0.00 namenum_hash
0.79 49.50 0.53 50000861 0.00 0.00 lh_NAMENUM_ENTRY_retrieve
0.75 50.00 0.50 pthread_rwlock_rdlock
0.72 50.48 0.48 7901304 0.00 0.00 OPENSSL_strcasecmp
0.72 50.96 0.48 800000 0.00 0.00 bin2bn
0.70 51.42 0.47 68100866 0.00 0.00 CRYPTO_THREAD_read_lock
0.70 51.89 0.47 100000 0.00 0.00 sk_OSSL_DECODER_new_null
0.66 52.33 0.44 OSSL_DECODER_get0_properties
0.63 52.74 0.41 OSSL_PROVIDER_get0_dispatch
0.59 53.13 0.39 86900000 0.00 0.00 conv_ascii2bin
0.59 53.52 0.39 700000 0.00 0.00 evp_decodeblock_int
0.59 53.91 0.39 pthread_rwlock_unlock
0.56 54.28 0.37 OSSL_DECODER_get0_name
0.50 54.62 0.33 100070 0.00 0.00 doall_util_fn
0.49 54.94 0.33 45000000 0.00 0.00 collect_decoder_keymgmt
0.48 55.26 0.32 8600000 0.00 0.00 ossl_bsearch
0.47 55.57 0.31 45100000 0.00 0.00 sk_EVP_KEYMGMT_value
0.45 55.87 0.30 8600434 0.00 0.00 ossl_property_string
0.44 56.16 0.29 llseek
0.37 56.41 0.24 68200843 0.00 0.00 ossl_lib_ctx_get_concrete
0.37 56.65 0.24 4300000 0.00 0.00 ossl_decoder_instance_new
0.34 56.88 0.23 26800000 0.00 0.00 do_name
0.30 57.08 0.20 8600000 0.00 0.00 ossl_property_find_property
0.29 57.27 0.19 6300000 0.00 0.00 ossl_property_str
0.27 57.45 0.18 54400098 0.00 0.00 ossl_provider_libctx
0.27 57.62 0.18 300112 0.00 0.00 OPENSSL_sk_pop_free
0.27 57.80 0.18 _int_malloc
0.26 57.98 0.17 300000 0.00 0.00 ossl_lib_ctx_is_default
0.25 58.15 0.17 13700889 0.00 0.00 CRYPTO_zalloc
0.25 58.31 0.17 301584 0.00 0.00 CRYPTO_THREAD_write_lock
0.24 58.47 0.16 1000000 0.00 0.00 asn1_item_embed_d2i
0.24 58.63 0.16 __memset_avx2_unaligned_erms
0.23 58.78 0.15 8300080 0.00 0.00 OSSL_DECODER_free
0.23 58.94 0.15 ossl_toupper
0.23 59.09 0.15 4200000 0.00 0.00 alg_do_each
0.23 59.24 0.15 300000 0.00 0.00 decoder_process
0.20 59.38 0.14 268 0.00 0.00 lh_NAMENUM_ENTRY_error
0.20 59.51 0.13 1800368 0.00 0.00 BIO_gets
0.20 59.63 0.13 100028 0.00 0.00 OPENSSL_LH_num_items
0.20 59.77 0.13 ossl_provider_teardown
0.17 59.88 0.12 9800282 0.00 0.00 OPENSSL_sk_num
0.17 59.99 0.11 15200000 0.00 0.00 property_idx_cmp
0.17 60.10 0.11 12501184 0.00 0.00 CRYPTO_THREAD_run_once
0.17 60.21 0.11 8800405 0.00 0.00 OPENSSL_sk_insert
0.17 60.32 0.11 8800405 0.00 0.00 OPENSSL_sk_push
0.17 60.43 0.11 58 0.00 0.00 ossl_namemap_name2num_n
And this call graph excerpt probably helps to see the context.
-----------------------------------------------
2.60 46.86 100000/100000 OSSL_DECODER_CTX_new_for_pkey [8]
[9] 74.6 2.60 46.86 100000 OSSL_DECODER_CTX_add_extra [9]
6.12 32.58 200000000/200000000 collect_extra_decoder [10]
1.55 3.07 200000000/200000000 sk_OSSL_DECODER_value [32]
0.01 2.83 100000/200000 OSSL_DECODER_do_all_provided [30]
0.47 0.00 100000/100000 sk_OSSL_DECODER_new_null [103]
0.05 0.08 5000000/266900000 sk_OSSL_DECODER_INSTANCE_value [25]
0.01 0.07 100000/100000 sk_OSSL_DECODER_pop_free [201]
0.01 0.00 5000000/9400000 OSSL_DECODER_INSTANCE_get_input_type [342]
0.00 0.00 100000/400000 sk_OSSL_DECODER_INSTANCE_num [396]
0.00 0.00 100000/100000 sk_OSSL_DECODER_num [463]
0.00 0.00 100000/5000000 ossl_assert_int [284]
-----------------------------------------------
6.12 32.58 200000000/200000000 OSSL_DECODER_CTX_add_extra [9]
[10] 58.4 6.12 32.58 200000000 collect_extra_decoder [10]
3.61 15.65 202100000/204300000 ossl_decoder_fast_is_a [11]
2.56 3.98 259400000/266900000 sk_OSSL_DECODER_INSTANCE_value [25]
1.35 0.99 200000000/204000000 OSSL_PROVIDER_get0_provider_ctx [42]
0.12 2.02 2100000/4300000 ossl_decoder_instance_new [33]
1.47 0.63 200000000/208500000 OSSL_DECODER_get0_provider [45]
0.04 0.09 1800000/4300000 ossl_decoder_instance_free [120]
0.01 0.03 1900000/1900000 pem2der_newctx [230]
0.01 0.02 300000/2500000 ossl_decoder_ctx_add_decoder_inst [133]
0.00 0.00 2100000/9400000 OSSL_DECODER_INSTANCE_get_input_type [342]
0.00 0.00 100000/100000 epki2pki_newctx [460]
0.00 0.00 100000/100000 spki2typespki_newctx [461]
-----------------------------------------------
0.04 0.17 2200000/204300000 decoder_process <cycle 5> [36]
3.61 15.65 202100000/204300000 collect_extra_decoder [10]
[11] 29.4 3.65 15.82 204300000 ossl_decoder_fast_is_a [11]
1.08 12.98 48100000/48100000 resolve_name [12]
1.25 0.51 162000000/162000000 ossl_decoder_get_number [50]
-----------------------------------------------
1.08 12.98 48100000/48100000 ossl_decoder_fast_is_a [11]
[12] 21.2 1.08 12.98 48100000 resolve_name [12]
0.96 9.41 48100000/50000058 ossl_namemap_name2num [13]
0.90 1.55 48100000/50400116 ossl_namemap_stored [41]
0.16 0.00 48100000/54400098 ossl_provider_libctx [136]
-----------------------------------------------
0.00 0.00 58/50000058 ossl_namemap_name2num_n [177]
0.00 0.02 100000/50000058 evp_is_a [262]
0.04 0.35 1800000/50000058 OSSL_DECODER_is_a [98]
0.96 9.41 48100000/50000058 resolve_name [12]
[13] 16.3 1.00 9.78 50000058 ossl_namemap_name2num [13]
0.66 8.33 50000058/50000861 namemap_name2num [15]
0.45 0.00 50000058/68402450 CRYPTO_THREAD_unlock [89]
0.34 0.00 50000058/68100866 CRYPTO_THREAD_read_lock [106]
-----------------------------------------------
I think @mattcaswell has optimized the decoder speed - or was it a different code path?
This profile is from current master
I remember that Matt has significantly improved the performance in decoding and it should be present in both 3.1 and master. I can't find the exact issue, unfortunately.
And I don't understand why we have a slowdown in master compared to 3.1.
callgrind shows me 30% time spent in ossl_x25519_public_from_private. It's completely weird if we are speaking about an RSA key.
Never mind, I had old files in my folder, sorry for the noise.
This reminds me that we had/have code that tries all possible formats and key types, even when we already know which format and key type it is. There are just too many calls to those functions.
I think @mattcaswell has optimized the decoder speed - or was it a different code path?
I didn't optimize the decoder code itself. There were some specific APIs which were making non-optimal use of the decoders which I fixed. But if you don't happen to call those very specific APIs then you won't notice a difference.
Does this function use optimal calls?
The reproducer code that @hannob has supplied is just calling PEM_read_PrivateKey - so that should be fine. I've not investigated this reproducer further than that.
So can we deal with a slowdown here?
Nothing obvious is springing to mind. But I'll take a look and see if I can come up with anything.
For master compared to 3.1 I'm seeing approximately a 5% slowdown.
The picture is complicated, but a significant part of the slowdown between master and 3.1 seems to be due to #18819, which adds a new decoder. Possibly due to a multiplying effect of the number of decoders being considered.
Maybe it's worth prioritizing RSA/EC/Edwards curves somehow in decoder iterations?
There may be some structural issues to address before doing that. This collect_extra_decoder function seems to be called 2000 times in a single private key parsing operation.
Poking around the code in the profile, it seems OpenSSL is quadratic in the number of decoders you have, and there are quite a lot of them. I don't fully follow what OSSL_DECODER_CTX_add_extra is doing, but it seems OpenSSL now models generic "container" formats like PEM, PrivateKeyInfo, etc., as pluggable conversion functions with input/output formats? And then, before it does anything, it looks like OSSL_DECODER_CTX_add_extra tries to build all possible paths through these conversion functions, matching inputs to outputs?
The first layer of this path-building starts with a set of leaf "decoder instances" and then, for each one (twice, actually), iterates over every "decoder" to find the ones that can chain backwards to that decoder instance's input type. That means decoders are tapped on the head 2 × decoder_instances × decoders times. (And then the next layer adds another O(decoders) iterations. I suspect this would also go exponential if the wrong series of container formats were added to the system.)
I couldn't find any documentation on why this path-builder design was chosen. It is a little odd to me because one shouldn't need to explore the whole tree of paths to parse one input. I think this comes out of OpenSSL working backwards from all possible outputs, rather than working forwards from the one input you have. The way these container formats are designed (and generally when parsing things), the forwards is the natural direction:
1. PEM_read_PrivateKey being called, we know the input is PEM. PEM is a type string (BEGIN WHATEVER) followed by base64-encoded data.
2. Suppose the type is RSA PRIVATE KEY, so you find the provider(s) that support RSA PRIVATE KEY and ask them to parse the data. They should parse an RSAPrivateKey ASN.1 structure and that's the end of it.
3. Now suppose instead the type is PRIVATE KEY. That means the type is PKCS#8 PrivateKeyInfo. That too is a generic, provider-independent format that, more-or-less, expands to another (type, data) pair. This time the type is an AlgorithmIdentifier.

At no point do you need to explore every possible chain of decoders. The only operation you need is "I have an input of format X. Call the function that parses X". Where there need only be a single implementation, that can be a direct function call. Where you wish it to be pluggable, that only needs a simple lookup. A container format decoder is just something that needs to do that dispatch a second time.
Possible patch in #21426
Regardless of my patch I think @davidben makes good points and we do need to look at the design of the decoder subsystem.
@mattcaswell I like your improvement but it adds complexity to the system - and @davidben describes something that looks to me like a simplification...
Yes, I agree but it's probably more of a 3.3 thing whereas I think we can get my patch into 3.2. It's not completely clear to me if we can do what @davidben suggests without breaking changes/new API. It's at the least a significant refactor. I do think we should do it though.
Yeah, the proposition by @davidben is definitely not something we could get into 3.2. Now the question is whether https://github.com/openssl/openssl/pull/21426 is OK for 3.2 or not, as it is not particularly trivial either.
Yes, I agree but it's probably more of a 3.3 thing whereas I think we can get my patch into 3.2.
Now that 3.2 is released and you all are working on 3.3, I may suggest taking some time to think about the structural issues in the encoder/decoder system. Working backwards and guessing formats is not a performant, secure, or well-defined way to decode an input. Start from the input type that the caller asked you to parse and work forwards.
This is already on our radar of things we might get into 3.3:
Start from the input type that the caller asked you to parse and work forwards.
Unfortunately we will also have to support the wildcard decoding as regressing this is not an option.
Unfortunately we will also have to support the wildcard decoding as regressing this is not an option.
By wildcard decoding, I assume you mean where the caller forgets to tell you the format, and OpenSSL has to try between several overlapping (i.e. ambiguous) formats? Ambiguous parsers are a thoroughly well-studied and well-understood source of security vulnerabilities, not to mention unpredictable behavior (i.e. backwards compatibility risks). Hopefully you all carefully analyzed all pairs of supported formats before exposing them to be part of a wildcard decode. E.g. all 32-byte strings are valid raw X25519 keys, but a 32-byte string may also be something else entirely, so raw X25519 can never participate in such a scheme.
Design problems aside, I see no reason why it requires this backwards parse. Even if the caller forgets to tell you the format, that problem only applies at the top level. Once you've decided to try, say, DER PKCS#8 PrivateKeyInfo, or PEM[*], all parsing from then on is unambiguous. You only need to try ambiguous cases and backtrack at the top-level. And, of course, even if you did need to backtrack on the inner layers (again, you do not), that's still no reason to build the graph ahead of time.
You really don't need to pre-explore all possible call graphs like this. It's just overhead, not flexibility.
[*] PEM and DER are not analogous, so a good parser API design would take this into account. PEM contains a type header, so the caller can actually say "I have a PEM private key" and let the library dispatch between "BEGIN RSA PRIVATE KEY" and "BEGIN PRIVATE KEY". Whereas a DER encoding is not necessarily sufficient to identify the structure, so the caller must say "I have a DER RSAPrivateKey" or "I have a DER PrivateKeyInfo".
Design problems aside, I see no reason why it requires this backwards parse.
Yeah, sure. I did not say that supporting this wildcard decode isn't possible with forward parsing. I am just saying we will have to be able to backtrack even to the topmost level.
Now OpenSSL 3.4.0-dev seems to be <2 times slower than 1.1.1w, with the OP code (on a noisy intel x86_64 machine):
hyperfine -r 100 -w 10 'LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_pem_rsa-ossl3-dev' 'LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_pem_rsa-ossl1-dev'
Benchmark 1: LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_pem_rsa-ossl3-dev
Time (mean ± σ): 100.2 ms ± 15.8 ms [User: 98.8 ms, System: 1.5 ms]
Range (min … max): 89.6 ms … 147.7 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_pem_rsa-ossl1-dev
Time (mean ± σ): 51.3 ms ± 13.6 ms [User: 49.7 ms, System: 1.8 ms]
Range (min … max): 40.6 ms … 87.0 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
'LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_pem_rsa-ossl1-dev' ran
1.95 ± 0.60 times faster than 'LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_pem_rsa-ossl3-dev'
hyperfine -r 100 -w 10 'LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_der-ossl3-dev' 'LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_der-ossl1-dev'
Benchmark 1: LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_der-ossl3-dev
Time (mean ± σ): 37.8 ms ± 10.5 ms [User: 36.9 ms, System: 1.4 ms]
Range (min … max): 29.7 ms … 84.5 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_der-ossl1-dev
Time (mean ± σ): 31.4 ms ± 8.6 ms [User: 29.9 ms, System: 1.8 ms]
Range (min … max): 13.5 ms … 46.2 ms 100 runs
Summary
'LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_der-ossl1-dev' ran
1.20 ± 0.47 times faster than 'LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_der-ossl3-dev'
Now OpenSSL 3.4.0-dev seems to be <2 times slower than 1.1.1w, with the OP code (on a noisy intel x86_64 machine):
I am afraid this is as fast as it can be with the current design.
Interesting that the der decoding was quite a bit faster than pem. I wonder why that might be?
I observed that base64 decoding (EVP_DecodeUpdate) was the slowest for the pem case.
I observed that base64 decoding (EVP_DecodeUpdate) was the slowest for the pem case.
You mean that base64 decoding was significantly slower in 3.4 than 1.1.1?
I haven't checked that; what I saw was that in 3.4, in the pem decoding case, the base64 decoding operation was the slowest. I can share profiler output if necessary (I don't have access to that right now).
Interesting that the der decoding was quite a bit faster than pem. I wonder why that might be?
My suspicion is the chaining - the der decoding does not chain any decoders, it directly does der2key decoding. Where in the pem decoding case there are pem2der and der2key decoders chained.
It can be an issue. And we can also have a bottleneck on base64 itself, as I don't think we have ever optimized it.
I am afraid this is as fast as it can be with the current design.
Keep in mind the notes here. The current design is parsing things in the wrong direction and is doing piles and piles of unnecessary work. https://github.com/openssl/openssl/issues/15199#issuecomment-1631038583
Setting aside whether all the design goals were wise, OpenSSL could still have met its design goals with a better implementation. These are, at the end of the day, not very complex formats.
Regarding base64, keep in mind that PEM is often used to carry private keys. We actually switched our base64 decoder to a constant-time one.
We have two sample programs to parse PKCS8 RSA private keys. First to parse one in PEM format:
With OpenSSL 1.1.1k, it takes about 0.017s to execute, while with master (6ef2f71ac70aff99da277be4a554e3b1fe739050) it takes about 1.4s.
Next one to decode a DER formatted private key:
This takes about 0.005s with 1.1.1k, while it takes about 0.44s with master. I'm using x64 linux, with Core i7-8700K.