DDvO opened this issue 3 years ago

In the context of #14833 I tentatively limited the number of successful {m,re}alloc calls and found that OpenSSL application runs can cause terribly many such calls!

The worst example I found in our test suite is `make test TESTS="test_encoder_decoder"`, which takes between 5,400,000 and 5,500,000 such calls.
Yes :(
BTW, the rather blind trial-and-error approach in OSSL_STORE when parsing credential files also appears pretty inefficient.
Though `make test TESTS="test_encoder_decoder"` requires "only" some 210,000 to 220,000 mem alloc calls per run.
Yes, it's very inefficient. How else should we parse files whose formats we don't necessarily know? The parsing is from a provider and there will be (are?) providers written by other people.
I don't have much insight into the provider topic, but as far as known credential types and formats (in particular, PEM and DER/ASN.1) are involved, these formats include some header information that should help at least exclude most of the multitude of cases. The remaining actually possible case(s) should be tried first (in the order of how promising/likely they are, if more than one) before trying externally provided parsers, about which perhaps no assumptions can be made.
PEM has a type header, but DER does not. Two ASN.1 structures whose top-level tag is SEQUENCE will have the same set of initial headers. You have to continue parsing to distinguish them.
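To make that concrete, here is an illustrative pair of DER prefixes (hand-constructed for this sketch, not taken from real keys): both structures open with the same SEQUENCE-plus-INTEGER header pattern, so the initial bytes alone do not identify the structure.

```c
/* Illustrative DER prefixes: both an RSAPrivateKey and an ECPrivateKey
 * begin SEQUENCE { INTEGER version, ... }, so the initial tag/length
 * headers alone do not identify the structure. */
static const unsigned char rsa_priv_prefix[] = {
    0x30, 0x82, 0x04, 0xa4,  /* SEQUENCE, long-form length */
    0x02, 0x01, 0x00,        /* INTEGER 0 (RSAPrivateKey version) */
    /* ... modulus, exponents, ... */
};
static const unsigned char ec_priv_prefix[] = {
    0x30, 0x77,              /* SEQUENCE, short-form length */
    0x02, 0x01, 0x01,        /* INTEGER 1 (ECPrivateKey version) */
    /* ... privateKey OCTET STRING, ... */
};
```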
But I think the header business is a red herring. I think the bigger issue is why the provider APIs require OpenSSL to guess at what format it is parsing. That sounds like a design flaw. Having to guess formats results in fragile systems and security issues. (Suppose the two syntaxes overlap and attackers can force ambiguous inputs in.) For examples, see the many, many problems that have come from content-sniffing on the web.
Instead, APIs to decode or encode structures should require the caller pass in enough information to determine the format.
> Instead, APIs to decode or encode structures should require the caller pass in enough information to determine the format.
No, that is impossible - it would mean you always know what you're reading and decoding. But you simply do not have this information in many cases. But yes, in cases where the caller knows what it is decoding, it should supply the information and the decoding process should limit the decoders to only the relevant ones. IMO that is already largely the case, but maybe there are callers which do not fully supply the information.
Can you give an example of when the caller would not know?
Reading a private key to perform a signature where I do not know whether the key is in PEM or DER format and what algorithm the key uses? Something quite common, and possible with the current interfaces that give you crypto agility.
It is actually much less common that the caller knows exactly what the user is trying to load.
And yes, we need to support both cases:

* the caller knows exactly what he is loading - in that case it should not try decoders that are irrelevant and it should properly fail if the supplied data is something else than what was expected
* the caller does not know anything or has only some partial knowledge - in that case the decoding code should be flexible enough to load the data by trying the available decoders
And IMO, apart from possible bugs, the current code already handles both cases.
DER is not a single format. There are DER SubjectPublicKeyInfos, DER RSAPublicKeys, etc., each of which need separate parsing entrypoints because DER structures are not, a priori, distinguishable. Treating PEM as a single entrypoint is a bit more plausible because there is a type in it.
Not knowing what algorithm the key uses is fine. (Well, it's actually pretty dangerous, but we'll set that aside.) There are plenty of algorithm-agnostic formats like PEM, SubjectPublicKeyInfo, PrivateKeyInfo, etc. But if you're instead trying to parse an algorithm-specific format like RSAPublicKey, X9.62 EC point serialization, raw Ed25519 keys, etc., you really need to know what format you've got.
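For illustration, OpenSSL itself already exposes such separate entry points; a minimal sketch using the long-standing d2i functions (`d2i_RSAPublicKey()` is deprecated in 3.0, shown here only to illustrate the agnostic/specific distinction):

```c
#include <openssl/evp.h>
#include <openssl/rsa.h>
#include <openssl/x509.h>

/* Algorithm-agnostic entry point: DER SubjectPublicKeyInfo. */
EVP_PKEY *parse_any_public_key(const unsigned char *der, long len)
{
    return d2i_PUBKEY(NULL, &der, len);
}

/* Algorithm-specific entry point: PKCS#1 RSAPublicKey.
 * (Deprecated in 3.0; the caller explicitly commits to RSA here.) */
RSA *parse_rsa_public_key(const unsigned char *der, long len)
{
    return d2i_RSAPublicKey(NULL, &der, len);
}
```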
Yes, it's true that many pairs of formats coincidentally don't overlap. RSAPrivateKey and ECPrivateKey are both SEQUENCEs that begin with an INTEGER, but the fields afterwards don't align. But there's no guarantee of this, especially for a generic pluggable provider framework. Any 32-byte string can be parsed as either a raw Ed25519 private key or a raw X25519 private key (not the same). Maybe some other provider also sometimes serializes as 32 bytes. Even PEM and a DER structure, while non-overlapping on the surface, overlap. PEM usually ignores non-PEM lines and DER encodes bytes as-is. So an X.509 certificate or PKCS#8 PrivateKeyInfo that happened to include a PEM string in an attribute would parse as either.
This overlap means sniffing-based parsers can be tricked into misinterpreting an input in the wrong format. This can result in security bugs, such as my favorite paper title ever. :-)
The example isn't very concrete, but I'm guessing you mean some command-line utility that tries to be friendly and guess the format of some random file passed on the command line. This is very risky, but sure, there are applications like that. Those can still be supported without blessing the dangerous patterns in OpenSSL's API. Instead, the applications can call the separate parsers explicitly. It means they must enumerate the formats, but that's desirable. It limits the confusable formats to a closed set that can be analyzed for overlap. It also avoids encouraging this behavior in other applications, where this implicitness is a security bug. Those applications can still support providers by including an algorithm-agnostic format.
As an analogy, suppose an application is stuck needing to guess base64 vs. hex. You don't see standard libraries provide a `decodeBase64OrHex` function. The functions are simply `decodeBase64` and `decodeHex`. If the application needs to guess, the application can try one followed by the other. That's good: base64 and hex encodings overlap, and this kind of guessing is usually dangerous, so it should be an explicit application choice.
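A sketch of that application-side pattern in C, using OpenSSL helpers that exist for both encodings (`OPENSSL_hexstr2buf()` and `EVP_DecodeBlock()`); note that `EVP_DecodeBlock()` includes base64 padding in its returned length, which a real caller would have to adjust for:

```c
#include <string.h>
#include <openssl/crypto.h>
#include <openssl/evp.h>

/* Application-level guessing: try hex, then base64. The order is an
 * explicit application decision; inputs like "deadbeef" are valid in
 * both encodings, so the ambiguity is visible and analyzable here. */
unsigned char *decode_hex_or_base64(const char *in, long *outlen)
{
    unsigned char *buf = OPENSSL_hexstr2buf(in, outlen);
    int n;

    if (buf != NULL)
        return buf;                           /* valid hex */
    buf = OPENSSL_malloc(strlen(in) + 1);     /* base64 never expands */
    if (buf == NULL)
        return NULL;
    n = EVP_DecodeBlock(buf, (const unsigned char *)in, strlen(in));
    if (n < 0) {                              /* not base64 either */
        OPENSSL_free(buf);
        return NULL;
    }
    *outlen = n;                              /* includes padding bytes */
    return buf;
}
```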
> And yes, we need to support both cases:
>
> * the caller knows exactly what he is loading - in that case it should not try decoders that are irrelevant and it should properly fail if the supplied data is something else than what was expected
> * the caller does not know anything or has only some partial knowledge - in that case the decoding code should be flexible enough to load the data by trying the available decoders
>
> And IMO, apart from possible bugs, the current code already handles both cases.
Unfortunately, at least for `file_load()` in `file_store.c`, as typically called by `OSSL_STORE_load()`, which in turn is called by `load_key_certs_crls()` in `apps.c`, there is a related major inefficiency: `decoder_process()` in `decoder_lib.c` currently runs through 160 decoder instances regardless of whether `OSSL_STORE_expect(ctx, expect)` has been called, e.g., with `expect` being `OSSL_STORE_INFO_CERT`. The latter is just used to filter the results after they have been parsed. This may be a partly inherent problem for, e.g., PEM and PKCS#12 files, which form a "mixed bag" potentially containing not only certs but also keys etc.
Even worse, all iterations are performed regardless of whether one of them has already succeeded in parsing, e.g., a cert - @levitte, why?
Moreover, the 160 iterations are done not only for each credential found in the file but also at EOF.
So if a file holds one cert, 2*160 decoders are called :frowning_face:
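For reference, the calling pattern in question looks roughly like this (a minimal sketch of the public API use, error handling trimmed). The point is that `OSSL_STORE_expect()` is called before `OSSL_STORE_load()`, so the information needed to prune the decoder chain is available up front:

```c
#include <openssl/store.h>
#include <openssl/x509.h>

/* Load the first certificate found at the given URI via OSSL_STORE.
 * OSSL_STORE_expect() announces that only certs are wanted, but (at the
 * time of this thread) the file store still ran the full decoder chain
 * and merely filtered the results afterwards. */
X509 *load_first_cert(const char *uri)
{
    OSSL_STORE_CTX *ctx = OSSL_STORE_open(uri, NULL, NULL, NULL, NULL);
    X509 *cert = NULL;

    if (ctx == NULL)
        return NULL;
    if (OSSL_STORE_expect(ctx, OSSL_STORE_INFO_CERT)) {
        OSSL_STORE_INFO *info = OSSL_STORE_load(ctx);

        if (info != NULL) {
            cert = OSSL_STORE_INFO_get1_CERT(info);
            OSSL_STORE_INFO_free(info);
        }
    }
    OSSL_STORE_close(ctx);
    return cert;
}
```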
> Even worse, all iterations are performed regardless of whether one of them has already succeeded in parsing, e.g., a cert - @levitte, why?
This might be better in current master. Did you try it?
However yes, the file store behavior is really bad and we will probably need to do something about it. I'd like to at least give hints in the form of decoder parameters to the store as proposed in #14569.
Has anyone measured the time it takes for the 1.1 "load from PEM" APIs compared to the current master "load from PEM" APIs?
> This might be better in current master. Did you try it?
It's still this way: 2*160 decoder loop iterations for loading 1 credential.
@davidben, I agree with many of your points, but one big downside of the enumeration of formats of interest approach is that it requires knowing all the possible formats beforehand.
For example, if a provider plugged in support for the PQFOOBAR algorithm previously unknown to OpenSSL, applications that list the possible formats they are interested in would be cut out from using PQFOOBAR keys without code changes.
In the end it would basically mean that you would need to fork OpenSSL to add full support for new algorithms, defeating several of the goals of the provider design.
> Has anyone measured the time it takes for the 1.1 "load from PEM" APIs compared to the current master "load from PEM" APIs?
I've just done some coarse measurements at app level loading 3 certs from PEM files, with malloc stats:
```
time apps/openssl verify -trusted test/certs/root-cert.pem -untrusted test/certs/ca-cert.pem test/certs/ee-cert.pem
```
For the current master, this yields
```
malloc count: 18790 realloc count: 960 malloc total size: 1508819
real 0m0.016s
user 0m0.015s
sys 0m0.001s
```
whereas for 1.1.1 the figures are
```
malloc count: 6552 realloc count: 682 malloc total size: 194318
real 0m0.005s
user 0m0.004s
sys 0m0.001s
```
using this patch:
```diff
diff --git a/apps/verify.c b/apps/verify.c
index 718174a83d..fad17a3d92 100644
--- a/apps/verify.c
+++ b/apps/verify.c
@@ -239,6 +239,7 @@ int verify_main(int argc, char **argv)
     sk_X509_CRL_pop_free(crls, X509_CRL_free);
     sk_OPENSSL_STRING_free(vfyopts);
     release_engine(e);
+    OPENSSL_malloc(0); /* print stats */
     return (ret < 0 ? 2 : ret);
 }
diff --git a/crypto/mem.c b/crypto/mem.c
index d682a3686f..49239047ff 100644
--- a/crypto/mem.c
+++ b/crypto/mem.c
@@ -162,8 +162,16 @@ void ossl_malloc_setup_failures(void)
 }
 #endif
 
+static size_t realloc_num = 0;
 void *CRYPTO_malloc(size_t num, const char *file, int line)
 {
+    static size_t malloc_num = 0;
+    static size_t malloc_total = 0;
+    malloc_num++;
+    malloc_total += num;
+    if (num == 0)
+        printf("malloc count: %zu realloc count: %zu malloc total size: %zu\n",
+               malloc_num, realloc_num, malloc_total);
     INCREMENT(malloc_count);
     if (malloc_impl != CRYPTO_malloc)
         return malloc_impl(num, file, line);
@@ -198,6 +206,7 @@ void *CRYPTO_zalloc(size_t num, const char *file, int line)
 
 void *CRYPTO_realloc(void *str, size_t num, const char *file, int line)
 {
+    realloc_num++;
     INCREMENT(realloc_count);
     if (realloc_impl != CRYPTO_realloc)
         return realloc_impl(str, num, file, line);
```
Here's a spot where @DDvO makes a good point... the diverse decoder implementations are added more than once to the chain, precisely once for each possible name of the target type implementation (keymgmt in the EVP_PKEY case). So for example, with RSA having all the names below, that's how many times the DER-to-RSA decoder is added:
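(The list of names was shown here; for reference, reconstructed from the default provider's algorithm tables, where the exact alias set may vary by version:)

```c
/* The default provider registers the RSA keymgmt under all of these
 * colon-separated aliases, so a per-name decoder lands in the chain
 * once per alias, i.e. three times for RSA. */
#define PROV_NAMES_RSA "RSA:rsaEncryption:1.2.840.113549.1.1.1"
```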
It should be possible to optimize this so that only one decoder per target type implementation, rather than per target type name, is added to the chain. I view that as a bug, so something that doesn't have to be fixed before the beta1 release... suffice to say that it's on my mind, but isn't top priority for the moment. Of course, if someone else wants to take a stab at it and gets us a decent fix, that would be a good thing for all.
Another factor is the possible input types, at least "pkcs8" and "type-specific". That one is harder to get away from.
> @davidben, I agree with many of your points, but one big downside of the enumeration of formats of interest approach is that it requires knowing all the possible formats beforehand.
>
> For example, if a provider plugged in support for the PQFOOBAR algorithm previously unknown to OpenSSL, applications that list the possible formats they are interested in would be cut out from using PQFOOBAR keys without code changes.
>
> In the end it would basically mean that you would need to fork OpenSSL to add full support for new algorithms, defeating several of the goals of the provider design.
I do not understand the discussion regarding key types. When a user or app wants to load a cert (from a file, stream, etc.), the only degree of freedom is whether it is in PEM or ASN.1 DER format - and this can be determined clearly by looking at the first byte of the contents. The same should hold when loading a private key, a public key, or any other type of credential. This should work regardless of which type of key is in the cert to be loaded or which type of (private/public) key is to be loaded. If not, I consider this a design flaw of the PEM/DER encoding.
> I do not understand the discussion regarding key types. When a user or app wants to load a cert (from a file, stream, etc.), the only degree of freedom is whether it is in PEM or ASN.1 DER format - and this can be determined clearly by looking at the first byte of the contents.
(Note: "first byte" is false; a PEM file can have arbitrary text around the PEM contents... as a matter of fact, I've seen more than one certificate PEM file that's the whole output of `openssl x509 -text -in cert.der -inform der`, i.e., the whole human-readable output followed by the PEM content. That's been legitimate since the beginning of time.)
Decoding PEM into DER is simple. Decoding DER into an EVP_PKEY for key types that libcrypto doesn't have internal knowledge of is a completely different ballgame.
> I do not understand the discussion regarding key types. When a user or app wants to load a cert (from a file, stream, etc.), the only degree of freedom is whether it is in PEM or ASN.1 DER format - and this can be determined clearly by looking at the first byte of the contents.

> (Note: "first byte" is false; a PEM file can have arbitrary text around the PEM contents... as a matter of fact, I've seen more than one certificate PEM file that's the whole output of `openssl x509 -text -in cert.der -inform der`, i.e., the whole human-readable output followed by the PEM content. That's been legitimate since the beginning of time.)
Right, but ASN.1 content does not allow such an arbitrary prefix and must start with `0x30` (`V_ASN1_SEQUENCE | V_ASN1_CONSTRUCTED`). So the only possible ambiguity is when the input starts with `'0'`. In such a case the parser can use some more lookahead to determine whether the file is binary or not. In the worst case it must scan as long as it finds printable chars (or CR or LF) until it can match a PEM header. So an automatic PEM vs. DER input distinction is feasible.
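A minimal sketch of that heuristic (a hypothetical helper, not an OpenSSL API; davidben's overlap caveat above still applies to any such sniffing):

```c
#include <ctype.h>
#include <stddef.h>

enum guessed_format { FMT_DER, FMT_PEM, FMT_UNKNOWN };

/* Guess PEM vs. DER per the argument above: DER must start with 0x30
 * (constructed SEQUENCE), PEM content is printable text, so only an
 * input starting with '0' (== 0x30) needs further lookahead. */
enum guessed_format guess_format(const unsigned char *buf, size_t len)
{
    size_t i;

    if (len == 0)
        return FMT_UNKNOWN;
    if (buf[0] != 0x30)                /* cannot be DER */
        return FMT_PEM;
    for (i = 0; i < len; i++) {        /* starts with '0': look further */
        if (!isprint(buf[i]) && buf[i] != '\r' && buf[i] != '\n')
            return FMT_DER;            /* binary byte: treat as DER */
    }
    return FMT_PEM;                    /* all printable: treat as PEM */
}
```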
> Decoding PEM into DER is simple.
Yes.
> Decoding DER into an EVP_PKEY for key types that libcrypto doesn't have internal knowledge of is a completely different ballgame.
Maybe there are ambiguities within the DER encoding of EVP_PKEY (which I'd consider a design flaw). Yet for all other types of credentials there should not be such ambiguities (hopefully), and in particular when a cert is expected there should be exactly one DER parser that is applicable.
> Decoding DER into an EVP_PKEY for key types that libcrypto doesn't have internal knowledge of is a completely different ballgame.
This should not be a requirement for 3.0. Especially not if it's three times slower than it was. That's not a trade-off that most OpenSSL users would like.
I know you implied you'll look at it after beta, I just want to say how important it is.
The OSSL_DECODER itself does not have such high overhead. It is more likely that we are taking some code paths that are bound to fail multiple times.
> and in particular when a cert is expected there should be exactly one DER parser that is applicable.
Right... but that requires prior knowledge of what there is in there, and it very much depends on which route you take. OSSL_STORE is designed to be a "throw whatever at me and I'll figure it out for you" machine, and it relies quite heavily on OSSL_DECODER, with minimal input, to help figure things out.
... now that I'm thinking of it, it's quite possible that pre-fetching all decoder implementations that the decoding process may need wasn't the best choice. Fetching them on demand during the process might have been the better choice.
For the practically important special cases of loading a cert or a list of certs (and likewise for CRLs) we might branch back to what we had immediately before switching to `OSSL_STORE` (which would amount to reverting part of b3c5aadf4ce3dc2207cc605726bf370a55b531c9): using the specific PEM, DER, and PKCS#12 loaders. Of course, only internally, such that the high-level calls with their convenience and flexibility would not change.
I don't think fixing the apps is that important; making the APIs that downstream developers used to use have no performance regression is more important.
> I don't think fixing the apps is that important; making the APIs that downstream developers used to use have no performance regression is more important.
One fix for that is #15045
Of course "no performance regression" is unrealistic given the new features and the whole provider machinery and the need to support both providers and legacy. But the current overhead looks a little extreme. It would be interesting to check how do the memory allocations look like on some simple code using PEM_read_PrivateKey_bio() (with #15045) or d2i_PrivateKey().
> @davidben, I agree with many of your points, but one big downside of the enumeration of formats of interest approach is that it requires knowing all the possible formats beforehand.
>
> For example, if a provider plugged in support for the PQFOOBAR algorithm previously unknown to OpenSSL, applications that list the possible formats they are interested in would be cut out from using PQFOOBAR keys without code changes.
>
> In the end it would basically mean that you would need to fork OpenSSL to add full support for new algorithms, defeating several of the goals of the provider design.
Not at all. It merely requires understanding the various serialization formats a little better, and handling the extensibility in a more sound manner. Suppose you wish to plug in PQFOOBAR into OpenSSL, such that existing applications automatically use it. Consider what those applications are doing:
If the application is defined to store DER RSAPrivateKey structures, well, you're out of luck. That application uses an RSA-specific serialization and it is specifically trying to just do RSA. That's not inherently wrong, it's just the application wasn't trying to do anything algorithm-agnostic.
If the application is defined to store DER SubjectPublicKeyInfo structures, DER PrivateKeyInfo, or PEM structures, that's fine. Those formats are already algorithm-agnostic. If the application uses those formats, it can potentially support new algorithms and there is no guessing necessary.
And so I would encourage you all to think of the APIs from that angle. Treat `ParseSPKI`, `ParsePKCS8`, `ParseRSAPrivateKey`, and `ParsePQFooBarSpecificFormat` as separate operations that the caller should explicitly pick from. (Whether, mechanically, these are separate functions or one big function with a stringly-typed "format" parameter is orthogonal. I prefer separate functions, but I gather stringly-typed is now the prevailing OpenSSL style. So it goes. :-/ )
Some of these operations (the first two) are implementable by many key types. Others (the last two) are implementable only by specific key types. That's no different from how PSS salt length is an RSA-PSS-only notion, streaming signing doesn't work for Ed25519, but single-shot signing works for all signature schemes.
If you do this, there shouldn't be any of this guessing stuff, which will both perform better and avoid security bugs from format ambiguities.
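In OpenSSL 3.0 terms, the closest existing mechanism is probably pinning down the decoder context explicitly; a sketch using the standard `OSSL_DECODER` API, so that only DER SubjectPublicKeyInfo decoders participate rather than every registered (input type, structure) pair:

```c
#include <openssl/decoder.h>
#include <openssl/evp.h>

/* Decode a DER SubjectPublicKeyInfo into an EVP_PKEY, with the format
 * stated up front by the caller: input type "DER", input structure
 * "SubjectPublicKeyInfo", any key type (NULL). */
EVP_PKEY *parse_spki_der(const unsigned char *der, size_t len)
{
    EVP_PKEY *pkey = NULL;
    OSSL_DECODER_CTX *dctx = OSSL_DECODER_CTX_new_for_pkey(
        &pkey, "DER", "SubjectPublicKeyInfo", NULL,
        EVP_PKEY_PUBLIC_KEY, NULL, NULL);

    if (dctx != NULL) {
        if (!OSSL_DECODER_from_data(dctx, &der, &len))
            pkey = NULL; /* decoding failed */
        OSSL_DECODER_CTX_free(dctx);
    }
    return pkey;
}
```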
> I don't think fixing the apps is that important; making the APIs that downstream developers used to use have no performance regression is more important.
Removing the inefficiencies at library level is certainly more important.
Yet even for the apparently relatively simple case of cert or CRL loading I wonder how this can be done.
The current `OSSL_STORE` API offers, via `OSSL_STORE_expect()`, a way to filter results after openly trying to parse everything in every imaginable way, but apparently there is no way to restrict the wanted type(s) of credentials beforehand (and thus cut down unnecessary flexibility) such that parsing other types of credentials is not attempted. Maybe the semantics of `OSSL_STORE_expect()` can be changed to do the latter?
Setting milestone so this gets discussed within OTC.
`OSSL_STORE_expect()` passes on the expect number to the OSSL_STORE loader implementation, and it's perfectly feasible for a loader implementation to use it for optimization. That has always been the case. Our current implementation may be a bit sloppy around that, so it should be possible to work that out.
I just had a look, and it's quite possible that I have a very easy optimization for the expect thing
Have a look at #15066
> Have a look at #15066
Cool - this already helped enormously for loading certs (and likely also CRLs)! For my simple 'benchmark' invocation:
```
time apps/openssl verify -trusted test/certs/root-cert.pem -untrusted test/certs/ca-cert.pem test/certs/ee-cert.pem
```
rather than 160 iterations, only 1 or 2 decoder process iterations per cert remain :smiley: and the output becomes
```
DEBUG[file_setup_decoders]: expect_evp_pkey = 0
DEBUG[file_setup_decoders]: expect_evp_pkey = 0
DEBUG[file_setup_decoders]: expect_evp_pkey = 0
test/certs/ee-cert.pem: OK
malloc count: 8888 realloc count: 725 malloc total size: 1160114
real 0m0.008s
user 0m0.007s
sys 0m0.001s
```
So most of the inefficiency of the current master regarding cert loading:
```
malloc count: 18790 realloc count: 960 malloc total size: 1508819
real 0m0.016s
user 0m0.015s
sys 0m0.001s
```
is gone and the figures are now not that much worse than for 1.1.1:
```
malloc count: 6552 realloc count: 682 malloc total size: 194318
real 0m0.005s
user 0m0.004s
sys 0m0.001s
```
The other optimization will be to get only one instance of each decoder implementation in the chain, for EVP_PKEY. I have some ideas, but won't get to that this week.
Interestingly, while the number of malloc calls with #15066 compared to 1.1.1 "just" increases from 6552 to 8888, the total number of bytes allocated (not taking deallocation into account) is still nearly 6 times higher, increasing from 194318 to 1160114.
So there must be further reasons why 3.0-alpha is much more memory hungry.
And for `make test TESTS="test_encoder_decoder"` I still get
```
malloc count: 9854794 realloc count: 24307 malloc total size: 1027522889
```
Of course "no performance regression" is unrealistic
Well, "obviously" I meant in the steady-state, after initialization, etc. Please see draft PR #15088 for a timing program.
This will be fixed when fixing #15538
> And for `make test TESTS="test_encoder_decoder" V=1` I still get
> malloc count: 9854794 realloc count: 24307 malloc total size: 1027522889
This has improved a lot:
```
malloc count: 1986083 realloc count: 10357 malloc total size: 150450151
```
Hmm, for
```
time apps/openssl verify -trusted test/certs/root-cert.pem -untrusted test/certs/ca-cert.pem test/certs/ee-cert.pem
```
my last measurement given above was:
```
malloc count: 6552 realloc count: 682 malloc total size: 194318
real 0m0.005s
user 0m0.004s
sys 0m0.001s
```
yet meanwhile the figures have worsened:
```
malloc count: 11264 realloc count: 757 malloc total size: 629597
real 0m0.008s
user 0m0.008s
sys 0m0.000s
```
> yet meanwhile the figures have worsened
The obvious reason being the change from legacy to provider based parsing of the public keys in certificates.
This got closed for a reason that appears not to be to the point: commit f0191d0b1373bb7b0c50a0103d63791f51ed3.
With the current master, the figures have improved since last year, but still are not good. For
```
test/endecode_test -rsa test/certs/ee-key.pem -pss test/certs/ca-pss-key.pem -context -config test/default.cnf -provider default
```
I just got
```
malloc count: 1051112 realloc count: 9370
```
For
```
apps/openssl verify -trusted test/certs/root-cert.pem -untrusted test/certs/ca-cert.pem test/certs/ee-cert.pem
```
I just got
```
malloc count: 7928 realloc count: 768
```
Reading this, I don't immediately see an actionable task here. @DDvO, could you (re)state what your goal is? It seems evident you would like to see fewer calls to *alloc (the goal being improved performance), but I'm missing a level of profiling that indicates (a) what the impact of the larger number of calls is in terms of performance, and (b) what a reasonable goal to achieve here is. With that information we may be able to make additional forward progress here.
The performance impact of memory allocation cannot be reasonably measured in isolation and assigned a goal figure as it is not the root cause of the issue.
The extreme number of *alloc calls is not only a matter of inefficiency in itself but more fundamentally an alarming symptom of overly complex design and implementation.
As is the excessive growth of execution time and memory consumption (compared to pre-3.0) of some aspects of OpenSSL, such as the DER handling mentioned above.
@DDvO That's all very true, but it doesn't answer the fundamental question - why is this issue here?
From my reading of the history of this issue:

* you identified that we had a huge number of memory allocations
* general discussion suggested decoder inefficiencies were at least partially the cause
* after much additional discussion and analysis, things improved
* you noted there was still a significant increase in memory usage in the decoders
* @levitte offered another patch to further improve the situation
* you noted that malloc counts were much better, but still not 'good'
* at which point the issue received no further comments for 2+ years
That last part concerns me greatly, because it suggests that, despite several things getting fixed, there's still "something" to do, but no one has offered a suggestion as to what that something is. Without knowing that, this issue isn't going to change, and no progress will be made.
Note, I'm not saying there isn't a problem here, only that there is no actionable statement of that problem, nor is there a proposal as to what to do about it.
Your statements above about this being alarming are all true, I can't deny that, but none of them points to a specific problem, or an action to take, and that's what we need to drive issues to closure, which is what I'm hoping to discern here in my comment.
Are you advocating for a radical redesign of the decoder process, with the goal of simplifying its mechanics and returning to a performance profile comparable to that of earlier OpenSSL releases? That's a huge undertaking, but in fact something that can be considered, and likely something we should have a separate issue (or rather an epic, with lots of issues) for.
Are you advocating for something more targeted? More targeted DER decoder handling? That's also great, but a statement to that effect would be appreciated, and something we can create a separate issue for.
Are you advocating for something else? Also fine, it's just unclear to me what that something is, and I would greatly appreciate clarification on that.
In summary, this issue has identified and fixed several problems (which is great, by the way), but without further concrete direction, I'm concerned it is going to continue to atrophy without resolution. So my ask to you is to identify what problem remains (and ideally make a suggestion as to what to do about it), so that we can plan what needs to be done.
I know this feels a bit like bean counting, and I apologize for that, but I've got this goal to get our issue backlog to something more manageable, and to do that, I need to get issues refined to the point where they are actionable.
I should note here as well that this isn't the first issue of this nature. We have lots of issues that I would describe as 'general concerns' that result in various fixes taking place, but no clear path to resolution. Perhaps it would be beneficial to create a label for something like 'investigatory' issues. Such issues could be opened, assigned to the creator, and researched at will, but would themselves have no directly actionable tasks (instead driving the creation of other, truly actionable tasks). That might give you the freedom to research and investigate farther-reaching concerns on an independent time frame, while slicing off specific fixes/changes that could reasonably be planned for given releases. Thoughts?
Thank you for your detailed response.
Yes, very good that this issue already sparked some improvements.
You are right that (regular) issues should be actionable with some tangible results, and this is nearly impossible here because the issue as given is too broad/superficial.
I also fully understand that issues like this are being questioned in the drive to cut down the vast number of open issues.
Unfortunately I do not know if and when I or others will have the time to investigate the sources of the symptom mentioned here in more detail. Clearly those who are deeper into the specific implementations (such as the DER stuff) would be better candidates to do so than me.
I very much like the idea of introducing a new category for general issues like this. Other projects have similar categories; for instance, Mozilla uses the tag 'meta'.
Agreed; the more I think about it, the more I like the idea of a category of "not directly actionable" issues that allow for general research and discussion and can be ignored for planning purposes. Let me think about how best to do that, and I'll get it implemented.