open-eid / digidoc4j

DigiDoc for Java. Javadoc:
http://open-eid.github.io/digidoc4j
GNU Lesser General Public License v2.1
72 stars 40 forks source link

CRITICAL: if first tsl.refresh() fails, then library gets into invalid state and not able to verify any signatures #109

Closed cbxp closed 1 year ago

cbxp commented 2 years ago

Like is recommended, we call tsl.refresh() manually on the app start.

However, if there is no cache yet and the first request fails, the library does not remember that it is left in the invalid state and doesn't try to reload TSL on demand like it would do if tsl.refresh() wouldn't be called manually.

If this happens, tsl.refresh() logs some errors from other threads, but completes without exceptions.

All subsequent signature verifications result in OCSP errors until tsl.refresh() is called manually again and doesn't fail internally.

We have experienced this problem in several different applications on production. This is a critical issue.

Log of TSL loading failure:

INFO RetryExec - I/O exception (java.net.SocketException) caught when processing request to {s}->https://ec.europa.eu:443: Network unreachable

Logs when signature verification fails (due to previous TSL loading failure):

INFO CommonCertificateVerifier - + New CommonCertificateVerifier created.
INFO Configuration - Source by country <EE> not found, using default TSP source
Content is not allowed in prolog.
INFO XAdESLevelBaselineT - ====> Extending: IN MEMORY DOCUMENT
INFO CommonCertificateVerifier - + New CommonCertificateVerifier created.
INFO XAdESCertificateSource - +XAdESCertificateSource
INFO OnlineTSPSource - TSP Status: Operation Okay
INFO OnlineTSPSource - TSP SID : SN 9341197341178141782522153967405961574, Issuer C=EE,O=AS Sertifitseerimiskeskus,CN=EE Certification Centre Root CA,E=pki@sk.ee
INFO XAdESCertificateSource - +XAdESCertificateSource
INFO AIACertificateSource - Retrieving C-27B3CDA7A21ED109CE56A0A0D99512CED563645BD8C0515171957B6C90B0C21A certificate's issuer using AIA.
WARN CertificateReorderer - Issuer not found for certificate C-187BDA4F75689E5708B0B2288D8EA6E0D6C07EB8E7BD2C92859378349384BDD9
WARN SignatureValidationContext - External revocation check is skipped for untrusted certificate : C-27B3CDA7A21ED109CE56A0A0D99512CED563645BD8C0515171957B6C90B0C21A
WARN SignatureValidationContext - No revocation found for the certificate C-27B3CDA7A21ED109CE56A0A0D99512CED563645BD8C0515171957B6C90B0C21A
INFO AIACertificateSource - Retrieving C-187BDA4F75689E5708B0B2288D8EA6E0D6C07EB8E7BD2C92859378349384BDD9 certificate's issuer using AIA.
WARN SignatureValidationContext - External revocation check is skipped for untrusted certificate : C-187BDA4F75689E5708B0B2288D8EA6E0D6C07EB8E7BD2C92859378349384BDD9
WARN SignatureValidationContext - No revocation found for the certificate C-187BDA4F75689E5708B0B2288D8EA6E0D6C07EB8E7BD2C92859378349384BDD9
WARN SignatureValidationContext - External revocation check is skipped for untrusted certificate : C-E73F1F19A4459A6067A45E84DB585D6C1DF8F12A739D733F5B28996546F1875A
WARN SignatureValidationContext - No revocation found for the certificate C-E73F1F19A4459A6067A45E84DB585D6C1DF8F12A739D733F5B28996546F1875A
WARN LogHandler - Revocation data is missing for one or more POE(s). [POE 'C-27B3CDA7A21ED109CE56A0A0D99512CED563645BD8C0515171957B6C90B0C21A' not covered by a valid revocation data (nextUpdate : null)]
WARN LogHandler - Fresh revocation data is missing for one or more certificate(s). [Revocation data is skipped for untrusted certificate chain for the token : 'C-27B3CDA7A21ED109CE56A0A0D99512CED563645BD8C0515171957B6C90B0C21A', Revocation data is skipped for untrusted certificate chain for the token : 'C-187BDA4F75689E5708B0B2288D8EA6E0D6C07EB8E7BD2C92859378349384BDD9']
INFO CommonCertificateVerifier - + New CommonCertificateVerifier created.
INFO CommonCertificateVerifier - + New CommonCertificateVerifier created.
INFO XAdESCertificateSource - +XAdESCertificateSource
ERROR AsicSignatureFinalizer - Signature does not contain OCSP response
ERROR ErrorHandler - Unhandled exception
Apr 26 09:32:07 (Signature ID: id-c30ed397d743c511721777383407ac0d) - OCSP request failed. Please check GitHub Wiki for more information: https://github.com/open-eid/digidoc4j/wiki/Questions-&-Answers#if-ocsp-request-has-failed
Apr 26 09:32:07     at org.digidoc4j.impl.asic.AsicSignatureFinalizer.validateOcspResponse(AsicSignatureFinalizer.java:168)
Apr 26 09:32:07     at org.digidoc4j.impl.asic.AsicSignatureFinalizer.createSignature(AsicSignatureFinalizer.java:111)
Apr 26 09:32:07     at org.digidoc4j.impl.asic.AsicSignatureFinalizer.finalizeSignature(AsicSignatureFinalizer.java:87)
Apr 26 09:32:07     at org.digidoc4j.DataToSign.finalize(DataToSign.java:93)
cbxp commented 2 years ago

As a side note, talking to developers, it seems that everybody has gotten this problem in production at some point, and everybody had to spend time investigating why OCSP suddenly starts failing after working properly with the same config (pointing to a useless wiki page in this case), then resorting to hourly or even more frequent TSL refreshes as a workaround.

naare commented 2 years ago

Thank you for your valuable input on debugging this problem. The fix has been added to the next release scope, which is currently planned in second part of May.

rsarendus commented 2 years ago

In regards to getting information about the TSL state after the refresh, in the current Digidoc4j (version 4.3.0), the summary of the TSL validation can be queried via configuration.getTSL().getSummary(). An object of type TLValidationJobSummary is returned which can be used in making a decision of whether the TSL refresh has been successful or not.

For the next release of Digidoc4j, we are proposing an option to configure a callback:

public interface TSLRefreshCallback {
  boolean ensureTSLState(TLValidationJobSummary summary);
}

The callback will enable the execution of the same TSL health checking logic regardless of whether the TSL refresh was triggered manually via configuration.getTSL().refresh() or automatically by signing or validation logic, and without the need to explicitly query the summary of the TSL validation process.

The callback will be called after the TSL refresh has been run but before the TSL refresh time is updated. Depending on the state of the TSL, the callback can throw an exception in order to completely stop further processing, or it can return either true or false to indicate whether the TSL refresh time should be updated or not. In case the TSL refresh time is not updated (TSL is marked as expired), an automatic TSL refresh will be triggered again the next time a signing or validation process is accessing the TSL, otherwise an automatic refresh won't be triggered before the next TSL expiration time.

A default implementation of the callback will be provided by Digidoc4j, but creating custom implementations of the callback will also be possible in case some specific behaviour is necessary for deciding whether a TSL refresh has been successful or not.

Any feedback on the proposal is welcome.

tenor-dev commented 2 years ago

We have nothing against a callback, but the default behaviour SHOULD NOT replace current state of TSL with an empty one in case of loading failure.

That means it should continue using the old TSL until it gets a working new one. Whether the refresh was triggered manually or automatically during signing. The library should never be left in an invalid state where it doesn't accept any signatures just because europa.int website did not respond at the time of refresh.

naare commented 2 years ago

If LOTL or any required trusted lists (or all trusted lists) fail to refresh, new default handling will throw an exception which stops the processes that triggered the TSL refresh and marks the TSL in Digidoc4j as expired. Meaning that it is not possible to continue with validation/signing until the refresh succeeds.

This is done to ensure that LOTL and TSL are refreshed in regular intervals (by default 24h). TSL and LOTL themselves have expiration of no longer than 6 months, but it is obligatory for the TSL provider to publish all the changes in trust services within 24h of receiveing the change request. This means that any change with trust service (compromise of CA certificate for example) will be present in TSL on 24h interval. Using TSL for longer period increases the possibility of using outdated information.

It is possible to override this behavior with custom callback function, if the risks of using old TSL info are acceptable for your business case.

tenor-dev commented 2 years ago

It happened again to our production system today, we retried it automatically (as a workaround to this bug), and it was successful the second time:

May 23 12:39:32 12333 [pool-2-thread-1] ERROR AbstractAnalysis - Unable to process GET call for url [https://ec.europa.eu/tools/lotl/eu-lotl.xml]. Reason : [Read timed out]
May 23 12:39:32 12334 [digidoc4j] INFO TLValidationJob - Analysis is DONE for 1 LOTLSource(s)
May 23 12:39:32 12335 [digidoc4j] INFO TLValidationJob - No TL to be analyzed
May 23 12:39:32 12341 [digidoc4j] INFO TLValidationJob - Online refresh is DONE.

I think that a much better default would be the following in case of refresh failure:

  1. If there are already loaded some working certs, then log a warning, but continue using the previous ones (to ensure undisrupted service)
  2. If there is nothing yet, then retry at least once before failing and throwing an exception

Such behaviour will not need tweaking by every developer. That would be a sane default bahavour. And then, if somebody needs something else, let them implement their own callback.

I am pretty sure that most developers want to have a stable system by default. Currently, people discover this problem in production and have to implement workarounds. It would be nice if workarounds would not be needed for most projects.

BTW, as a side note, [pool-2-thread-1] ERROR AbstractAnalysis is not very descriptive, not from the thread name, nor from the class name is it easy to understand that it is a critical error coming from digidoc4j.

naare commented 2 years ago

Thank you for your input on this topic. LOTL/TSL handling is a complicated process handled by multiple software modules. Since we are already in the release process of the version 5.0.0, its not feasible to introduce any more changes right now. However, we are open to all suggestions and we are gathering feedback in order to have more information for considering improvements in TSL handling for the next release.

angryziber commented 2 years ago

@naare what is the new behaviour in case refresh() is not called manually? Will it just keep failing after the first fail? If it's so, it must be documented in bold letters at the beginning of the docs.

naare commented 2 years ago

Digidoc4j 5.0.0 has now been released with changes in TSL loading. The original problem where refresh was not automatically retried when the first load failed (cache was still marked as updated) should be fixed.

Please take a look on the release notes where you can also find references to relevant wiki chapters.

naare commented 1 year ago

I will close the issue as the 5.0.0 has been released for a year and no additional feedback has been given.