Closed by @peternied 7 months ago
Known issue in JDK: https://bugs.openjdk.org/browse/JDK-8221218. Maybe it's been resolved in JDK20
I have the same issue using the latest Helm charts and Docker images. Interestingly, it worked for a while; after re-creating the CA and certs it stopped working consistently.
Got the same issue. During a cluster migration from 2.8 to 2.9, one of the nodes could not start. The root cause is not clear so far.
[Triage] Going to leave this untriaged since we don't really know how to move forward yet. We can keep the issue open, though, and add more info if we encounter this further.
[Triage] Per @willyborankin's suggestion, you can reproduce it by starting a migration and adding a new node during the migration with the same certificate. Any fixes for the issue will be accepted. Likely related to a change around BC 1.76 or JDK 20.
A PR with BC 1.76 was merged in OpenSearch.
Hi guys. The problem still persists in v2.11.0. Could you kindly let us know in which version the fix will be available?
Also having this issue using the latest tag. Note that hostname verification is already off: plugins.security.ssl.transport.enforce_hostname_verification: false
And I am using proper plugins.security.nodes_dn settings.
Bug not resolved (15.01.2024). Workaround: use TLS 1.2 instead of TLS 1.3, either via the VM argument -Djdk.tls.client.protocols=TLSv1.2 or, if you configure the Netty SSL handler yourself:
SslHandler handler = sslContext.newHandler(socketChannel.alloc());
handler.engine().setEnabledProtocols(new String[] {"TLSv1.2"});
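The same protocol pinning from the workaround above can be exercised on a bare SSLEngine, with no network connection and no Netty dependency. This is a minimal standalone sketch (the class and method names are mine, not from any project):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class Tls12Pin {
    // Restrict an engine to TLSv1.2 only, mirroring the Netty workaround above.
    public static SSLEngine pinnedEngine() throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, null, null); // default key/trust managers and secure random
        SSLEngine engine = ctx.createSSLEngine();
        engine.setEnabledProtocols(new String[] {"TLSv1.2"});
        return engine;
    }

    public static void main(String[] args) throws Exception {
        // Only the pinned protocol remains enabled.
        System.out.println(String.join(",", pinnedEngine().getEnabledProtocols())); // prints: TLSv1.2
    }
}
```

The -Djdk.tls.client.protocols=TLSv1.2 flag achieves the same effect process-wide without code changes, which is the more practical route for a packaged distribution.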
Seems like a bug in JDK: https://bugs.openjdk.java.net/browse/JDK-8221218
See this forum post for more details: https://forum.opensearch.org/t/cluster-does-not-initialize-javax-net-ssl-sslhandshakeexception-insufficient-buffer-remaining-for-aead-cipher-fragment/2845/5
Like others have said, this seems to be a known issue with how the JDK handles TLS:
https://bugs.openjdk.org/browse/JDK-8221218
If you look at the comments there, they suggest fixes have landed, but that is evidently not the case. It is also worth pointing out that neither of the linked fixes was actually intended to address this specific issue; I am not sure why the bug was closed as resolved when the linked changes were for separate bugs.
Further examples of the issue being known:
Oracle support page (https://support.oracle.com/knowledge/Middleware/2519569_1.html)
Applies to: Oracle WebLogic Server - Version 12.1.3.0.0 and later
Another project running into this issue:
https://forum.portswigger.net/thread/complete-proxy-failure-due-to-java-tls-bug-1e334581
Thanks for reporting this. It is a known unresolved bug in OpenJDK
One last attempt to fix this would be to look at bumping the Bouncy Castle version:
https://github.com/tkohegyi/mitmJavaProxy/issues/12
I use JDK 15 and later + org.bouncycastle/bcpkix-jdk18on/1.71.1 and I cannot reproduce it anymore.
I will try to do this and see if it is possible but I am not sure about reproducing the issue consistently so it may be challenging to test.
@LHozzan @Thrallix @VovkaSOL We've had no luck with this issue; one thing I'm trying to understand is how impactful it is to you. From our evidence, it looks like this has only happened during cluster startup. If it's a startup issue, that is unfortunate but limited in overall impact. Whereas if the issue happens intermittently on a running cluster and takes down a node, then we should invest more time. Can you help provide us with details of your reproduction?
I am seeing this issue consistently after trying to change cert providers. I did a full cluster restart and I'm getting that error on all of my nodes. I don't know if it's relevant, but the old certs we were using were RSA, while the new certs are id-ecPublicKey.
@reshippie (and anyone else experiencing this issue), could you include the operating system version / JDK version / OpenSearch distro version, and your basic cluster topology (e.g. 3 data nodes, 2 cluster managers)? Also anything interesting about your security configuration.
If you don't feel comfortable posting that information publicly, feel free to reach out to me first on our Slack instance; I'm Peter Nied
or email pet ern @ am az on .co m
(remove the spaces)
We're running:
- Debian 10.13
- OpenSearch 2.9.0 with bundled Java 17.0.7
- 6 data nodes, 3 managers, 1 coordinating node (for Dashboards)
I don't think there's anything interesting in our security config
plugins.security.ssl_cert_reload_enabled: true
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.advanced_modules_enabled: true
plugins.security.nodes_dn:
- 'CN=dashboards-*-mgmt'
- 'CN=esmaster-*-mgmt'
- 'CN=elasticsearch-*-mgmt'
- 'CN=osdata-*-mgmt'
# Transport layer TLS
plugins.security.ssl.transport.enabled: true
plugins.security.ssl.transport.pemkey_filepath: ssl/{{ ansible_hostname }}-mgmt.pk8
plugins.security.ssl.transport.pemcert_filepath: ssl/{{ ansible_hostname }}-mgmt.crt
plugins.security.ssl.transport.pemtrustedcas_filepath: ssl/{{ ansible_hostname }}-mgmt.issuer.crt
plugins.security.ssl.transport.truststore_filepath: cacerts
#
# REST layer TLS
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemkey_filepath: ssl/{{ ansible_hostname }}-mgmt.pk8
plugins.security.ssl.http.pemcert_filepath: ssl/{{ ansible_hostname }}-mgmt.crt
plugins.security.ssl.http.pemtrustedcas_filepath: ssl/{{ ansible_hostname }}-mgmt.issuer.crt
plugins.security.restapi.roles_enabled: ["admin_role", "security_rest_api_access"]
plugins.security.authcz.admin_dn: CN=DOMAIN.org
I tried the solution posted by @VovkaSOL. Adding -Djdk.tls.client.protocols=TLSv1.2 did not make the error go away.
I looked into updating the bouncycastle version as mentioned above. We would need to follow something similar to when it was moved to https://github.com/opensearch-project/OpenSearch/pull/8247
At the time, @willyborankin only bumped to 15to18 because of the multi-release jars. I don't know if it is feasible to move past that point, or whether OpenSearch can handle the later version. @willyborankin do you know?
@scrawfor99 Not sure about it, we still support JDK 1.8 build AFAIK.
@willyborankin, I think 18on will still work with 1.8. I saw that you swapped to 15to18 rather than 18on in the linked PR, though, so I was not sure whether you knew what was or was not compatible.
With the updates to Bouncy Castle, I am going to close this issue, as this is the most we can currently do to resolve the exception. Based on some other discussions, the Bouncy Castle update should help resolve the failures.
Hi @peternied. Sorry for the delayed response.
We've had no luck with this issue; one thing I'm trying to understand is how impactful it is to you. From our evidence, it looks like this has only happened during cluster startup. If it's a startup issue, that is unfortunate but limited in overall impact. Whereas if the issue happens intermittently on a running cluster and takes down a node, then we should invest more time. Can you help provide us with details of your reproduction?
In our infrastructure this problem occurs randomly across all node roles. If it occurs on only one coordinator node, the second replica keeps working; but if both replicas are hit by the problem, the whole cluster is effectively useless, no matter that the managers and data nodes are working fine. The same applies if any other roles are affected at the same time or with some delay. We have monitoring that checks whether the components in front of the OpenSearch cluster can connect to it, but it is inconvenient.
We are currently using the default community Docker image opensearchproject/opensearch:2.11.1, but only for a short time. Our clusters currently run only on AWS and Microsoft clouds, and I can observe the same problem with both providers.
Basic cluster topology (3 data nodes, 2 cluster managers). Anything interesting about your security configuration.
The problem occurs in both of the setups we use. Based on my observation, it seems to occur more often on multi-role nodes, but I do not have exact data.
@scrawfor99 OK, let's wait for the next release (2.12.x) and hopefully the problem will be fixed there. If it persists, I will let you know.
Hi @LHozzan, do you use WireGuard/IPsec as an additional encryption mechanism for the communication between nodes? If yes, the problem could be related to the WireGuard/IPsec configuration.
After installation (2 data nodes, 1 manager node) with the demo config, I updated opensearch.yml with the following:
plugins.security.ssl.transport.pemcert_filepath: tls.crt
plugins.security.ssl.transport.pemkey_filepath: tls.key
plugins.security.ssl.transport.pemtrustedcas_filepath: ca.crt
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: tls.crt
plugins.security.ssl.http.pemkey_filepath: tls.key
plugins.security.ssl.http.pemtrustedcas_filepath: ca.crt
plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn: ['CN=admin']
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: [all_access, security_rest_api_access]
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices:
- .plugins-ml-agent
- .plugins-ml-config
- .plugins-ml-connector
- .plugins-ml-controller
- .plugins-ml-model-group
- .plugins-ml-model
- .plugins-ml-task
- .plugins-ml-conversation-meta
- .plugins-ml-conversation-interactions
- .plugins-ml-memory-meta
- .plugins-ml-memory-message
- .plugins-ml-stop-words
- .opendistro-alerting-config
- .opendistro-alerting-alert*
- .opendistro-anomaly-results*
- .opendistro-anomaly-detector*
- .opendistro-anomaly-checkpoints
- .opendistro-anomaly-detection-state
- .opendistro-reports-*
- .opensearch-notifications-*
- .opensearch-notebooks
- .opensearch-observability
- .ql-datasources
- .opendistro-asynchronous-search-response*
- .replication-metadata-store
- .opensearch-knn-models
- .geospatial-ip2geo-data*
- .plugins-flow-framework-config
- .plugins-flow-framework-templates
- .plugins-flow-framework-state
plugins.security.ssl.http.enabled_protocols:
- "TLSv1.2"
plugins.security.nodes_dn:
- 'CN=node'
Then I ran
/usr/share/opensearch/plugins/opensearch-security/tools/securityadmin.sh -icl -nhnv \
-cd "/usr/share/opensearch/config/opensearch-security" \
-key "/usr/share/opensearch/config/kirk-key.pem" \
-cert "/usr/share/opensearch/config/kirk.pem" \
-cacert "/usr/share/opensearch/config/root-ca.pem"
After that point, I keep getting errors.
The following Makefile generates my keys:
keys/root-ca.key:
	mkdir -p keys;
	openssl genrsa -out keys/root-ca.key 2048;

keys/ca.crt: keys/root-ca.key
	openssl req -new -x509 -sha256 -key keys/root-ca.key -out keys/ca.crt -days 730 -subj "/CN=ca.local";

keys/admin.key:
	mkdir -p keys;
	openssl genrsa -out keys/admin-temp.key 2048;
	openssl pkcs8 -inform PEM -outform PEM -in keys/admin-temp.key -topk8 -nocrypt -v1 PBE-SHA1-3DES -out keys/admin.key
	rm keys/admin-temp.key;

keys/admin.crt: keys/admin.key keys/ca.crt keys/root-ca.key
	openssl req -new -key keys/admin.key -out keys/admin.csr -subj "/CN=admin";
	openssl x509 -req -in keys/admin.csr -CA keys/ca.crt -CAkey keys/root-ca.key -CAcreateserial -sha256 -out keys/admin.crt -days 730;
	rm keys/admin.csr;

keys/tls.key:
	openssl genrsa -out keys/tls-temp.key 2048;
	openssl pkcs8 -inform PEM -outform PEM -in keys/tls-temp.key -topk8 -nocrypt -v1 PBE-SHA1-3DES -out keys/tls.key
	rm keys/tls-temp.key;

keys/tls.crt: keys/tls.key keys/ca.crt keys/root-ca.key
	openssl req -new -key keys/tls.key -out keys/tls.csr -subj "/CN=node";
	openssl x509 -req -in keys/tls.csr -CA keys/ca.crt -CAkey keys/root-ca.key -CAcreateserial -sha256 -out keys/tls.crt -days 730;
	rm keys/tls.csr;

removeoldkeys:
	rm -rf keys;

makekeys: removeoldkeys keys/admin.key keys/admin.crt keys/tls.key keys/tls.crt keys/ca.crt
	@echo "Keys are generated.";
I am stuck here for a while, please help! 🙏
I'm seeing errors like this in the master node logs:
[2024-06-05T01:05:39,152][INFO ][o.o.s.a.s.DebugSink ] [opensearch-cluster-master-2] AUDIT_LOG: {
"audit_node_id" : "lP5ZYpVDR1O9n8EDWhKe1g",
"audit_request_layer" : "TRANSPORT",
"audit_request_exception_stacktrace" : "javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)\n\tat java.base/sun.security.ssl.Alert.createSSLException(Alert.java:130)\n\tat java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:378)\n\tat java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:321)\n\tat java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:316)\n\tat java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:134)\n\tat java.base/sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:736)\n\tat java.base/sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:691)\n\tat java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:506)\n\tat java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:482)\n\tat java.base/javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:679)\n\tat io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:310)\n\tat io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1445)\n\tat io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338)\n\tat io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387)\n\tat io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)\n\tat io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)\n\tat 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: javax.crypto.BadPaddingException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)\n\tat java.base/sun.security.ssl.SSLCipher$T13GcmReadCipherGenerator$GcmReadCipher.decrypt(SSLCipher.java:1864)\n\tat java.base/sun.security.ssl.SSLEngineInputRecord.decodeInputRecord(SSLEngineInputRecord.java:239)\n\tat java.base/sun.security.ssl.SSLEngineInputRecord.decode(SSLEngineInputRecord.java:196)\n\tat java.base/sun.security.ssl.SSLEngineInputRecord.decode(SSLEngineInputRecord.java:159)\n\tat java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:111)\n\t... 27 more\n",
"@timestamp" : "2024-06-05T01:00:55.484+00:00",
"audit_request_effective_user_is_admin" : false,
"audit_cluster_name" : "opensearch-cluster",
"audit_format_version" : 4,
"audit_node_host_address" : "10.200.2.124",
"audit_node_name" : "opensearch-cluster-master-2",
"audit_category" : "SSL_EXCEPTION",
"audit_request_origin" : "TRANSPORT",
"audit_node_host_name" : "10.200.2.124"
}
Here's the expanded stack trace:
javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:130)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:378)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:321)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:316)
at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:134)
at java.base/sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:736)
at java.base/sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:691)
at java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:506)
at java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:482)
at java.base/javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:679)
at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:310)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1445)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: javax.crypto.BadPaddingException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
at java.base/sun.security.ssl.SSLCipher$T13GcmReadCipherGenerator$GcmReadCipher.decrypt(SSLCipher.java:1864)
at java.base/sun.security.ssl.SSLEngineInputRecord.decodeInputRecord(SSLEngineInputRecord.java:239)
at java.base/sun.security.ssl.SSLEngineInputRecord.decode(SSLEngineInputRecord.java:196)
at java.base/sun.security.ssl.SSLEngineInputRecord.decode(SSLEngineInputRecord.java:159)
at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:111)
... 27 more
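For context on what this message means: in TLS 1.3, every encrypted record carries a 16-byte AES-GCM authentication tag, so a ciphertext fragment of only 2 bytes can never be valid, and the JDK rejects it before decryption, surfacing as the BadPaddingException in the trace above. The same AEAD constraint can be reproduced with the JCE directly; this is a standalone sketch of the constraint, not OpenSearch or JDK-internal code:

```java
import javax.crypto.BadPaddingException;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class AeadTagDemo {
    // Returns true when decrypting a "record" shorter than the 16-byte GCM tag fails,
    // mirroring the check the JDK applies to incoming TLS 1.3 records.
    public static boolean shortFragmentFails() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        // 128-bit (16-byte) authentication tag, 12-byte IV
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, new byte[12]));
        try {
            c.doFinal(new byte[2]); // 2-byte fragment: smaller than the tag itself
            return false;
        } catch (BadPaddingException e) {
            // AEADBadTagException extends BadPaddingException, as in the trace above
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(shortFragmentFails());
    }
}
```

This only explains why the handshake aborts; the open question in this thread is why a peer sends (or a middlebox truncates to) such a short fragment in the first place.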
I'm using container image docker.io/opensearchproject/opensearch:2.14.0@sha256:96af4ace999e20f3f74b1675e501d7dba46f2e7c185cfcffd4626898b00e6743 on linux/arm64.
I don't think this is fixed. Could someone please re-open?
The same error happened here, but what I had done that caused it was using a cert with SANs for all my cluster nodes. I've used this kind of cert for other services without any problems. I hope you fix this issue!
Expected result
Should not see errors caused by the underlying system configuration.
Additional context