Open michalvavrik opened 2 days ago
/cc @geoand (openshift), @iocanel (openshift)
/cc @yrodiere @radcortez
Bear with me, I have never used service binding, but:
> Expected behavior
> This makes using service binding in production tricky, because you never know how many attempts it will take to deploy the pod (e.g. what constellation of starts there will be, and so on). So IMO, if service binding is in place, the datasource should not be deactivated.
You are expecting the datasource to not be deactivated, even though no URL was configured? But that means the datasource won't work... Do you just want the app to start without error, but fail on first request?
> You are expecting the datasource to not be deactivated, even though no URL was configured? But that means the datasource won't work... Do you just want the app to start without error, but fail on first request?
No, I expect that the environment variables set by the Service Binding Operator are used, and that they contain the datasource URL (or a link to where the DS URL value is, I don't remember exactly how it works). There is some race.
Regarding the `triage/needs-feedback` label, @yrodiere, this test worked for years (not sure how long, 2+?). So I think it's a bug.
For context @yrodiere, our test requests an entity from the database and receives it, so no failure is expected.
> No, I expect that the environment variables set by the Service Binding Operator are used, and that they contain the datasource URL (or a link to where the DS URL value is, I don't remember exactly how it works). There is some race.
Okay. The check of whether a datasource is active (and its initialization, when it is active) now happens on CDI startup. If service binding is manipulating config concurrently with CDI startup, or as part of CDI startup (CDI bean startup order is undefined), that could explain the problem.
@iocanel could you please point us to how/where the datasource URL is set in the service binding extension(s)?
> Regarding the `triage/needs-feedback` label, @yrodiere, this test worked for years (not sure how long, 2+?). So I think it's a bug.
That label was just because I was waiting for your answer... Nobody is questioning whether it's a bug, I'm not sure why you're bringing this up.
> That label was just because I was waiting for your answer... Nobody is questioning whether it's a bug, I'm not sure why you're bringing this up.
Sorry.
> could you please point us to how/where the datasource URL is set in the service binding extension(s)?
It looks up that information in a specific file on the file system - see this for example
I think they are just config sources like any other (judging by `KubernetesConfigSourceFactoryBuilder` and `io.quarkus.kubernetes.service.binding.runtime.DatasourceServiceBindingConfigSourceFactory`). Is there a chance that the DS code that determines whether the bean is active or inactive is executed before the runtime config is ready? I'll leave the debugging to you. Thanks for looking into it. I cannot reproduce it locally, but I have 5+ failures (counted so far) in Jenkins, so if you come up with a potential fix, I suppose I can run it several times and see if it helped.
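(For readers less familiar with the mechanism, here is a minimal sketch of what such a config source factory conceptually does, assuming the Service Binding workload projection of files under `SERVICE_BINDING_ROOT` and SmallRye Config's `ConfigSourceFactory` SPI with `PropertiesConfigSource`. The binding name, file names, property mapping and ordinal are illustrative assumptions; this is not the actual `DatasourceServiceBindingConfigSourceFactory` implementation.)

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.eclipse.microprofile.config.spi.ConfigSource;

import io.smallrye.config.ConfigSourceContext;
import io.smallrye.config.ConfigSourceFactory;
import io.smallrye.config.PropertiesConfigSource;

// Hypothetical factory: reads the files projected by a service binding and
// exposes them as quarkus.datasource.* properties when the runtime config is built.
public class ExampleBindingConfigSourceFactory implements ConfigSourceFactory {

    @Override
    public Iterable<ConfigSource> getConfigSources(ConfigSourceContext context) {
        // SERVICE_BINDING_ROOT is set by the binding machinery; "/bindings" and
        // the binding directory name "my-postgres" are made-up examples.
        Path binding = Path.of(System.getenv().getOrDefault("SERVICE_BINDING_ROOT", "/bindings"))
                .resolve("my-postgres");

        Map<String, String> props = new HashMap<>();
        props.put("quarkus.datasource.username", read(binding.resolve("username")));
        props.put("quarkus.datasource.password", read(binding.resolve("password")));
        props.put("quarkus.datasource.jdbc.url", "jdbc:postgresql://"
                + read(binding.resolve("host")) + ":" + read(binding.resolve("port"))
                + "/" + read(binding.resolve("database")));

        // The ordinal (290) is arbitrary for this sketch.
        return List.<ConfigSource>of(new PropertiesConfigSource(props, "example-service-binding", 290));
    }

    private static String read(Path file) {
        try {
            return Files.readString(file).trim();
        } catch (IOException e) {
            return "";
        }
    }
}
```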
I haven't touched that code since the service binding operator got deprecated. So, I'll have to dig to refresh my memory. Also, I am wondering if we should just deprecate and drop the extension ourselves. @maxandersen thoughts?
> I think they are just config sources like any other
Correct
> Also, I am wondering if we should just deprecate and drop the extension ourselves. @maxandersen thoughts?
I was wondering the same...
Let's make sure we don't conflate the deprecation of the Service Binding Operator with support for the Service Binding APIs.
I'll check on where Service Binding API support actually stands outside of the OpenShift setup.
But that still leaves a concern about why the new "inactive" behavior somehow causes race conditions - that shouldn't happen?
@yrodiere I can help you run it against OCP if you want (give you temporary access to an OCP cluster inside the VPN), but as I said, it's flaky, it won't happen every time, so you will need patience.
> But that still leaves a concern about why the new "inactive" behavior somehow causes race conditions - that shouldn't happen?
I don't see how the new "inactive" stuff could be the root cause.
Though introducing that feature required changing when exactly we check the JDBC URL (we do it earlier now, I think during CDI initialization)... so maybe that made some pre-existing race condition between CDI init and config more visible? Seems unlikely, but at this point it's all I have :/
> @yrodiere I can help you run it against OCP if you want (give you temporary access to an OCP cluster inside the VPN), but as I said, it's flaky, it won't happen every time, so you will need patience.
I'm afraid I'll also need an ungodly amount of time... And without a debugger I'm not even sure it would help investigate.
Could you perhaps enable more logging so that we can compare what happens on your CI when the bug appears and when it doesn't? I'm thinking of enabling TRACE or at least DEBUG for `io.quarkus.datasource`, `io.quarkus.agroal`, `io.quarkus.config`, `io.quarkus.deployment.configuration`, `io.quarkus.runtime.configuration`, `io.quarkus.kubernetes.service`.
EDIT: Oh, and also `io.quarkus.arc`, `io.quarkus.deployment.bean`, `io.quarkus.runtime.bean`.
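(For reference, a sketch of what that could look like in the test application's `application.properties`, using the standard per-category log configuration; the category names are simply taken from the list above.)

```properties
# DEBUG logging for the areas involved in the suspected race
# (datasource activation, Agroal, config, service binding config source, ArC).
quarkus.log.category."io.quarkus.datasource".level=DEBUG
quarkus.log.category."io.quarkus.agroal".level=DEBUG
quarkus.log.category."io.quarkus.config".level=DEBUG
quarkus.log.category."io.quarkus.deployment.configuration".level=DEBUG
quarkus.log.category."io.quarkus.runtime.configuration".level=DEBUG
quarkus.log.category."io.quarkus.kubernetes.service".level=DEBUG
quarkus.log.category."io.quarkus.arc".level=DEBUG
quarkus.log.category."io.quarkus.deployment.bean".level=DEBUG
quarkus.log.category."io.quarkus.runtime.bean".level=DEBUG
# For TRACE, the build-time min-level of a category typically needs lowering too, e.g.:
# quarkus.log.category."io.quarkus.datasource".min-level=TRACE
# quarkus.log.category."io.quarkus.datasource".level=TRACE
```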
Hey @Ladicek @radcortez, can you confirm that configuration doesn't depend on CDI at all and thus should be initialized completely by the time CDI starts initializing beans?
I.e. do you know if the relative order between these two events is clearly defined?
Config does not depend on CDI in any way, yes, but I'm not sure if we have a defined ordering between runtime config init and runtime CDI init.
Yes, Config does not depend on CDI in any way, but we did have cases where such dependencies were added; I've removed them, and I think we don't have such cases anymore.
Remember that ArC does some things during `STATIC_INIT`, so only the `STATIC_INIT` configuration is available there. In the presented code, the `AgroalRecorder` method executes in `RUNTIME`, and the `KubernetesConfigSourceFactoryBuilder` is only set to execute for the `RUNTIME` config, so it should be fine. If the recorder method was for `STATIC_INIT`, then the issue could be there. Maybe some other previous build step in `STATIC_INIT` is expecting the Kubernetes config?
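(To make the `STATIC_INIT` vs `RUNTIME` distinction concrete, here is a minimal, hypothetical processor/recorder pair. It is not the actual Agroal or service binding code, and real build steps would produce/consume build items; it only illustrates that a method recorded for `RUNTIME_INIT` runs at application start, after the runtime config, including runtime-only config sources, has been built, while a `STATIC_INIT` one does not see those sources.)

```java
import io.quarkus.deployment.annotations.BuildStep;
import io.quarkus.deployment.annotations.ExecutionTime;
import io.quarkus.deployment.annotations.Record;
import io.quarkus.runtime.annotations.Recorder;

// Hypothetical recorder, only to illustrate the execution phases.
@Recorder
class ExampleDataSourceRecorder {

    public void prepare() {
        // Replayed during static init: only STATIC_INIT config sources are
        // available here, so values coming from a runtime-only source
        // (like the service binding one) must not be read at this point.
    }

    public void configure() {
        // Executed at application start: by now the full runtime config,
        // including config sources registered only for RUNTIME, is expected
        // to have been built.
    }
}

// Hypothetical build steps, mirroring the split described above.
class ExampleDataSourceProcessor {

    @BuildStep
    @Record(ExecutionTime.STATIC_INIT)
    void staticInit(ExampleDataSourceRecorder recorder) {
        recorder.prepare();
    }

    @BuildStep
    @Record(ExecutionTime.RUNTIME_INIT)
    void runtimeInit(ExampleDataSourceRecorder recorder) {
        recorder.configure();
    }
}
```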
Describe the bug
Now, this is very hard to reproduce for me, but our periodic builds testing service binding fail pretty often now that https://github.com/quarkusio/quarkus/pull/41929 is merged. The failure occurs both in JVM and native mode, though native mode seems to fail more often, and both Hibernate ORM and Hibernate Reactive are affected.
Expected behavior
This makes using service binding in production tricky, because you never know how many attempts it will take to deploy the pod (e.g. what constellation of starts there will be, and so on). So IMO, if service binding is in place, the datasource should not be deactivated.
Actual behavior
There are 2 different behaviors, one for Hibernate ORM and one for Hibernate Reactive. Both result in an application startup failure.
Hibernate ORM (native mode logs):
Hibernate Reactive (native mode logs):
How to Reproduce?
Steps to reproduce the behavior:
1. `git clone git@github.com:michalvavrik/quarkus-test-suite.git -b feature/sb-reactive-reproducer`
2. Enter `quarkus-test-suite/service-binding/postgresql-crunchy-reactive` (for Hibernate Reactive) or `quarkus-test-suite/service-binding/postgresql-crunchy-classic` (for Hibernate ORM)
3. Run `mvn clean verify -Dopenshift`
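(Assuming an OpenShift cluster you are already logged in to, since the `-Dopenshift` profile drives the deployment, the steps collapse to something like the following.)

```shell
git clone git@github.com:michalvavrik/quarkus-test-suite.git -b feature/sb-reactive-reproducer
# or .../postgresql-crunchy-classic for Hibernate ORM
cd quarkus-test-suite/service-binding/postgresql-crunchy-reactive
mvn clean verify -Dopenshift
```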
Output of `uname -a` or `ver`
RHEL 8
Output of `java -version`
Temurin 21
Quarkus version or git rev
999-SNAPSHOT
Build tool (ie. output of `mvnw --version` or `gradlew --version`)
Maven 3.9.4
Additional information
No response