Closed planetf1 closed 2 years ago
Tried so far
alternative jvms
Observed
Also now occuring on data manager:
> Task :open-metadata-test:open-metadata-fvt:access-services-fvt:data-engine-fvt:test
FindFVT > find(String, DataEngineClient, RepositoryService) > org.odpi.openmetadata.accessservices.dataengine.fvt.FindFVT.find(String, DataEngineClient, RepositoryService)[2] FAILED
org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException at FindFVT.java:42
> Task :open-metadata-test:open-metadata-fvt:access-services-fvt:data-manager-fvt:test FAILED
DataManagerOMASIT > testCreateEvents(String) > org.odpi.openmetadata.accessservices.datamanager.fvt.execution.DataManagerOMASIT.testCreateEvents(String)[2] STANDARD_OUT
==========================================
CreateEventsTest FVT (servergraph) failed
Number of Failed tests: 1
Captured Exceptions: 1
==> org.odpi.openmetadata.fvt.utilities.exceptions.FVTUnexpectedCondition with message getEventBroker(no GUID for Retrieve)
FVTUnexpectedCondition{testCaseName='CreateEventsTest', errorDescription='getEventBroker(no GUID for Retrieve)', caughtException=null}
DataManagerOMASIT > testCreateEvents(String) > org.odpi.openmetadata.accessservices.datamanager.fvt.execution.DataManagerOMASIT.testCreateEvents(String)[2] STANDARD_ERROR
FVTUnexpectedCondition{testCaseName='CreateEventsTest', errorDescription='getEventBroker(no GUID for Retrieve)', caughtException=null}
Running more tests in parallel it seems as if every FVT may now be failing with guid search issues. EVERY test most likely. May just be seen as I upgraded the failing image to more cpu/mem resource - so more parallel tests without timeouts.
Noting configuration on failing system:
jonesn@ubuntu-buildtest:~/src/egeria$ java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment Temurin-11.0.13+8 (build 11.0.13+8)
OpenJDK 64-Bit Server VM Temurin-11.0.13+8 (build 11.0.13+8, mixed mode)
jonesn@ubuntu-buildtest:~/src/egeria$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
jonesn@ubuntu-buildtest:~/src/egeria$ uname -a
Linux ubuntu-buildtest 5.4.0-1010-ibm #11-Ubuntu SMP Wed Dec 8 06:56:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Tips on debugging the graph.
Change properties
jonesn@ubuntu-buildtest:~/j/janusgraph-0.6.0$ cat conf/janusgraph-berkeleyje-lucene.properties | grep analytics
storage.directory=/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//berkeley
index.search.directory=/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//searchindex
follow the tips to connect, check vertices - it's then possible to connect remotely
gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182] - type ':remote console' to return to local mode
gremlin> graph
==>standardjanusgraph[berkeleyje:/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//berkeley]
gremlin> g
==>graphtraversalsource[standardjanusgraph[berkeleyje:/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//berkeley], standard]
gremlin> g.V().count()
==>42
A query on the graph (for the analytics modeling FVT) seems to show iBASEMODULE is present within the underlying graph -> https://gist.github.com/443b5f9c1f14b40ce007814b6399c9f0
@planetf1 Can you provide some info on what steps you followed to be able to debug the PR please.
Possible fix merged. Will summarise after
To summarise what happened here in terms of FVT failures
Note: the investigation was delayed by a few days due to writing & running the Egeria dojo -
In summary:
There is probably a relationship to watching inodes
I think this came about because
Over time every worker was updated, and every one failed
Using a github hosted runner we DO NOT have fine grained control over the workers - see https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners
One remaining concern is that the //exact// reason for the failure is not known - we should revisit this, possibly leave until after a further image update in case it's a more transient issue
Closing & will open a new issue to debug & revert back to Ubuntu 20.04
Reopening.
This has continued to occur. Most recently in #6450 I addressed some issues with overlapping port numbers being used in tests. Whilst that was a potential problem, it appears not to be the only cause, since intermittently PR builds have continued to fail.
The most recent failure was at
AssetConsumerOMASCreateTagIT > testCreateTag(String) > org.odpi.openmetadata.accessservices.assetconsumer.fvt.execution.AssetConsumerOMASCreateTagIT.testCreateTag(String)[1] STANDARD_OUT
[6837](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6837)
AssetGUID: accc79a7-1832-4c28-a25e-e5afd490aad3
[6838](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6838)
SchemaTypeGUID: ced2170e-12b4-454f-bf94-d5bbd98f502c
[6839](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6839)
Asset2GUID: c01a71db-158f-405c-8843-f339a592854e
[6840](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6840)
[6841](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6841)
PublicTag1GUID: 33b0b042-4f2a-4671-8fd9-cdd64ce2cff3
[6842](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6842)
[6843](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6843)
PublicTag2GUID: 68122c0f-66ec-4068-8d63-1e733a8408b3
[6844](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6844)
[6845](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6845)
10:38:51.721 [Test worker] ERROR o.o.o.c.ffdc.RESTExceptionHandler - FFDC Response: TagResponse{tag=null, exceptionClassName='org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException', exceptionCausedBy='null', actionDescription='getTag', relatedHTTPCode=400, exceptionErrorMessage='OMAG-COMMON-400-021 The identifier 950eff2e-98f2-46bb-a7f9-b1bb61d84e75 of the InformalTag passed by newUserId to method getTag of service Asset Consumer OMAS is not recognized by the serverinmem server', exceptionErrorMessageId='OMAG-COMMON-400-021', exceptionErrorMessageParameters=[950eff2e-98f2-46bb-a7f9-b1bb61d84e75, InformalTag, newUserId, getTag, Asset Consumer OMAS, serverinmem], exceptionSystemAction='The system is unable to process the request because the identifier is invalid.', exceptionUserAction='Update the caller to provide a correct asset identifier.', exceptionProperties={parameterName=guid}}
[6846](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6846)
10:38:51.753 [Test worker] ERROR o.o.o.c.ffdc.RESTExceptionHandler - Detected Invalid Parameter Exception in REST Response
[6847](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6847)
org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException: OMAG-COMMON-400-021 The identifier 950eff2e-98f2-46bb-a7f9-b1bb61d84e75 of the InformalTag passed by newUserId to method getTag of service Asset Consumer OMAS is not recognized by the serverinmem server
[6848](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6848)
at org.odpi.openmetadata.commonservices.ffdc.RESTExceptionHandler.throwInvalidParameterException(RESTExceptionHandler.java:269)
[6849](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6849)
at org.odpi.openmetadata.commonservices.ffdc.RESTExceptionHandler.detectAndThrowStandardExceptions(RESTExceptionHandler.java:147)
[6850](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6850)
at org.odpi.openmetadata.accessservices.assetconsumer.client.rest.AssetConsumerRESTClient.callInformalTagGetRESTCall(AssetConsumerRESTClient.java:170)
[6851](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6851)
at org.odpi.openmetadata.accessservices.assetconsumer.client.AssetConsumer.getTag(AssetConsumer.java:1370)
[6852](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6852)
at org.odpi.openmetadata.accessservices.assetconsumer.fvt.tagging.CreateTagTest.getTagTest(CreateTagTest.java:240)
[6853](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6853)
at org.odpi.openmetadata.accessservices.assetconsumer.fvt.tagging.CreateTagTest.runIt(CreateTagTest.java:123)
[6854](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6854)
at org.odpi.openmetadata.accessservices.assetconsumer.fvt.tagging.CreateTagTest.performFVT(CreateTagTest.java:70)
[6855](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6855)
at
Hvaing checked the Asset Consumer FVT code, the exception above is normal. It occurs when a tag is not found -- and in this case, this is exactly what the FVT is verifying (perhaps not ideal that we get an exception in a not found case)
Therefore the error is NOT related to the exceptions above.
Instead the failure point seems to be around:
CategoryIT > testCategory(String) > org.odpi.openmetadata.accessservices.subjectarea.fvt.junit.CategoryIT.testCategory(String)[1] STANDARD_OUT
[7849](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7849)
CategoryFVT runIt started
[7850](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7850)
The message received from the daemon indicates that the daemon has disappeared.
[7851](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7851)
[7852](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7852)
Build request sent: Build{id=94798d2d-5003-4b65-8a8e-1911572cdb64, currentDir=/home/runner/work/egeria/egeria}
[7853](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7853)
Attempting to read last messages from the daemon log...
[7854](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7854)
Daemon pid: 1650
[7855](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7855)
log file: /home/runner/.gradle/daemon/7.4.2/daemon-1650.out.log
[7856](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7856)
----- Last 20 lines from daemon log file - daemon-1650.out.log -----
[7857](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7857)
14:42:04,181 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@23afc725 - URL [jar:file:/home/runner/work/egeria/egeria/open-metadata-resources/open-metadata-samples/access-services-samples/subject-area-client-samples/subject-area-definition-sample/build/libs/subject-area-definition-sample-3.9-SNAPSHOT.jar!/logback.xml] is not of type file
[7858](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7858)
14:42:04,484 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
[7859](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7859)
14:42:04,493 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
[7860](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7860)
14:42:04,550 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
[7861](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7861)
14:42:04,576 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
[7862](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7862)
14:42:04,772 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to OFF
[7863](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7863)
14:42:04,772 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
[7864](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7864)
14:42:04,773 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
[7865](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7865)
14:42:04,777 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@5d3f99d7 - Registering current configuration as safe fallback point
[7866](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7866)
which is the daemon appearing.
The most likely cause of this is resource exhaustion - max processes, files, memory etc
Will disable to gradle build daemon to see if this helps (or provides a clearer error)
change has resulted in clean builds so far. Closing for now and will re-open if the issue returns
Reopening as this is still ongoing (ie #6611 )
Latest change = stable builds for now Closing & will continue to monitor
Occured another twice today, again the gradle daemon disappeared.
Looked through recent gradle changes and issues, and https://github.com/gradle/gradle/issues/19750 seems a possible explanation.
We are setting jvm opts -- and perhaps it may be better not to. This will override gradle's curated defaults. Combined with leaving parallel builds etc off, we should try this.
Cannot be reproduced in an external environment, only on the gradle runners, so this will have to be another exploratory change.
Build logs continue to show a similar failure, as if the JVM - or the gradle process has gone away.
Tried
both of these have been backed off. Harmless but not addressing issue.
Then
tests have been stable, so closing
The gradle PR build is failing with errors such as:
or
This was first seen with PR #6116 -see https://github.com/odpi/egeria/actions/runs/1674299644
At the time of this initial failure other PRs were continuing to pass. This continued for up to 3 days, including the last good build at https://github.com/odpi/egeria/actions/runs/1687765378
Subsequently all PR checks are now failing