odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0
808 stars 261 forks source link

Gradle PR build failing #6130

Closed planetf1 closed 2 years ago

planetf1 commented 2 years ago

The gradle PR build is failing with errors such as:

AssetConsumerOMASCreateTagIT > testCreateTag(String) > org.odpi.openmetadata.accessservices.assetconsumer.fvt.execution.AssetConsumerOMASCreateTagIT.testCreateTag(String)[1] STANDARD_OUT
7785
    AssetGUID: 4358845d-1017-495c-b944-d9b8a33604da
7786
    SchemaTypeGUID: acd45be1-5e35-4415-92bc-82a00567b89b
7787
    Asset2GUID: 8d9c1bde-c967-4743-b6d0-063aacd2b2a4
7788

7789
    11:11:01.864 [Test worker] ERROR o.o.o.c.ffdc.RESTExceptionHandler - FFDC Response: TagResponse{tag=null, exceptionClassName='org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException', exceptionCausedBy='null', actionDescription='getTag', relatedHTTPCode=400, exceptionErrorMessage='OMAG-COMMON-400-021 The identifier 87b16829-7bad-4181-8ab5-36df81c54703 of the InformalTag passed by newUserId to method getTag of service Asset Consumer OMAS is not recognized by the serverinmem server', exceptionErrorMessageId='OMAG-COMMON-400-021', exceptionErrorMessageParameters=[87b16829-7bad-4181-8ab5-36df81c54703, InformalTag, newUserId, getTag, Asset Consumer OMAS, serverinmem], exceptionSystemAction='The system is unable to process the request because the identifier is invalid.', exceptionUserAction='Update the caller to provide a correct asset identifier.', exceptionProperties={parameterName=guid}}
7790
    11:11:01.883 [Test worker] ERROR o.o.o.c.ffdc.RESTExceptionHandler - Detected Invalid Parameter Exception in REST Response
7791
    org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException: OMAG-COMMON-400-021 The identifier 87b16829-7bad-4181-8ab5-36df81c54703 of the InformalTag passed by newUserId to method getTag of service Asset Consumer OMAS is not recognized by the serverinmem server

or

AnalyticsModelingOMASSynchronizationIT > testSynchronization(String) > org.odpi.openmetadata.accessservices.analyticsmodeling.fvt.execution.AnalyticsModelingOMASSynchronizationIT.testSynchronization(String)[2] STANDARD_ERROR
8155
    FVTUnexpectedCondition{testCaseName='SynchronizationTest', errorDescription='testUpdateArtifact', caughtException=PropertyServerException{reportedHTTPCode=400, reportingClassName='org.odpi.openmetadata.commonservices.ffdc.RESTExceptionHandler', reportingActionDescription='updateArtifact', reportedErrorMessage='OMAG-COMMON-400-016 An unexpected org.odpi.openmetadata.accessservices.analyticsmodeling.ffdc.exceptions.AnalyticsModelingCheckedException exception was caught by updateArtifact; error message was OMAS-ANALYTICS-MODELING-031 User testUser failed to update artifact iBASEMODULE. The artifact is unknown.', reportedErrorMessageId='OMAG-COMMON-400-016', reportedErrorMessageParameters=[org.odpi.openmetadata.accessservices.analyticsmodeling.ffdc.exceptions.AnalyticsModelingCheckedException, updateArtifact, OMAS-ANALYTICS-MODELING-031 User testUser failed to update artifact iBASEMODULE. The artifact is unknown.], reportedSystemAction='The system is unable to process the request and has returned an exception to the caller.', reportedUserAction='Review the error message.  Also look up its full message definition which includes the system action and user action.  This is most likely to describe the correct action to take to resolve the error.  If that does not help, look for other diagnostics created at the same time.  Also validate that the caller is a valid client of this server and is operating correctly.', reportedCaughtException=null, reportedCaughtExceptionClassName='org.odpi.openmetadata.accessservices.analyticsmodeling.ffdc.exceptions.AnalyticsModelingCheckedException', relatedProperties=null}}
8156
        at org.odpi.openmetadata.accessservices.analyticsmodeling.fvt.synchronization.SynchronizationTest.testUpdateArtifact(SynchronizationTest.java:149)

This was first seen with PR #6116 -see https://github.com/odpi/egeria/actions/runs/1674299644

At the time of this initial failure other PRs were continuing to pass. This continued for up to 3 days, including the last good build at https://github.com/odpi/egeria/actions/runs/1687765378

Subsequently all PR checks are now failing

planetf1 commented 2 years ago

Tried so far

planetf1 commented 2 years ago

Also now occuring on data manager:

> Task :open-metadata-test:open-metadata-fvt:access-services-fvt:data-engine-fvt:test

FindFVT > find(String, DataEngineClient, RepositoryService) > org.odpi.openmetadata.accessservices.dataengine.fvt.FindFVT.find(String, DataEngineClient, RepositoryService)[2] FAILED
    org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException at FindFVT.java:42

> Task :open-metadata-test:open-metadata-fvt:access-services-fvt:data-manager-fvt:test FAILED

DataManagerOMASIT > testCreateEvents(String) > org.odpi.openmetadata.accessservices.datamanager.fvt.execution.DataManagerOMASIT.testCreateEvents(String)[2] STANDARD_OUT
    ==========================================
    CreateEventsTest FVT (servergraph) failed
    Number of Failed tests: 1
    Captured Exceptions: 1
     ==> org.odpi.openmetadata.fvt.utilities.exceptions.FVTUnexpectedCondition with message getEventBroker(no GUID for Retrieve)
    FVTUnexpectedCondition{testCaseName='CreateEventsTest', errorDescription='getEventBroker(no GUID for Retrieve)', caughtException=null}

DataManagerOMASIT > testCreateEvents(String) > org.odpi.openmetadata.accessservices.datamanager.fvt.execution.DataManagerOMASIT.testCreateEvents(String)[2] STANDARD_ERROR
    FVTUnexpectedCondition{testCaseName='CreateEventsTest', errorDescription='getEventBroker(no GUID for Retrieve)', caughtException=null}
planetf1 commented 2 years ago

Running more tests in parallel it seems as if every FVT may now be failing with guid search issues. EVERY test most likely. May just be seen as I upgraded the failing image to more cpu/mem resource - so more parallel tests without timeouts.

planetf1 commented 2 years ago

Noting configuration on failing system:

jonesn@ubuntu-buildtest:~/src/egeria$ java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment Temurin-11.0.13+8 (build 11.0.13+8)
OpenJDK 64-Bit Server VM Temurin-11.0.13+8 (build 11.0.13+8, mixed mode)
jonesn@ubuntu-buildtest:~/src/egeria$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
jonesn@ubuntu-buildtest:~/src/egeria$ uname -a
Linux ubuntu-buildtest 5.4.0-1010-ibm #11-Ubuntu SMP Wed Dec 8 06:56:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
planetf1 commented 2 years ago

Tips on debugging the graph.

Change properties

jonesn@ubuntu-buildtest:~/j/janusgraph-0.6.0$ cat conf/janusgraph-berkeleyje-lucene.properties | grep analytics
storage.directory=/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//berkeley
index.search.directory=/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//searchindex

follow the tips to connect, check vertices - it's then possible to connect remotely

gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182] - type ':remote console' to return to local mode
gremlin> graph
==>standardjanusgraph[berkeleyje:/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//berkeley]
gremlin> g
==>graphtraversalsource[standardjanusgraph[berkeleyje:/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//berkeley], standard]
gremlin> g.V().count()
==>42
planetf1 commented 2 years ago

A query on the graph (for the analytics modeling FVT) seems to show iBASEMODULE is present within the underlying graph -> https://gist.github.com/443b5f9c1f14b40ce007814b6399c9f0

wbittles commented 2 years ago

@planetf1 Can you provide some info on what steps you followed to be able to debug the PR please.

planetf1 commented 2 years ago

Possible fix merged. Will summarise after

planetf1 commented 2 years ago

To summarise what happened here in terms of FVT failures

Note: the investigation was delayed by a few days due to writing & running the Egeria dojo -

In summary:

planetf1 commented 2 years ago

Reopening.

This has continued to occur. Most recently in #6450 I addressed some issues with overlapping port numbers being used in tests. Whilst that was a potential problem, it appears not to be the only cause, since intermittently PR builds have continued to fail.

The most recent failure was at

AssetConsumerOMASCreateTagIT > testCreateTag(String) > org.odpi.openmetadata.accessservices.assetconsumer.fvt.execution.AssetConsumerOMASCreateTagIT.testCreateTag(String)[1] STANDARD_OUT
[6837](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6837)
    AssetGUID: accc79a7-1832-4c28-a25e-e5afd490aad3
[6838](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6838)
    SchemaTypeGUID: ced2170e-12b4-454f-bf94-d5bbd98f502c
[6839](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6839)
    Asset2GUID: c01a71db-158f-405c-8843-f339a592854e
[6840](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6840)

[6841](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6841)
    PublicTag1GUID: 33b0b042-4f2a-4671-8fd9-cdd64ce2cff3
[6842](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6842)

[6843](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6843)
    PublicTag2GUID: 68122c0f-66ec-4068-8d63-1e733a8408b3
[6844](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6844)

[6845](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6845)
    10:38:51.721 [Test worker] ERROR o.o.o.c.ffdc.RESTExceptionHandler - FFDC Response: TagResponse{tag=null, exceptionClassName='org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException', exceptionCausedBy='null', actionDescription='getTag', relatedHTTPCode=400, exceptionErrorMessage='OMAG-COMMON-400-021 The identifier 950eff2e-98f2-46bb-a7f9-b1bb61d84e75 of the InformalTag passed by newUserId to method getTag of service Asset Consumer OMAS is not recognized by the serverinmem server', exceptionErrorMessageId='OMAG-COMMON-400-021', exceptionErrorMessageParameters=[950eff2e-98f2-46bb-a7f9-b1bb61d84e75, InformalTag, newUserId, getTag, Asset Consumer OMAS, serverinmem], exceptionSystemAction='The system is unable to process the request because the identifier is invalid.', exceptionUserAction='Update the caller to provide a correct asset identifier.', exceptionProperties={parameterName=guid}}
[6846](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6846)
    10:38:51.753 [Test worker] ERROR o.o.o.c.ffdc.RESTExceptionHandler - Detected Invalid Parameter Exception in REST Response
[6847](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6847)
    org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException: OMAG-COMMON-400-021 The identifier 950eff2e-98f2-46bb-a7f9-b1bb61d84e75 of the InformalTag passed by newUserId to method getTag of service Asset Consumer OMAS is not recognized by the serverinmem server
[6848](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6848)
        at org.odpi.openmetadata.commonservices.ffdc.RESTExceptionHandler.throwInvalidParameterException(RESTExceptionHandler.java:269)
[6849](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6849)
        at org.odpi.openmetadata.commonservices.ffdc.RESTExceptionHandler.detectAndThrowStandardExceptions(RESTExceptionHandler.java:147)
[6850](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6850)
        at org.odpi.openmetadata.accessservices.assetconsumer.client.rest.AssetConsumerRESTClient.callInformalTagGetRESTCall(AssetConsumerRESTClient.java:170)
[6851](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6851)
        at org.odpi.openmetadata.accessservices.assetconsumer.client.AssetConsumer.getTag(AssetConsumer.java:1370)
[6852](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6852)
        at org.odpi.openmetadata.accessservices.assetconsumer.fvt.tagging.CreateTagTest.getTagTest(CreateTagTest.java:240)
[6853](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6853)
        at org.odpi.openmetadata.accessservices.assetconsumer.fvt.tagging.CreateTagTest.runIt(CreateTagTest.java:123)
[6854](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6854)
        at org.odpi.openmetadata.accessservices.assetconsumer.fvt.tagging.CreateTagTest.performFVT(CreateTagTest.java:70)
[6855](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6855)
        at
planetf1 commented 2 years ago

Hvaing checked the Asset Consumer FVT code, the exception above is normal. It occurs when a tag is not found -- and in this case, this is exactly what the FVT is verifying (perhaps not ideal that we get an exception in a not found case)

Therefore the error is NOT related to the exceptions above.

Instead the failure point seems to be around:

CategoryIT > testCategory(String) > org.odpi.openmetadata.accessservices.subjectarea.fvt.junit.CategoryIT.testCategory(String)[1] STANDARD_OUT
[7849](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7849)
    CategoryFVT runIt started
[7850](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7850)
The message received from the daemon indicates that the daemon has disappeared.
[7851](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7851)

[7852](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7852)
Build request sent: Build{id=94798d2d-5003-4b65-8a8e-1911572cdb64, currentDir=/home/runner/work/egeria/egeria}
[7853](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7853)
Attempting to read last messages from the daemon log...
[7854](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7854)
Daemon pid: 1650
[7855](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7855)
  log file: /home/runner/.gradle/daemon/7.4.2/daemon-1650.out.log
[7856](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7856)
----- Last  20 lines from daemon log file - daemon-1650.out.log -----
[7857](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7857)
    14:42:04,181 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@23afc725 - URL [jar:file:/home/runner/work/egeria/egeria/open-metadata-resources/open-metadata-samples/access-services-samples/subject-area-client-samples/subject-area-definition-sample/build/libs/subject-area-definition-sample-3.9-SNAPSHOT.jar!/logback.xml] is not of type file
[7858](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7858)
    14:42:04,484 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
[7859](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7859)
    14:42:04,493 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
[7860](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7860)
    14:42:04,550 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
[7861](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7861)
    14:42:04,576 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
[7862](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7862)
    14:42:04,772 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to OFF
[7863](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7863)
    14:42:04,772 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
[7864](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7864)
    14:42:04,773 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
[7865](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7865)
    14:42:04,777 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@5d3f99d7 - Registering current configuration as safe fallback point
[7866](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7866)

which is the daemon appearing.

The most likely cause of this is resource exhaustion - max processes, files, memory etc

Will disable to gradle build daemon to see if this helps (or provides a clearer error)

planetf1 commented 2 years ago

change has resulted in clean builds so far. Closing for now and will re-open if the issue returns

planetf1 commented 2 years ago

Reopening as this is still ongoing (ie #6611 )

planetf1 commented 2 years ago

Latest change = stable builds for now Closing & will continue to monitor

planetf1 commented 2 years ago

Occured another twice today, again the gradle daemon disappeared.

Looked through recent gradle changes and issues, and https://github.com/gradle/gradle/issues/19750 seems a possible explanation.

We are setting jvm opts -- and perhaps it may be better not to. This will override gradle's curated defaults. Combined with leaving parallel builds etc off, we should try this.

Cannot be reproduced in an external environment, only on the gradle runners, so this will have to be another exploratory change.

planetf1 commented 2 years ago

Build logs continue to show a similar failure, as if the JVM - or the gradle process has gone away.

Tried

both of these have been backed off. Harmless but not addressing issue.

Then

planetf1 commented 2 years ago

tests have been stable, so closing