stavrosg11 closed this issue 9 months ago
Validated: S3 ingestion + profiler + test suite work as expected. The logs look like the API issue that is resolved in the latest stable release 👍
Thank you for the validation, @ayush-shah.
I'm not sure which API issue you are referring to, but unfortunately the prefix problem seems to persist in version 1.2.4. Am I using the wrong version?
In verifying this bug just now, I created two buckets (examplewithoutfolder and examplewithfolder) and a folderpath subdirectory in the latter. examplewithoutfolder ran successfully, while examplewithfolder failed to finish. Please find the attached logs.
logs_with_folder.txt
logs_without_folder.txt
This trivial example is relevant to my data enrichment Spark processes: Spark tends to write its output into a given directory rather than a single file, as sketched below. Thus, with a prefix or folder in the table name, the validation tests fail for Spark enrichment output runs.
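For context, here is a minimal PySpark sketch (with hypothetical data, using the bucket/path names from this example) of why the output lands in a subdirectory: DataFrameWriter always produces a directory of part files, never a single bare CSV file.

from pyspark.sql import SparkSession

# Minimal illustration; the rows and paths are made up for this example.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["username"])

# Spark writes a directory of part-0000*.csv files under the folderpath/
# prefix; there is no way to get a single bare username.csv this way.
df.write.mode("overwrite").csv("s3a://examplewithfolder/folderpath/")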
I might add that the failing test suite shows neither SUCCESS nor FAILED for the pipeline under the Data Quality view. The failure is shown only in the log of the latest run.
@stavrosg11 based on the logs I see: [2024-01-23T10:02:02.036+0000] {backend.py:86} ERROR - Not Authorized! Invalid Token. Can you try re-deploying the pipeline for that test suite?
@ayush-shah I get that error in every test suite run, whether successful or unsuccessful.
Hmm. Curiously, after redeployment I get a pop-up notification: ExampleWithFolder.withfolder.examplewithfolder."folderpath not found. That is, the full table name ("folderpath/username.csv") is not shown. After refreshing and redeploying, running the test suite fails just as before, without a FAILED label.
That seems weird; can you send the logs for the same as well?
I cannot find any log entries relating to UI notifications. However, I can see that the names of test cases with subdirectories (that is, with slashes) are truncated on the main page (please find the attached screen capture). Clicking the name/link leads nowhere. That cannot be right.
When I retrieve the pipeline info through the REST API, I see that the fully qualified name contains an un-escaped/un-encoded slash character, while the quotes are escaped/encoded. Obviously, a slash is a valid URL character, but it does not behave like any other character: it separates path segments.
...
"service": {
"id": "283ae5d4-96e8-40fa-927c-9763019248ea",
"type": "testSuite",
"name": "ExampleWithFolder.withfolder.examplewithfolder.\"folderpath/username.csv\".testSuite",
"fullyQualifiedName": "ExampleWithFolder.withfolder.examplewithfolder.\"folderpath/username.csv\".testSuite",
"deleted": false,
"href": "http://localhost:8585/api/v1/dataQuality/testSuites/283ae5d4-96e8-40fa-927c-9763019248ea"
}, ...
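To illustrate the asymmetry (a hedged sketch of URL encoding in general, not of what the server actually does): Python's urllib.parse.quote treats '/' as safe by default, precisely because a slash separates path segments; only quote(..., safe='') maps it to %2F so the FQN stays a single segment.

from urllib.parse import quote

# FQN as returned by the API above.
fqn = 'ExampleWithFolder.withfolder.examplewithfolder."folderpath/username.csv".testSuite'

# Default encoding leaves '/' alone, so the FQN splits into two path
# segments on the server side:
print(quote(fqn))
# ExampleWithFolder.withfolder.examplewithfolder.%22folderpath/username.csv%22.testSuite

# safe='' also maps '/' to %2F, keeping the FQN one segment:
print(quote(fqn, safe=''))
# ExampleWithFolder.withfolder.examplewithfolder.%22folderpath%2Fusername.csv%22.testSuite

Whether the server would even accept an encoded slash is a separate question; Jetty-based servers often reject %2F in paths by default, so this only shows where the raw slash breaks the route.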
In the log file of the failed test suite there is a URL for a pipeline status request that results in an HTTP 404.
curl -v -XGET -H "Authorization: Bearer $OMDTOKEN" http://openmetadata:8585/api/v1/services/ingestionPipelines/ExampleWithFolder.withfolder.examplewithfolder.%22folderpath/username.csv%22.testSuite.folderpath_username_csv_TestSuite/pipelineStatus
[openmetadata.log]
INFO [2024-01-24 08:23:44,564] [dw-6158 - GET /api/v1/services/ingestionPipelines/name/ExampleWithFolder.withfolder.examplewithfolder.%22folderpath/username.csv%22.testSuite] o.o.s.e.CatalogGenericExceptionMapper - exception
javax.ws.rs.NotFoundException: HTTP 404 Not Found
If I run the same request against the successful test case WITHOUT a subdirectory, I get a null pointer exception. So the service is found, but I am using the endpoint wrong; I should probably be providing a timestamp, but I cannot find the spec for this REST API, so I am just poking around. The point is: the service is found, i.e., the URL seems to be valid when there is no slash in the name.
curl -v -XGET -H "Authorization: Bearer $OMDTOKEN" http://openmetadata:8585/api/v1/services/ingestionPipelines/ExampleWithoutFolder.withoutfolder.examplewithoutfolder.%22username.csv%22.testSuite.username_csv_TestSuite/pipelineStatus
[openmetadata.log]
ERROR [2024-01-24 08:25:48,124] [dw-6162 - GET /api/v1/services/ingestionPipelines/ExampleWithoutFolder.withoutfolder.examplewithoutfolder.%22username.csv%22.testSuite.username_csv_TestSuite/pipelineStatus] o.o.s.r.s.i.IngestionPipelineResource - Got exception: [NullPointerException] / message [startTs is marked non-null but is null] / related resource location: [org.openmetadata.service.resources.services.ingestionpipelines.IngestionPipelineResource.listPipelineStatuses](IngestionPipelineResource.java:794)
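For what it's worth, the exception message suggests that listPipelineStatuses wants a startTs (and, by symmetry, presumably an endTs) query parameter. Below is my guessed probe, assuming epoch-millisecond timestamps; I have not verified this against any API spec, so everything beyond startTs being required is an assumption.

import os
import requests

# Guess based on the NullPointerException above: startTs is required, and
# endTs presumably pairs with it; epoch milliseconds assumed (unverified).
BASE = "http://openmetadata:8585/api/v1/services/ingestionPipelines"
FQN = ("ExampleWithoutFolder.withoutfolder.examplewithoutfolder."
       "%22username.csv%22.testSuite.username_csv_TestSuite")

resp = requests.get(
    f"{BASE}/{FQN}/pipelineStatus",
    headers={"Authorization": f"Bearer {os.environ['OMDTOKEN']}"},
    params={"startTs": 1706054400000, "endTs": 1706140800000},
)
print(resp.status_code, resp.text)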
If I add yet another S3 MinIO bucket containing a data file with a hash character (#) in its name, the processing yields yet another error.
INFO [2024-01-24 10:47:02,168] [dw-6598 - POST /api/v1/services/ingestionPipelines/deploy/e6ede076-f9e8-4b5b-ad26-f990b5324a02] o.o.s.j.EntityRepository - Updated ingestionPipeline:e6ede076-f9e8-4b5b-ad26-f990b5324a02:ExampleWithHash.default.examplewithhash."user#name.csv".testSuite.user_name_csv_TestSuite
INFO [2024-01-24 10:47:18,596] [dw-6632 - PUT /api/v1/services/ingestionPipelines/ExampleWithHash.default.examplewithhash.%22user] o.o.s.e.CatalogGenericExceptionMapper - exception
javax.ws.rs.NotAllowedException: HTTP 405 Method Not Allowed
...
Caused by: java.lang.IllegalArgumentException: UUID string too large
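The truncated PUT path above (ending in %22user) hints at what happened: the client encoded the quote as %22 but left the # raw, and a raw # starts the URL fragment, so everything after it is cut off before routing. A small demonstration of the general mechanism, with a hypothetical URL:

from urllib.parse import quote, urlsplit

# A raw '#' starts the fragment, so the path ends right before it:
url = ("http://openmetadata:8585/api/v1/services/ingestionPipelines/"
       "ExampleWithHash.default.examplewithhash.%22user#name.csv%22.testSuite")
print(urlsplit(url).path)
# /api/v1/services/ingestionPipelines/ExampleWithHash.default.examplewithhash.%22user

# Encoding with safe='' would map '#' to %23 and keep the name intact:
print(quote('user#name.csv', safe=''))
# user%23name.csv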
Filenames with hashes are bad practice, but the purpose was to test whether OpenMetadata encodes the filenames or not. Apparently, it does not. In my initial bug report, I presumed the slash deriving from a subdirectory was the cause of the problem.
If this problem cannot be reproduced, then I have done something wrong with my run-of-the-mill all-defaults installation.
@stavrosg11 we have modified a lot of things on our latest main, which will be part of the 1.3.0 release. If you want to test it out, you can check out main, start the server locally, and test the same thing; the links in the UI will also work, as we made changes to the entity link.
Alright, thank you for your help. 👍
Affected module
The data quality ingestion of a Datalake service.

Describe the bug
When a database of S3 Datalake type is created with a prefix (i.e., the data resides in a subdirectory), the ingestion run fails to start.

To Reproduce
The test fails to start. Perhaps the slash contained in the table name results in an HTTP 500 error (see below). Tests for a similar S3 Datalake source without a subdirectory, i.e., a database where the data resides at the bucket root, run fine.

Expected behavior
One would expect the tests to complete, but the tests fail to start.

Version:
openmetadata-ingestion[docker]==XYZ

Additional context