open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0

S3 Datalake test suite ingestions fail if the database contains a prefix #13045

Closed: stavrosg11 closed this issue 9 months ago

stavrosg11 commented 1 year ago

Affected module: Data quality ingestion of a Datalake service.

Describe the bug: When a database of S3 Datalake type is created with a prefix (i.e., the data resides in a subdirectory), the test suite ingestion run fails to start.

To Reproduce

  1. Create an S3 bucket (on Minio) with a subdirectory. Upload a CSV file to that directory.
  2. Create a database service in OpenMetadata for the S3 database. Add the subdirectory as a prefix.
  3. Run metadata and profiler ingestions to read the data as a table.
  4. Add test for the table.
  5. Add ingestion for the test.
  6. Run the test ingestion.

The test fails to start.

Perhaps the slash contained in the table name results in the HTTP 500 error (see the log below). Tests for a similar S3 Datalake source without a subdirectory, i.e., a database where the data resides at the bucket root, run fine.
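If that theory holds, it may come down to how standard URL encoders treat the slash. Python's urllib.parse.quote, for instance, leaves "/" unencoded by default (it is in the safe set), so a quoted table name gets its double quotes percent-encoded but not its slash, which then splits the path segment. A quick illustration using the table name from the log below:

```python
from urllib.parse import quote

name = 'titanicraw."raw/titanic.csv"'

# Default behavior: safe="/" means the slash survives unencoded,
# while the double quotes become %22.
assert quote(name) == 'titanicraw.%22raw/titanic.csv%22'

# With safe="" the slash is percent-encoded too, keeping the
# whole name inside one path segment.
assert quote(name, safe="") == 'titanicraw.%22raw%2Ftitanic.csv%22'
```

The first form matches the %22...raw/titanic.csv...%22 pattern visible in the failing URL in the traceback.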

[2023-08-31T05:32:53.644+0000] {taskinstance.py:1308} INFO - Starting attempt 1 of 1
[2023-08-31T05:32:53.665+0000] {taskinstance.py:1327} INFO - Executing <Task(PythonOperator): test_suite_task> on 2023-08-31 05:32:38+00:00
[2023-08-31T05:32:53.669+0000] {standard_task_runner.py:57} INFO - Started process 112 to run task
[2023-08-31T05:32:53.672+0000] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', 'raw_titanic_csv_TestSuite', 'test_suite_task', 'manual__2023-08-31T05:32:38+00:00', '--job-id', '164', '--raw', '--subdir', 'DAGS_FOLDER/raw_titanic_csv_TestSuite.py', '--cfg-path', '/tmp/tmpg2w8zilg']
[2023-08-31T05:32:53.672+0000] {standard_task_runner.py:85} INFO - Job 164: Subtask test_suite_task
[2023-08-31T05:32:53.984+0000] {task_command.py:410} INFO - Running <TaskInstance: raw_titanic_csv_TestSuite.test_suite_task manual__2023-08-31T05:32:38+00:00 [running]> on host raw-titanic-csv-testsuite-test-suite-task-ndd4g8ys
[2023-08-31T05:32:55.117+0000] {pod_generator.py:529} WARNING - Model file /opt/airflow/pod_templates/pod_template.yaml does not exist
[2023-08-31T05:32:55.161+0000] {taskinstance.py:1545} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='openmetadata' AIRFLOW_CTX_DAG_ID='raw_titanic_csv_TestSuite' AIRFLOW_CTX_TASK_ID='test_suite_task' AIRFLOW_CTX_EXECUTION_DATE='2023-08-31T05:32:38+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2023-08-31T05:32:38+00:00'
[2023-08-31T05:32:55.187+0000] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/openmetadata_managed_apis/workflows/ingestion/test_suite.py", line 47, in test_suite_workflow
    workflow = TestSuiteWorkflow.create(config)
  File "/home/airflow/.local/lib/python3.9/site-packages/metadata/data_quality/api/workflow.py", line 142, in create
    return cls(config)
  File "/home/airflow/.local/lib/python3.9/site-packages/metadata/data_quality/api/workflow.py", line 114, in __init__
    self.set_ingestion_pipeline_status(state=PipelineState.running)
  File "/home/airflow/.local/lib/python3.9/site-packages/metadata/workflow/workflow_status_mixin.py", line 78, in set_ingestion_pipeline_status
    self.metadata.create_or_update_pipeline_status(
  File "/home/airflow/.local/lib/python3.9/site-packages/metadata/ingestion/ometa/mixins/ingestion_pipeline_mixin.py", line 47, in create_or_update_pipeline_status
    resp = self.client.put(
  File "/home/airflow/.local/lib/python3.9/site-packages/metadata/ingestion/ometa/client.py", line 288, in put
    return self._request("PUT", path, data)
  File "/home/airflow/.local/lib/python3.9/site-packages/metadata/ingestion/ometa/client.py", line 189, in _request
    return self._one_request(method, url, opts, retry)
  File "/home/airflow/.local/lib/python3.9/site-packages/metadata/ingestion/ometa/client.py", line 212, in _one_request
    resp.raise_for_status()
  File "/home/airflow/.local/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://openmetadata:8585/api/v1/services/ingestionPipelines/S3MinioTitanicRaw.TitanicRaw.titanicraw.%22raw/titanic.csv%22.testSuite.raw_titanic_csv_TestSuite/pipelineStatus
[2023-08-31T05:32:55.194+0000] {taskinstance.py:1345} INFO - Marking task as FAILED. dag_id=raw_titanic_csv_TestSuite, task_id=test_suite_task, execution_date=20230831T053238, start_date=20230831T053253, end_date=20230831T053255
[2023-08-31T05:32:55.194+0000] {common.py:311} INFO - Sending failed status from callback...
[2023-08-31T05:32:55.208+0000] {common.py:317} INFO - Sending status to Ingestion Pipeline S3MinioTitanicRaw.TitanicRaw.titanicraw."raw/titanic.csv".testSuite.raw_titanic_csv_TestSuite
[2023-08-31T05:32:55.225+0000] {standard_task_runner.py:104} ERROR - Failed to execute job 164 for task test_suite_task ('functools.partial' object has no attribute '__module__'; 112)
[2023-08-31T05:32:55.246+0000] {local_task_job_runner.py:225} INFO - Task exited with return code 1
[2023-08-31T05:32:55.259+0000] {taskinstance.py:2653} INFO - 0 downstream tasks scheduled from follow-on schedule check

Expected behavior: The tests should run to completion; instead, they fail to start.


ayush-shah commented 9 months ago

Validated: S3 ingestion + profiler + test suite works as expected. The logs look like an API issue that is resolved in the latest stable release 👍

stavrosg11 commented 9 months ago

Thank you for the validation, @ayush-shah.

I'm not sure which API issue you are referring to, but unfortunately the prefix problem seems to persist in version 1.2.4. Am I using the wrong version?

In verifying this bug just now,

  1. I created two buckets on Minio S3 (i.e., buckets examplewithoutfolder and examplewithfolder) and
  2. placed the same trivial CSV file in each of them, in the root of the bucket in the former and in a folderpath subdirectory in the latter.
  3. I created new services of type Datalake for each of them in OpenMetadata.
  4. I added metadata and profiling ingestion for both of them successfully.
  5. I added trivial row count tests for each of the tables, and examplewithoutfolder ran successfully while examplewithfolder failed to finish. Please find the attached logs. logs_with_folder.txt logs_without_folder.txt

This trivial example is relevant to my Spark data enrichment processes: Spark tends to write output into a directory rather than a single file. Thus, because the table name contains a prefix or folder, the validation tests fail for Spark enrichment outputs.

stavrosg11 commented 9 months ago

I might add that the failing test suite shows neither SUCCESS nor FAILED for the pipeline under the Data Quality view. The failure is shown only in the log of the latest run.

ayush-shah commented 9 months ago

@stavrosg11 based on the logs I see: [2024-01-23T10:02:02.036+0000] {backend.py:86} ERROR - Not Authorized! Invalid Token — can you try redeploying the pipeline for that test suite?

stavrosg11 commented 9 months ago

@ayush-shah I get that error in every test suite run, whether successful or not.

Hmm. Curiously, after redeployment I get a pop-up notification: ExampleWithFolder.withfolder.examplewithfolder."folderpath not found

So the full table name ("folderpath/username.csv") is not shown. After refreshing and redeploying, running the test suite fails just as before, without a FAILED label.

ayush-shah commented 9 months ago

That seems weird; can you send the logs for that as well?

stavrosg11 commented 9 months ago

I cannot find the log entries relating to UI notifications. However, I can see that the names of test cases with subdirectories (that is, with slashes) are cut short on the main page (please find the attached screen capture). Clicking the name/link leads nowhere. That cannot be right. name_cut_at_slash

When I retrieve the pipeline info through the REST API, I learn that the fully qualified name contains an unescaped/unencoded slash character, while the quotes are escaped/encoded. A slash is of course a valid URL character, but unlike other characters it acts as a path separator.

...
    "service": {
        "id": "283ae5d4-96e8-40fa-927c-9763019248ea",
        "type": "testSuite",
        "name": "ExampleWithFolder.withfolder.examplewithfolder.\"folderpath/username.csv\".testSuite",
        "fullyQualifiedName": "ExampleWithFolder.withfolder.examplewithfolder.\"folderpath/username.csv\".testSuite",
        "deleted": false,
        "href": "http://localhost:8585/api/v1/dataQuality/testSuites/283ae5d4-96e8-40fa-927c-9763019248ea"
    }, ...

The log file of the failed test suite contains a pipeline status request URL that results in HTTP 404.

curl -v -XGET -H "Authorization: Bearer $OMDTOKEN" http://openmetadata:8585/api/v1/services/ingestionPipelines/ExampleWithFolder.withfolder.examplewithfolder.%22folderpath/username.csv%22.testSuite.folderpath_username_csv_TestSuite/pipelineStatus

[openmetadata.log]
INFO [2024-01-24 08:23:44,564] [dw-6158 - GET /api/v1/services/ingestionPipelines/name/ExampleWithFolder.withfolder.examplewithfolder.%22folderpath/username.csv%22.testSuite] o.o.s.e.CatalogGenericExceptionMapper - exception
javax.ws.rs.NotFoundException: HTTP 404 Not Found

If I run the same request against the successful test case WITHOUT a subdirectory, I get a null pointer exception instead. So the service is found, but I am calling it wrong: I should apparently be providing a timestamp, but I cannot find the spec for the REST API, so I am just poking around. The point is that the service is found, i.e., the URL appears to be valid when there is no slash in the name.

curl -v -XGET -H "Authorization: Bearer $OMDTOKEN" http://openmetadata:8585/api/v1/services/ingestionPipelines/ExampleWithoutFolder.withoutfolder.examplewithoutfolder.%22username.csv%22.testSuite.username_csv_TestSuite/pipelineStatus

[openmetadata.log]
ERROR [2024-01-24 08:25:48,124] [dw-6162 - GET /api/v1/services/ingestionPipelines/ExampleWithoutFolder.withoutfolder.examplewithoutfolder.%22username.csv%22.testSuite.username_csv_TestSuite/pipelineStatus] o.o.s.r.s.i.IngestionPipelineResource - Got exception: [NullPointerException] / message [startTs is marked non-null but is null] / related resource location: [org.openmetadata.service.resources.services.ingestionpipelines.IngestionPipelineResource.listPipelineStatuses](IngestionPipelineResource.java:794)
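The NullPointerException explicitly names startTs, which suggests the pipelineStatus endpoint expects an epoch-millis time range. The sketch below builds such a request URL; the endTs parameter is a guess by symmetry, and the overall shape is an assumption since I could not find the API spec. It also percent-encodes the slash in the name with safe="" so the FQN stays a single path segment:

```python
from urllib.parse import quote, urlencode
import time

def pipeline_status_url(base, fqn, start_ms, end_ms):
    # quote(..., safe="") percent-encodes every reserved character,
    # including the '/' inside the quoted table name.
    path = "/api/v1/services/ingestionPipelines/" + quote(fqn, safe="") + "/pipelineStatus"
    # "startTs" comes from the server's error message; "endTs" is assumed.
    return base + path + "?" + urlencode({"startTs": start_ms, "endTs": end_ms})

now_ms = int(time.time() * 1000)
url = pipeline_status_url(
    "http://openmetadata:8585",
    'ExampleWithFolder.withfolder.examplewithfolder."folderpath/username.csv"'
    ".testSuite.folderpath_username_csv_TestSuite",
    now_ms - 24 * 3600 * 1000,  # last 24 hours
    now_ms,
)
print(url)
```

Whether the server then accepts a %2F inside the name is exactly the question this issue raises.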

If I add yet another Minio S3 bucket containing a data file with a hash character in its name, the processing yields yet another error.

INFO [2024-01-24 10:47:02,168] [dw-6598 - POST /api/v1/services/ingestionPipelines/deploy/e6ede076-f9e8-4b5b-ad26-f990b5324a02] o.o.s.j.EntityRepository - Updated ingestionPipeline:e6ede076-f9e8-4b5b-ad26-f990b5324a02:ExampleWithHash.default.examplewithhash."user#name.csv".testSuite.user_name_csv_TestSuite
INFO [2024-01-24 10:47:18,596] [dw-6632 - PUT /api/v1/services/ingestionPipelines/ExampleWithHash.default.examplewithhash.%22user] o.o.s.e.CatalogGenericExceptionMapper - exception 
javax.ws.rs.NotAllowedException: HTTP 405 Method Not Allowed
...
Caused by: java.lang.IllegalArgumentException: UUID string too large

Filenames with hashes are bad practice, but the purpose was to test whether OpenMetadata encodes the filenames or not. Apparently, it does not. In my initial bug report, I presumed the slash deriving from the subdirectory was the cause of the problem.
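The truncation in the PUT log line ("...examplewithhash.%22user") is consistent with an unencoded "#" starting a URL fragment: everything after the first "#" is cut off the path, and clients do not even send the fragment to the server. A small demonstration with a standard URL parser, using a URL reconstructed to match the log line above:

```python
from urllib.parse import urlsplit

# Reconstructed URL: quotes were encoded to %22, but '#' was left as-is.
url = ('http://openmetadata:8585/api/v1/services/ingestionPipelines/'
       'ExampleWithHash.default.examplewithhash.%22user'
       '#name.csv%22.testSuite.user_name_csv_TestSuite')

parts = urlsplit(url)
# The path stops at the first '#' -- matching the truncated PUT in the log.
print(parts.path)      # .../examplewithhash.%22user
print(parts.fragment)  # name.csv%22.testSuite.user_name_csv_TestSuite
```

So both "/" and "#" point at the same root cause: the FQN is inserted into the URL path without full percent-encoding.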

If this problem cannot be reproduced, then I have done something wrong with my run-of-the-mill all-defaults installation.

ayush-shah commented 9 months ago

@stavrosg11 we have modified a lot of things on our latest main, which will be part of the 1.3.0 release. If you want to test this out, you can check out main, start the server locally, and try the same thing; the links on the UI will also work, as we made changes to the entity link handling.

stavrosg11 commented 9 months ago

Alright, thank you for your help. 👍