opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0

dlt fails with `DestinationHasFailedJobs` due to missing `jsonl` files in concurrent runs #2099

Closed Jabolol closed 1 month ago

Jabolol commented 1 month ago

Which area(s) are affected? (leave empty if unsure)

No response

To Reproduce

Materialise any partitioned asset, such as this one, and select a range of more than two partitions.
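
For illustration, here is a minimal sketch of the kind of setup involved; the resource, partition scheme, and staging settings below are assumptions, not the actual oso asset. Materializing several partitions of such an asset at once (e.g. via a backfill) is what triggers the failure.

```python
# Minimal sketch only: the resource, partition scheme, and staging settings
# are assumptions for illustration, not the actual oso asset definition.
import dlt
from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset

partitions = DailyPartitionsDefinition(start_date="2024-01-01")


@dlt.resource(name="expenses")
def expenses(partition_key: str):
    # Placeholder rows; the real resource pulls Open Collective expenses.
    yield {"partition": partition_key, "amount": 1}


@asset(partitions_def=partitions)
def open_collective_expenses(context: AssetExecutionContext) -> None:
    pipeline = dlt.pipeline(
        pipeline_name="open_collective",
        destination="bigquery",
        staging="filesystem",  # jsonl files are staged in a GCS bucket
        dataset_name="open_collective",
    )
    # A backfill over several partitions runs this body concurrently,
    # and all runs share the same ~/.dlt/pipelines/open_collective directory.
    pipeline.run(expenses(context.partition_key))
```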

Describe the Bug

When running a partitioned pipeline, dlt intermittently fails with a DestinationHasFailedJobs error. Upon investigation, it appears that the failure is caused by a missing jsonl file in the Google Cloud Storage bucket. For example:

Not found: URI gs://oso-dataset-transfer-bucket/open_collective/expenses/1725794458.6399574.1665a80ab2.jsonl

After some retries, the job sometimes succeeds.

This issue occurs specifically when multiple materializations are running concurrently. During this process, dlt converts the data to a jsonl file, which it then attempts to upload to the GCS bucket.

When only a single materialization is running, the jsonl file is successfully uploaded, and the process completes without error. However, when multiple materializations are active, the file is sometimes missing, causing the subsequent LOAD operation in BigQuery to fail due to the missing files.

2024-09-08 17:04:28 +0200 - dagster - ERROR - __ASSET_JOB_0 - f198d474-92f0-4fbb-a30c-5d4a25bcad02 - 64368 - open_collective__expenses - STEP_FAILURE - Execution of step "open_collective__expenses" failed.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "open_collective__expenses"::

dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage load when processing package 1725807796.518212 with exception:

<class 'FileNotFoundError'>
[Errno 2] No such file or directory: '/Users/$USER/.dlt/pipelines/open_collective_/load/normalized/1725807796.518212/started_jobs/expenses.b5d4a81f6e.0.reference' -> '/Users/$USER/.dlt/pipelines/open_collective_/load/normalized/1725807796.518212/completed_jobs/expenses.b5d4a81f6e.0.reference'
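
A quick way to confirm that the staged file is genuinely missing from the bucket is a throwaway check with the google-cloud-storage client (a hypothetical helper, not part of the pipeline, using the bucket and path from the error above):

```python
# Throwaway check (not part of the pipeline): does the staged jsonl that
# BigQuery is asked to LOAD actually exist in the bucket?
from google.cloud import storage


def staged_file_exists(bucket_name: str, blob_path: str) -> bool:
    client = storage.Client()
    return client.bucket(bucket_name).blob(blob_path).exists()


print(
    staged_file_exists(
        "oso-dataset-transfer-bucket",
        "open_collective/expenses/1725794458.6399574.1665a80ab2.jsonl",
    )
)
```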

Expected Behavior

dlt should create the files even when running concurrently.

ravenac95 commented 1 month ago

Oh this is even more strange. I would have expected GCS to be fine with handling any number of concurrent uploads.

Jabolol commented 1 month ago

> Oh this is even more strange. I would have expected GCS to be fine with handling any number of concurrent uploads.

GCS does handle the uploads correctly; the issue is that dlt fails to generate those files when running concurrently.

[Errno 2] No such file or directory: '/Users/$USER/.dlt/pipelines/open_collective_/load/normalized/1725807796.518212/started_jobs/expenses.b5d4a81f6e.0.reference'

In this error, 1725807796.518212 is the asset ID that gets loaded to GCS so that BigQuery can perform a LOAD; since dlt never generates that file, it is never uploaded and the BigQuery load fails.

Here's a failing BigQuery load, for reference:

[screenshot: failing BigQuery load job]
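
One experiment that might isolate the race (this is an assumption about the cause, not a confirmed fix) is to give each concurrent materialization its own dlt working directory via pipelines_dir, so runs stop sharing ~/.dlt/pipelines/open_collective_ and cannot interfere with each other's started_jobs/completed_jobs renames:

```python
# Sketch under the assumption that the failure comes from concurrent runs
# sharing the default ~/.dlt/pipelines working directory: give every run its
# own pipelines_dir. The helper and its arguments are illustrative only.
import tempfile

import dlt


def run_partition(partition_key: str, data) -> None:
    work_dir = tempfile.mkdtemp(prefix=f"dlt-open-collective-{partition_key}-")
    pipeline = dlt.pipeline(
        pipeline_name="open_collective",
        pipelines_dir=work_dir,  # instead of the shared default location
        destination="bigquery",
        staging="filesystem",
        dataset_name="open_collective",
    )
    pipeline.run(data)
```

If the error disappears with isolated working directories, that would point at dlt's local load-package handling rather than the GCS uploads themselves.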
ravenac95 commented 1 month ago

Ah! I misunderstood. Hrm, still very odd. Do we know how large these files are? I'm curious whether the k8s node is running out of space locally; they aren't equipped with very large disks. I'll double-check the exact spec, but if we are generating hundreds of gigabytes of data it might fail to write, though I'd expect a different error in that case.
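
To answer the size question, a quick way to measure what a run leaves on disk, using the default pipelines path from the traceback above, would be something like:

```python
# Rough measurement of how much local disk the dlt working directory for this
# pipeline is using; the path matches the default location in the traceback.
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


pipelines_dir = Path.home() / ".dlt" / "pipelines" / "open_collective_"
print(f"{dir_size_bytes(pipelines_dir) / 1e9:.2f} GB")
```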