slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0
301 stars 16 forks source link

Writing to BigQuery fails only during initial replication #230

Closed cbini closed 3 months ago

cbini commented 3 months ago

Trying Sling out for the first time, and I'm noticing that the first time I try to replicate a table, it attempts to create the schema twice. This results in BigQuery throwing this error: googleapi: Error 409: Already Exists: Dataset teamster-332318:staging_oracle_powerschool, duplicate

However, it executes successfully on subsequent runs. It looks like it's trying to create the schema twice on the initial run.

Here are the relevant logs:

2024-03-15 17:39:30 INF writing to target database [mode: full-refresh]
2024-03-15 17:39:30 DBG create schema `staging_oracle_powerschool`
2024-03-15 17:39:31 DBG drop table if exists `staging_oracle_powerschool`.`students_tmp`
2024-03-15 17:39:32 DBG table `staging_oracle_powerschool`.`students_tmp` dropped
2024-03-15 17:39:33 DBG create schema `staging_oracle_powerschool`
2024-03-15 17:39:33 INF execution failed
fatal:
--- sling_cli.go:402 func1 ---
--- sling_cli.go:448 cliInit ---
--- cli.go:284 CliProcess ---
~ failure running replication (see docs @ https://docs.slingdata.io/sling-cli)
--- sling_logic.go:197 processRun ---

--------------------------- students ---------------------------
~ execution failed
--- task_run.go:98 func1 ---
~ Could not WriteToDb
--- task_run.go:526 runDbToDb ---
~ could not create temp table `staging_oracle_powerschool`.`students_tmp`
--- task_run_write.go:192 WriteToDb ---
~ Error checking & creating schema staging_oracle_powerschool
--- task_func.go:63 createTableIfNotExists ---
~ Error creating schema `staging_oracle_powerschool`
--- task_func.go:42 createSchemaIfNotExists ---
~ Could not execute SQL
--- database.go:1067 Exec ---
~ Error executing create schema `staging_oracle_powerschool`
--- database_bigquery.go:243 ExecContext ---
googleapi: Error 409: Already Exists: Dataset teamster-332318:staging_oracle_powerschool, duplicate
flarco commented 3 months ago

Thanks, that's interesting. Should detect that schema already exists (logic here) Can you share the replication config you're using?

cbini commented 3 months ago

Here you go! I should add that I'm running these via Dagster, if that makes any difference. All that does is sling run via the Python package.

source: ORACLE_POWERSCHOOL
target: BIGQUERY_TEAMSTER

defaults:
  mode: full-refresh
  object: "staging_{source_name}.{stream_table}"

streams:
  students:
    select:
      - -custom
flarco commented 3 months ago

Ah I see stream students doesn't have a schema. So you're probably defining that in your connection creds, correct? Will test that on my side.

cbini commented 3 months ago

Yeah, that's right. I have a default schema set on the ORACLE_POWERSCHOOL connection. Also, BIGQUERY_TEAMSTER is required to have a dataset, but I'm overriding that in defaults.object

flarco commented 3 months ago

I'm not able to reproduce. Can you try running the replication with the CLI (not dagster)?

sling run -r /path/to/replication.yaml -d

flarco commented 3 months ago

Actually, nevermind. Just reproduced. It's the staging_{source_name} part in the object. Will fix, thanks!

flarco commented 3 months ago

Fixed in https://github.com/slingdata-io/sling-cli/pull/237.