slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0
299 stars 16 forks source link

`_sling_loaded_at` added to table when loading files from s3/gcs (mode: full-refresh) #275

Closed justinbricker-acuitymd closed 2 months ago

justinbricker-acuitymd commented 2 months ago

Issue Description

source: TEST_JUSTIN_GCS_BUCKET.    # GCS (also happening on S3)
target: TEST_TARGET__APP_DB_DEV    #postgres (also happening on BQ)

defaults:
  mode: full-refresh
  object: public.test_sling

streams:
  "sample.csv":     # Also happening with parquet files

env:
  SLING_LOADED_AT_COLUMN: false
  SLING_STREAM_URL_COLUMN: false     # Toggling this flag produces the expected results
2024-04-25 17:25:58 INF Sling Replication [1 streams] | TEST_JUSTIN_GCS_BUCKET -> TEST_TARGET__APP_DB_DEV

2024-04-25 17:25:58 INF [1 [/](https://file+.vscode-resource.vscode-cdn.net/) 1] running stream sample.csv
2024-04-25 17:25:58 DBG Sling version: 1.2.6 (darwin arm64)
2024-04-25 17:25:58 DBG type is file-db
2024-04-25 17:25:58 DBG using source options: {"trim_space":false,"empty_as_null":true,"header":true,"fields_per_rec":-1,"compression":"AUTO","null_if":"NULL","datetime_format":"AUTO","skip_blank_lines":false,"max_decimals":-1}
2024-04-25 17:25:58 DBG using target options: {"datetime_format":"auto","max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-04-25 17:25:58 INF connecting to target database (postgres)
2024-04-25 17:25:58 DBG opened "postgres" connection (conn-postgres-muE)
2024-04-25 17:25:59 INF reading from source file system (gs)
2024-04-25 17:25:59 DBG reading datastream from gs://justinbricker-delete-whenever/sample.csv [format=csv]
2024-04-25 17:25:59 DBG merging csv readers of 1 files [concurrency=10] from gs://justinbricker-delete-whenever/sample.csv
2024-04-25 17:25:59 DBG processing reader from gs://justinbricker-delete-whenever/sample.csv
2024-04-25 17:25:59 DBG delimiter auto-detected: ","
2024-04-25 17:25:59 INF writing to target database [mode: full-refresh]
2024-04-25 17:25:59 DBG drop table if exists "public"."test_sling_tmp"
2024-04-25 17:26:00 DBG table "public"."test_sling_tmp" dropped
2024-04-25 17:26:00 DBG create table if not exists "public"."test_sling_tmp" ("department" text,
"title" text,
"title_level" text,
"display_in_app" bool,
"_sling_loaded_at" integer)
2024-04-25 17:26:00 INF streaming data
2024-04-25 17:26:00 DBG select count(*) cnt from "public"."test_sling_tmp"
2024-04-25 17:26:00 DBG comparing checksums ["department [text | text]","title [text | text]","title_level [text | text]","display_in_app [bool | boolean]","_sling_loaded_at [integer | integer]"]
2024-04-25 17:26:00 DBG select sum(length("department"::text)) as "department", sum(length("title"::text)) as "title", sum(length("title_level"::text)) as "title_level", sum(length("display_in_app"::text)) as "display_in_app", sum(abs("_sling_loaded_at")) as "_sling_loaded_at" from "public"."test_sling_tmp"
2024-04-25 17:26:00 DBG drop table if exists "public"."test_sling"
2024-04-25 17:26:01 DBG table "public"."test_sling" dropped
2024-04-25 17:26:01 DBG create table if not exists "public"."test_sling" ("department" text,
"title" text,
"title_level" text,
"display_in_app" bool,
"_sling_loaded_at" bigint)
2024-04-25 17:26:01 INF created table "public"."test_sling"
2024-04-25 17:26:01 DBG inserting _sling_loaded_at [integer] into _sling_loaded_at [bigint]
2024-04-25 17:26:01 DBG insert into "public"."test_sling" ("department", "title", "title_level", "display_in_app", "_sling_loaded_at") select "department", "title", "title_level", "display_in_app", "_sling_loaded_at" from "public"."test_sling_tmp"
2024-04-25 17:26:01 DBG inserted rows into "public"."test_sling" from temp table "public"."test_sling_tmp"
2024-04-25 17:26:01 INF inserted 8 rows into "public"."test_sling" in 2 secs [3 r/s]
2024-04-25 17:26:01 DBG drop table if exists "public"."test_sling_tmp"
2024-04-25 17:26:01 DBG table "public"."test_sling_tmp" dropped
2024-04-25 17:26:01 DBG closed "postgres" connection (conn-postgres-muE)
2024-04-25 17:26:01 INF execution succeeded

2024-04-25 17:26:01 INF Sling Replication Completed in 3s | TEST_JUSTIN_GCS_BUCKET -> TEST_TARGET__APP_DB_DEV | 1 Successes | 0 Failures
justinbricker-acuitymd commented 2 months ago

I double checked to be sure that the column isn't somehow in the csv file 😅, but alas it is not.

department,title,title_level,display_in_app
Administration,Administrator,Chief,false
Administration,CIO,Chief,false
Administration,CMO,Chief,false
Administration,COO,Chief,false
Administration,CEO,Chief,false
Administration,CFO,Chief,false
Administration,Buyer,Chief,true
Purchasing,Buyer,Staff,true
flarco commented 2 months ago

Yes, this has been the default behavior for file-to-db loads. Fixed for next release: https://github.com/slingdata-io/sling-cli/pull/274 Feel free to compile the binary yourself, or wait for release this weekend.

justinbricker-acuitymd commented 2 months ago

Fantastic, thanks for the good news