slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0

Adding `ignore_existing: true` fails the execution when target doesn't exist #301

Closed dduong1603 closed 1 month ago

dduong1603 commented 1 month ago

Issue Description

source: S3
target: SFTP
streams:
  {s3_prefix}/{file_prefix}_<redacted>_*.csv:
    object: '{folder_path}/{stream_file_name}.csv'
    target_options:
      format: csv
      ignore_existing: true
env:
  s3_prefix: <redacted>
  file_prefix: <redacted>
  folder_path: <redacted>

Output in 1.2.10 when the target file does not exist yet:

2024-05-23 17:43:30 INF [1 / 4] running stream s3://<redacted>/<redacted>.csv
2024-05-23 17:43:30 DBG Sling version: 1.2.10 (linux amd64)
2024-05-23 17:43:30 DBG type is file-file
2024-05-23 17:43:30 DBG using source options: {"trim_space":false,"empty_as_null":true,"header":true,"fields_per_rec":-1,"compression":"auto","null_if":"NULL","datetime_format":"AUTO","skip_blank_lines":false,"max_decimals":-1}
2024-05-23 17:43:30 DBG using target options: {"header":true,"compression":"auto","concurrency":7,"datetime_format":"auto","delimiter":",","file_max_rows":0,"file_max_bytes":0,"format":"csv","max_decimals":-1,"use_bulk":true,"ignore_existing":true,"add_new_columns":true,"adjust_column_type":false,"column_casing":"source"}
2024-05-23 17:43:30 INF reading from source file system (s3)
2024-05-23 17:43:30 DBG reading datastream from s3://<redacted>/<redacted>.csv [format=csv]
2024-05-23 17:43:30 DBG merging csv readers of 1 files [concurrency=10] from s3://<redacted>/<redacted>.csv
2024-05-23 17:43:30 DBG processing reader from s3://<redacted>/<redacted>.csv
2024-05-23 17:43:30 DBG delimiter auto-detected: ","
2024-05-23 17:43:30 INF writing to target file system (sftp)
2024-05-23 17:43:30 INF execution failed
2024-05-23 17:43:30 INF ~ error listing path: "/<redacted>/<redacted>.csv"
file does not exist

Previously, in 1.2.9 (or in 1.2.10 if I remove `ignore_existing: true`), the run would log a warning and continue with the write:

2024-05-23 17:58:42 WRN could not delete path sftp://<redacted>/<redacted>.csv
~ error listing path: "/<redacted>/<redacted>.csv"
file does not exist
2024-05-23 17:58:42 DBG writing to sftp://<redacted>/<redacted>.csv [fileRowLimit=0 fileBytesLimit=0 compression=auto concurrency=7 useBufferedStream=false fileFormat=csv]
2024-05-23 17:58:43 DBG wrote 14 kB: 67 rows [947 r/s]
2024-05-23 17:58:43  wrote 67 rows to sftp://<redacted>/<redacted>.csv in 0 secs [947 r/s]
2024-05-23 17:58:43  execution succeeded

If the target already exists, Sling skips the file and marks the execution as succeeded, as expected:

2024-05-23 18:14:37 DBG not writing since file/folder exists at sftp://<redacted>/<redacted>.csv (ignore_existing=true)
2024-05-23 18:14:37 DBG wrote 0 B: 0 rows [0 r/s]
2024-05-23 18:14:37  wrote 0 rows to sftp://<redacted>/<redacted>.csv in 0 secs [0 r/s]
2024-05-23 18:14:37  execution succeeded
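
The behavior expected from `ignore_existing: true` can be sketched as a small decision function. This is only an illustration of the semantics described above, not Sling's actual implementation: the function name `should_write` is hypothetical, and the local filesystem stands in for the SFTP target. The key point is that a "file does not exist" result from the existence check means there is nothing to skip, so the write should proceed instead of failing the run.

```python
import os
import tempfile


def should_write(path: str, ignore_existing: bool) -> bool:
    """Return True if the target should be written, False if it should be skipped.

    Hypothetical helper; the local filesystem is an assumption standing in
    for the remote SFTP target.
    """
    if not ignore_existing:
        # Without ignore_existing the target is always (over)written.
        return True
    try:
        os.stat(path)
    except FileNotFoundError:
        # A missing target is not an error: nothing exists to be ignored,
        # so the write goes ahead.
        return True
    # The target exists and ignore_existing is set: skip the write.
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "out.csv")
        print(should_write(target, ignore_existing=True))   # target missing
        open(target, "w").close()
        print(should_write(target, ignore_existing=True))   # target present
        print(should_write(target, ignore_existing=False))  # overwrite mode
```

Under these assumptions, the three cases correspond to the three logs in this report: a missing target with the option set should be written, an existing target with the option set should be skipped, and a run without the option always writes.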

flarco commented 1 month ago

Should be fixed in https://github.com/slingdata-io/sling-cli/pull/303 with commit https://github.com/slingdata-io/sling-cli/pull/303/commits/4131cb5f0a9cfaa6c18d1b1b7009ac3d0d9c3689. Feel free to compile the binary and test. Closing for now.