slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it into a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0

Issue with wildcard use with either local storage or FTP #326

Closed kasander closed 2 days ago

kasander commented 1 week ago

Issue Description

source: 'local://'
target: SNOWTEST

defaults:
  mode: full-refresh

  source_options:
    format: csv
    empty_as_null: true    

  target_options:
    adjust_column_type: true    

streams:
  "file://C:/whi/*.csv":
    object: RAW.CatalogTest
    single: true

PS P:\dagster\testing> sling run -d -r test-local.yaml
2024-06-17 09:53:39 INF Sling Replication [1 streams] | local:// -> SNOWTEST

2024-06-17 09:53:39 INF [1 / 1] running stream file://C:/whi/*.csv
2024-06-17 09:53:39 DBG Sling version: 1.2.11 (windows amd64)
2024-06-17 09:53:39 DBG type is file-db
2024-06-17 09:53:39 DBG using source options: {"trim_space":false,"empty_as_null":true,"header":true,"fields_per_rec":-1,"compression":"auto","format":"csv","null_if":"NULL","datetime_format":"AUTO","skip_blank_lines":false,"max_decimals":-1,"columns":{}}
2024-06-17 09:53:39 DBG using target options: {"datetime_format":"auto","file_max_rows":0,"max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":true,"column_casing":"source"}
2024-06-17 09:53:39 INF connecting to target database (snowflake)
2024-06-17 09:53:39 DBG opened "snowflake" connection (conn-snowflake-9Wg)
2024-06-17 09:53:40 INF reading from source file system (file)
2024-06-17 09:53:40 INF no files found
2024-06-17 09:53:40 INF execution succeeded

2024-06-17 09:53:40 INF Sling Replication Completed in 1s | local:// -> SNOWTEST | 1 Successes | 0 Failures

There are 4 CSV files in that folder, yet Sling reports "no files found" and still marks the run as succeeded.
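
To rule out a problem with the folder itself, the same pattern can be checked outside of Sling. The snippet below is only a minimal sketch using Python's standard `glob` module. The helper name `list_matches` is hypothetical, and the pattern is the stream key from the config above with the `file://` scheme removed. It says nothing about how Sling resolves wildcards internally.

```python
import glob
import os


def list_matches(pattern: str) -> list[str]:
    """Return the sorted regular files matching a glob pattern.

    Hypothetical helper for a sanity check only; it is independent of Sling.
    """
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


if __name__ == "__main__":
    # Assumption: same folder as the stream key, without the file:// scheme.
    for path in list_matches("C:/whi/*.csv"):
        print(path)
```

If this prints the four expected files, the folder contents and the pattern are fine, which points at how the stream key is interpreted rather than at the files.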

Now the FTP config. This one works fine when I give it a full filename, but fails with a fatal error when I use a wildcard:

source: MY_FTP
target: SNOWTEST

defaults:
  mode: full-refresh

  source_options:
    format: csv
    empty_as_null: true    

  target_options:
    adjust_column_type: true    

streams:
  'test/*.csv':
    object: RAW.CatalogTest
    single: true

And the log output of the -d command:

PS P:\dagster\testing> sling run -d -r test.yaml
2024-06-17 11:20:45 INF Sling Replication [1 streams] | MY_FTP -> SNOWTEST

2024-06-17 11:20:45 INF [1 / 1] running stream test/*.csv
2024-06-17 11:20:45 DBG Sling version: 1.2.11 (windows amd64)
2024-06-17 11:20:45 DBG type is file-db
2024-06-17 11:20:45 DBG using source options: {"trim_space":false,"empty_as_null":true,"header":true,"fields_per_rec":-1,"compression":"auto","format":"csv","null_if":"NULL","datetime_format":"AUTO","skip_blank_lines":false,"max_decimals":-1,"columns":{}}
2024-06-17 11:20:45 DBG using target options: {"datetime_format":"auto","file_max_rows":0,"max_decimals":-1,"use_bulk":true,"add_new_columns":true,"adjust_column_type":true,"column_casing":"source"}
2024-06-17 11:20:45 INF connecting to target database (snowflake)
2024-06-17 11:20:45 DBG opened "snowflake" connection (conn-snowflake-aZ7)
2024-06-17 11:20:46 INF reading from source file system (ftp)
2024-06-17 11:20:46 DBG reading datastream from ftp://ftp.***.com/test/*.csv [format=csv]
2024-06-17 11:20:46 DBG merging csv readers of 1 files [concurrency=8] from ftp://ftp.***.com/test/*.csv
2024-06-17 11:20:46 DBG processing reader from ftp://ftp.***.com/test/*.csv
2024-06-17 11:20:46 INF execution failed

2024-06-17 11:20:47 INF ~ dataflow error while waiting for ready state
context canceled
2024-06-17 11:20:47 INF Sling Replication Completed in 1s | MY_FTP -> SNOWTEST | 0 Successes | 1 Failures

fatal:
--- proc.go:267 main ---
--- sling_cli.go:442 main ---
--- sling_cli.go:474 cliInit ---
--- cli.go:286 CliProcess ---
~ failure running replication (see docs @ https://docs.slingdata.io/sling-cli)
--- sling_run.go:188 processRun ---

--------------------------- test/*.csv ---------------------------
~ execution failed
--- task_run.go:97 func1 ---
~ could not read from file
--- task_run.go:378 runFileToDB ---
~ Could not FileSysReadDataflow for ftp
--- task_run_read.go:251 ReadFromFile ---
~ error getting dataflow
--- fs.go:551 ReadDataflow ---
--- fs.go:993 GetDataflow ---
~ dataflow error while waiting for ready state
--- dataflow.go:616 WaitReady ---

Here is a quick sample of the discover command on that connection. You can see there are files in there:

PS P:\dagster\testing> sling conns discover MY_FTP -p 'test/*.csv'  
+---+----------------------------------------+------+---------+------------------------------+
| # | NAME                                   | TYPE | SIZE    | LAST UPDATED (UTC)           |
+---+----------------------------------------+------+---------+------------------------------+
| 1 | test///WHI-Catalog2024-01-02-45317.csv | file | 6.2 MiB | 2024-06-17 13:40:58 (1h ago) |
| 2 | test///WHI-Catalog2024-01-03-45317.csv | file | 27 MiB  | 2024-06-17 13:40:59 (1h ago) |
| 3 | test///WHI-Catalog2024-01-04-45317.csv | file | 31 MiB  | 2024-06-17 13:40:59 (1h ago) |
| 4 | test//WHI-Catalog2024-01-01-45317.csv  | file | 502 KiB | 2024-06-17 13:40:58 (1h ago) |
+---+----------------------------------------+------+---------+------------------------------+
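
The names in the listing above contain repeated slashes (`test///...` and `test//...`). The following is a minimal sketch, using only Python's standard library, of how such names could be normalized before being compared against the `test/*.csv` pattern. The helpers `normalize` and `matches` are hypothetical and purely illustrative. This is not the change made in Sling, and it is not confirmed that the repeated slashes are the cause of the failure.

```python
import fnmatch
import posixpath


def normalize(path: str) -> str:
    """Collapse repeated slashes in a relative, POSIX-style path."""
    return posixpath.normpath(path)


def matches(path: str, pattern: str) -> bool:
    """Check a normalized path against a shell-style pattern.

    Note: fnmatch's `*` also matches `/`, so this check is looser than a
    true per-directory glob.
    """
    return fnmatch.fnmatchcase(normalize(path), pattern)


if __name__ == "__main__":
    # Names copied from the discover output above.
    names = [
        "test///WHI-Catalog2024-01-02-45317.csv",
        "test//WHI-Catalog2024-01-01-45317.csv",
    ]
    for name in names:
        print(normalize(name), matches(name, "test/*.csv"))
```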

kristianandre commented 1 week ago

Seems related to my issue https://github.com/slingdata-io/sling-cli/issues/314. Weirdly enough, I was able to use wildcards locally but not in Azure Blob Storage.

flarco commented 1 week ago

Yeah, it's related. I know what's wrong, work in progress.

flarco commented 2 days ago

Should work in the next release. Fixed by https://github.com/slingdata-io/sling-cli/pull/318/commits/6557b23b35700cc91d0ce1132aaf7e24e4c59525 Closing.