jwills / target-duckdb

A Singer.io target for DuckDB
Other
17 stars 12 forks source link

Bug: Keys are duplicated when data is imported using 'meltano elt SOURCE target-duckdb' #11

Closed matsonj closed 1 year ago

matsonj commented 1 year ago

Describe the bug When you execute 'meltano elt SOURCE target-duckdb', instead of only detecting incremental changes. It appears this is because "_smart_source_lineno" is getting populated with null.

To Reproduce Steps to reproduce the behavior:

  1. config any csv with tap-spreadsheets-anywhere.
  2. run elt with target-duckdb
  3. run it again
  4. the data will be added the table again.

Expected behavior only new records are added on step 3 when you run elt again. also, the column list should not change (meltano is running ALTER table to modify column names even when the source is unchanged in order to handle 'duplicate columns' for _smart_source_lineno".

Screenshots image

Your environment

matsonj commented 1 year ago

example tap config:

plugins:
  extractors:
  - name: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
    config:
      tables:
      - path: file:///path/to/file
        name: test_load
        pattern: .*
        start_date: '2001-01-01T00:00:00Z'
        key_properties: [key]

test_load.csv

After running 'meltano elt tap-spreadsheets-anywhere target-duckdb' it will have 8 rows (should have 4). In my experience with identical config, this same behavior does not replicate with the postgres target. But I have not tried this explicitly.

matsonj commented 1 year ago

it is possible that this has to do with "key" enforced as required in target-duckdb. target-postgres will accept a blank key_property i.e.

key_properties: []
jwills commented 1 year ago

What happens if you give target-duckdb an empty key_properties? I bet this is related to this limitation https://github.com/jwills/target-duckdb/blob/main/target_duckdb/db_sync.py#L426 wherein I cannot modify primary key columns even if the modification is a no-op.

jwills commented 1 year ago

Okay I think I see how to fix this but it'll be a little tricky to do it

matsonj commented 1 year ago

Snippet here Primary key is set to mandatory but not defined in the [test_load] stream

full error below

2022-10-02T19:25:42.623374Z [info     ] Running extract & load...      name=meltano run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb
2022-10-02T19:25:42.785014Z [warning  ] No state was found, complete import.
2022-10-02T19:25:44.914039Z [info     ] INFO Using supplied catalog /home/jacob/meltano-projects/nba-monte-carlo/.meltano/run/elt/2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb/eb647e6b-b171-4f97-8986-b36e250bd1db/tap.properties.json. cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.915002Z [info     ] INFO Processing 3 selected streams from Catalog cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.915567Z [info     ] INFO Syncing stream:test_load  cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.916169Z [info     ] INFO Walking /mnt/c/Users/matso/OneDrive/Documents/test_data/test. cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.918400Z [info     ] time=2022-10-02 12:25:44 name=target_duckdb level=CRITICAL message=Primary key is set to mandatory but not defined in the [test_load] stream cmd_type=loader name=target-duckdb run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.919461Z [info     ] Traceback (most recent call last): cmd_type=loader name=target-duckdb run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.920324Z [info     ]   File "/home/jacob/meltano-projects/nba-monte-carlo/.meltano/loaders/target-duckdb/venv/bin/target-duckdb", line 8, in <module> cmd_type=loader name=target-duckdb run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.921014Z [info     ]     sys.exit(main())           cmd_type=loader name=target-duckdb run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.921725Z [info     ]   File "/home/jacob/meltano-projects/nba-monte-carlo/.meltano/loaders/target-duckdb/venv/lib/python3.8/site-packages/target_duckdb/__init__.py", line 424, in main cmd_type=loader name=target-duckdb run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.922548Z [info     ]     persist_lines(connection, config, singer_messages) cmd_type=loader name=target-duckdb run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.923898Z [info     ]   File "/home/jacob/meltano-projects/nba-monte-carlo/.meltano/loaders/target-duckdb/venv/lib/python3.8/site-packages/target_duckdb/__init__.py", line 279, in persist_lines cmd_type=loader name=target-duckdb run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.925466Z [info     ]     raise Exception("key_properties field is required") cmd_type=loader name=target-duckdb run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.926490Z [info     ] Exception: key_properties field is required cmd_type=loader name=target-duckdb run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.928226Z [info     ] INFO Found 1 files.            cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.942420Z [info     ] INFO Checking 1 resolved objects for any that match regular expression ".*" and were modified since 2001-01-01 00:00:00+00:00 cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.943156Z [info     ] INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details. cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.943833Z [info     ] INFO Syncing file "test_load.csv". cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.960331Z [info     ] INFO Wrote 4 records for stream "test_load". cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.961139Z [info     ] INFO Syncing stream:schedule   cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.962463Z [info     ] INFO Walking /mnt/c/Users/matso/OneDrive/Documents/test_data/schedule. cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.975121Z [info     ] INFO Found 1 files.            cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.994512Z [info     ] INFO Checking 1 resolved objects for any that match regular expression ".*" and were modified since 2001-01-01 00:00:00+00:00 cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.995376Z [info     ] INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details. cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:44.996258Z [info     ] INFO Syncing file "nba_schedule_2023.csv". cmd_type=extractor name=tap-spreadsheets-anywhere run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:45.021431Z [error    ] Loading failed                 code=1 message=Exception: key_properties field is required name=meltano run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb
2022-10-02T19:25:45.022673Z [info     ] ELT could not be completed: Loader failed. cmd_type=elt name=meltano run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:45.023162Z [info     ] For more detailed log messages re-run the command using 'meltano --log-level=debug ...' CLI flag. cmd_type=elt name=meltano run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:45.024320Z [info     ] Note that you can also check the generated log file at '/home/jacob/meltano-projects/nba-monte-carlo/.meltano/logs/elt/2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb/eb647e6b-b171-4f97-8986-b36e250bd1db/elt.log'. cmd_type=elt name=meltano run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
2022-10-02T19:25:45.025489Z [info     ] For more information on debugging and logging: https://docs.meltano.com/reference/command-line-interface#debugging cmd_type=elt name=meltano run_id=eb647e6b-b171-4f97-8986-b36e250bd1db state_id=2022-10-02T192539--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
jwills commented 1 year ago

Ah okay- so it sounds like the upstream requires a key

jwills commented 1 year ago

Err, it's really this, b/c the ingest is config'd to be incremental: https://github.com/jwills/target-duckdb/blob/main/target_duckdb/__init__.py#L265

jwills commented 1 year ago

Okay I think I have a quick fix here, PR coming shortly