starlake-ai / starlake

Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
http://starlake.ai/
Apache License 2.0

[BUG] - Invalid expected columns in load unit testing for timestamp #889

Closed: tiboun closed this issue 1 month ago

tiboun commented 6 months ago

Given the load schema:

# metadata/load/starbake/_config.sl.yml
---
version: 1
load:
  name: "starbake"
  metadata:
    directory: "{{incoming_path}}/starbake"
# metadata/load/starbake/orders.sl.yml
---
version: 1
table:
  name: "orders"
  pattern: "order.*_(?<mode>F|D).json"
  attributes:
  - name: "customer_id"
    type: "long"
    array: false
  - name: "order_id"
    type: "long"
    array: false
  - name: "status"
    type: "string"
    array: false
  - name: "timestamp"
    type: "iso_date_time"
    array: false
  metadata:
    format: "JSON"
    array: true
    withHeader: true
    writeStrategy:
      types:
        OVERWRITE: 'group("mode") == "F"'
        UPSERT_BY_KEY: 'group("mode") == "D"'
      key:
      - "order_id"

and the given input data to load:

// metadata/tests/load/starbake/orders/init/orders_F.json and _expected.json
[
    {
        "order_id" : 1,
        "customer_id" : 6,
        "timestamp" : "2024-02-05T21:19:15.454Z",
        "status" : "Cancelled"
    },
    {
        "order_id" : 2,
        "customer_id" : 23,
        "timestamp" : "2024-01-02T10:44:37.590Z",
        "status" : "Pending"
    },
    {
        "order_id" : 3,
        "customer_id" : 20,
        "timestamp" : "2024-02-10T22:10:30.685Z",
        "status" : "Delivered"
    }
]

The test output is:


Unexpected Records
customer_id,order_id,status,timestamp
20,3,Delivered,2024-02-10 23:10:30.685
23,2,Pending,2024-01-02 11:44:37.59
6,1,Cancelled,2024-02-05 22:19:15.454

Missing Records
order_id,customer_id,timestamp,status
1,6,2024-02-05T21:19:15.454Z,Cancelled
2,23,2024-01-02T10:44:37.590Z,Pending
3,20,2024-02-10T22:10:30.685Z,Delivered

We can see that the input schema is:

image

and the expected schema is:

image

I think this output comes from comparing a string with a timestamp: the expected values keep their ISO string form while the loaded values are real timestamps. This might be related to how the expected data is loaded into DuckDB.
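To illustrate the suspected mismatch, here is a minimal Python sketch. It is not Starlake or DuckDB code; it only reuses the values from the test output above. The `ASSUMED_ZONE` of UTC+01:00 is an assumption inferred from the one-hour shift in the "Unexpected Records", and the helper names `parse_expected` and `parse_loaded` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the loaded values were rendered in a UTC+01:00 time zone,
# which is what the one-hour shift in the report suggests.
ASSUMED_ZONE = timezone(timedelta(hours=1))

# (string from _expected.json, value shown under "Unexpected Records")
pairs = [
    ("2024-02-05T21:19:15.454Z", "2024-02-05 22:19:15.454"),
    ("2024-01-02T10:44:37.590Z", "2024-01-02 11:44:37.59"),
    ("2024-02-10T22:10:30.685Z", "2024-02-10 23:10:30.685"),
]


def parse_expected(text: str) -> datetime:
    """Parse the ISO-8601 string, treating the trailing Z as UTC."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def parse_loaded(text: str) -> datetime:
    """Parse the rendered timestamp and attach the assumed zone."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    return parsed.replace(tzinfo=ASSUMED_ZONE)


# Comparing the raw text fails for every row, as in the test report.
string_matches = [expected == loaded for expected, loaded in pairs]

# Comparing both sides as instants succeeds under the assumed zone.
instant_matches = [
    parse_expected(expected) == parse_loaded(loaded)
    for expected, loaded in pairs
]

print(string_matches)
print(instant_matches)
```

Under that assumption, every row differs as text but denotes the same instant once both sides are parsed as timestamps. This is consistent with the expected file being read as strings while the loaded table holds timestamps.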

hayssams commented 6 months ago

Done. Please first update your metadata/types/default.sl.yml file from src/main/resources/types/default.sl.yml.