kamu-data / kamu-cli

Next-generation decentralized data lakehouse and a multi-party stream processing network
https://kamu.dev

Struct not supported in nested data #692

Open onyalcin opened 2 months ago

onyalcin commented 2 months ago

I hit an issue while trying to pull a nested data source structured as follows:

{
  "starred_at": "2020-07-05T01:34:55Z",
  "user": {
    "login": "onyalcin",
    "id": 7300802
  }
}

I received the following error, which indicates that STRUCT is not supported:

Failed to pull com.github.kamu-cli.stargazers:
0: Internal error
1: This feature is not implemented: Unsupported SQL type Custom(ObjectName([Ident { value: "STRUCT", quote_style: None }]), ["id", "BIGINT", "login", "STRING"])

Currently we need to add a preprocessing step with jq to handle this, which is too complex. Can we support this feature during the read phase in DataFusion?
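
For illustration only, the kind of flattening such a preprocessing step has to do can be sketched in Python. This is not the jq command used here and not part of kamu. It assumes the response is a JSON array of objects shaped like the sample above, and the `flatten` helper and the `user_login` / `user_id` column names are hypothetical.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining key names with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key as a prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Sample payload (assumed to be an array of objects like the one above).
raw = """
[
  {
    "starred_at": "2020-07-05T01:34:55Z",
    "user": {"login": "onyalcin", "id": 7300802}
  }
]
"""

rows = [flatten(record) for record in json.loads(raw)]

# Newline-delimited JSON with only scalar columns.
ndjson = "\n".join(json.dumps(row) for row in rows)
print(ndjson)
```

With the data flattened this way, the read schema would only need scalar columns (for example `user_id BIGINT` and `user_login STRING`) instead of a STRUCT.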

Below is the dataset definition I worked with:

kind: DatasetSnapshot
version: 1
content:
  name: com.github.kamu-cli.stargazers
  kind: Root
  metadata:
    - kind: SetPollingSource
      fetch:
        kind: Url
        url: https://api.github.com/repos/kamu-data/kamu-cli/stargazers
        headers:
          - name: User-Agent
            value: kamu
          - name: Accept
            value: application/vnd.github.star+json
      read:
        kind: Json
        schema:
          - starred_at TIMESTAMP
          - user STRUCT(id BIGINT, login STRING)
      preprocess:
        kind: Sql
        engine: datafusion
        query: |
          SELECT
            starred_at as event_time,
            user.id as user_id,
            user.login as user_name
          FROM input
      merge:
        kind: Snapshot
        primaryKey:
          - event_time
          - user_id
    - kind: SetInfo
      description: Stars of the selected github repository.

sergiimk commented 1 month ago

Note that when using an EthereumLogs source it is possible to end up with a root dataset that contains nested data. Until we fully support nested structs, we should detect this case and emit a clear warning.
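
As a rough sketch of the detection idea, and not kamu's actual implementation, the check could look like the Python below. It assumes records are available as plain dictionaries; `nested_fields` and `warn_if_nested` are hypothetical names introduced only for this example.

```python
import warnings


def nested_fields(record):
    """Return the sorted names of top-level fields holding nested data."""
    return sorted(
        key for key, value in record.items() if isinstance(value, (dict, list))
    )


def warn_if_nested(record, dataset_name):
    """Emit a warning if the record has nested fields; return their names."""
    fields = nested_fields(record)
    if fields:
        warnings.warn(
            f"Dataset {dataset_name} contains nested fields {fields}; "
            "nested structs are not fully supported yet."
        )
    return fields


sample = {
    "starred_at": "2020-07-05T01:34:55Z",
    "user": {"login": "onyalcin", "id": 7300802},
}

found = warn_if_nested(sample, "com.github.kamu-cli.stargazers")
```

A real implementation would more likely inspect the dataset schema than individual records, but the principle is the same: find non-scalar columns and report them by name.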