tableau / hyper-api-samples

Sample code to get started with the Hyper API.
https://help.tableau.com/current/api/hyper_api/en-us/index.html
MIT License

Why does querying this Parquet file report "Scanning of nested columns in Parquet files is disabled"? #102

Closed (l1t1 closed 1 year ago)

l1t1 commented 1 year ago

Data: https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet (7,134,977,202 bytes).

I use this Python script from https://github.com/tableau/hyper-api-samples/pull/78, modified to show a timer:

#!/usr/bin/env python3
import readline  # imported for its side effect: line editing and history in input()
import time
from argparse import ArgumentParser
from tableauhyperapi import HyperProcess, Connection, Telemetry, CreateMode, HyperException
# hyperapi-cli: an interactive HyperAPI SQL CLI.
#
# This script lets you execute SQL commands interactively via HyperAPI.
#
# Usage:
#   ./hyperapi-cli.py [optional hyper database file]

def main():
    parser = ArgumentParser("HyperAPI interactive cli.")
    parser.add_argument("database", type=str, nargs='?',
                        help="A Hyper file to attach on startup")

    args = parser.parse_args()
    create_mode = CreateMode.CREATE_IF_NOT_EXISTS if args.database else CreateMode.NONE

    with HyperProcess(Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper_process:
        try:
            with Connection(hyper_process.endpoint, args.database, create_mode) as connection:
                while True:
                    try:
                        sql = input("> ")
                    except (EOFError, KeyboardInterrupt):
                        return
                    try:
                        t=time.time()
                        with connection.execute_query(sql) as result:
                            print("\t".join(str(column.name)
                                  for column in result.schema.columns))
                            for row in result:
                                print("\t".join(str(column) for column in row))
                        print(round(time.time()-t,3),"s\n")
                    except HyperException as exception:
                        print(f"Error executing SQL: {exception}")
        except HyperException as exception:
            print(f"Unable to connect to the database: {exception}")

if __name__ == "__main__":
    main()

Query result:

> select count(*) from external('./hacknernews.parquet');
"count"
28737557
0.779 s

> select * from external('./hacknernews.parquet') limit 1;
Error executing SQL: Scanning of nested columns in Parquet files is disabled.
Hint: Do not select group column kids when scanning the file
Context: 0xfa6b0e2f

DuckDB can run select * on the same file:

D describe select * from 'd:/hacknernews.parquet';
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id          │ BIGINT      │ YES     │         │         │         │
│ deleted     │ UTINYINT    │ YES     │         │         │         │
│ type        │ BLOB        │ YES     │         │         │         │
│ by          │ BLOB        │ YES     │         │         │         │
│ time        │ BIGINT      │ YES     │         │         │         │
│ text        │ BLOB        │ YES     │         │         │         │
│ dead        │ UTINYINT    │ YES     │         │         │         │
│ parent      │ BIGINT      │ YES     │         │         │         │
│ poll        │ BIGINT      │ YES     │         │         │         │
│ kids        │ BIGINT[]    │ YES     │         │         │         │
│ url         │ BLOB        │ YES     │         │         │         │
│ score       │ INTEGER     │ YES     │         │         │         │
│ title       │ BLOB        │ YES     │         │         │         │
│ parts       │ BIGINT[]    │ YES     │         │         │         │
│ descendants │ INTEGER     │ YES     │         │         │         │
├─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 15 rows                                                 6 columns │
└───────────────────────────────────────────────────────────────────┘
l1t1 commented 1 year ago

I found the answer in the documentation at https://tableau.github.io/hyper-db/docs/sql/external/formats#external-format-parquet: "Nested columns and therefore the nested types MAP and LIST are not supported."
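
Following the hint in the error message, one possible workaround is to name the non-nested columns explicitly instead of using select *. Below is a minimal sketch, assuming the column names from the DuckDB output above (where kids and parts are the BIGINT[] columns) and assuming Hyper does not touch nested columns that are not selected. `quote_ident` and `build_flat_select` are hypothetical helpers written for illustration, not part of the Hyper API, and the query itself has not been verified against this file.

```python
def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_flat_select(path, columns, nested, limit=None):
    """Build a SELECT over external(path) that leaves out the nested columns."""
    flat = [c for c in columns if c not in nested]
    if not flat:
        raise ValueError("no non-nested columns left to select")
    # Single-quote the path as an SQL string literal.
    literal = "'" + path.replace("'", "''") + "'"
    column_list = ", ".join(quote_ident(c) for c in flat)
    sql = f"select {column_list} from external({literal})"
    if limit is not None:
        sql += f" limit {int(limit)}"
    return sql


# Column names as reported by DuckDB's describe (assumption: Hyper sees the same names).
COLUMNS = ["id", "deleted", "type", "by", "time", "text", "dead", "parent",
           "poll", "kids", "url", "score", "title", "parts", "descendants"]
# The two list-typed columns that trigger the nested-column error.
NESTED = {"kids", "parts"}

sql = build_flat_select("./hacknernews.parquet", COLUMNS, NESTED, limit=1)
print(sql)
```

The printed statement can be pasted at the `> ` prompt of the script above. Every identifier is quoted because names such as by, time and type may clash with SQL keywords.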