matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

[Bug]: load parquet file reported page has repetition or definition not support is not yet implemented #15606

Open heni02 opened 4 months ago

heni02 commented 4 months ago

Is there an existing issue for the same bug?

Branch Name

main

Commit ID

43ed769a01e02a59f4f7b1b48face8a0d6699e2a

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

mysql> load data infile {'filepath'='/Users/heni/Downloads/price.parquet','format'='parquet'} into table pg_weather;
ERROR 20102 (HY000): page has repetition or definition not support is not yet implemented

ddl: create table pg_weather(price bigint,area bigint,bedrooms bigint,bathrooms bigint,stories bigint,mainroad varchar(25),guestroom varchar(25),basement varchar(50),hotwaterheating varchar(50),airconditioning varchar(50),parking bigint,prefarea varchar(50),furnishingstatus varchar(50));

price.parquet: parquet.tar.gz

price.parquet file schema: price: int64, area: int64, bedrooms: int64, bathrooms: int64, stories: int64, mainroad: string, guestroom: string, basement: string, hotwaterheating: string, airconditioning: string, parking: int64, prefarea: string, furnishingstatus: string

Expected Behavior

No response

Steps to Reproduce

create table pg_weather(price bigint,area bigint,bedrooms bigint,bathrooms bigint,stories bigint,mainroad varchar(25),guestroom varchar(25),basement varchar(50),hotwaterheating varchar(50),airconditioning varchar(50),parking bigint,prefarea varchar(50),furnishingstatus varchar(50));

load data infile {'filepath'='/Users/heni/Downloads/price.parquet','format'='parquet'} into table pg_weather;

Additional information

No response

heni02 commented 4 months ago

Here is another case; please help locate the cause. @forsaken628

mysql> create table parquet_05(model varchar(50),mpg double, cyl int,disp double,hp int,drat double,wt double,qsec double,vs int,am int,gear int,carb int);
Query OK, 0 rows affected (0.03 sec)

mysql> load data infile {'filepath'='/Users/heni/test_data/parquet_data/mt_cars.parquet','format'='parquet'} into table parquet_05;
ERROR 20102 (HY000): page has repetition or definition not support is not yet implemented

mt_cars.parquet file schema: model: string, mpg: double, cyl: int32, disp: double, hp: int32, drat: double, wt: double, qsec: double, vs: int32, am: int32, gear: int32, carb: int32

mt_cars.parquet: car.tar.gz
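
For reference, the two reports above pair Parquet column types with table column types in the same way. The sketch below only records those four pairings as they appear in this thread. `sqlTypeFor` is a hypothetical helper written for illustration and is not part of MatrixOne; it makes no claim about which conversions the loader actually supports.

```go
package main

import (
	"fmt"
	"strings"
)

// sqlTypeFor returns the table column type that this thread's DDL uses for a
// given Parquet type name. Hypothetical helper: it covers only the four
// pairings seen in the two reports and nothing else.
func sqlTypeFor(parquetType string) (string, bool) {
	switch strings.ToLower(parquetType) {
	case "int32":
		return "int", true
	case "int64":
		return "bigint", true
	case "double":
		return "double", true
	case "string":
		return "varchar", true
	}
	return "", false
}

func main() {
	// A few columns from mt_cars.parquet and price.parquet as listed above.
	columns := [][2]string{
		{"model", "string"},
		{"mpg", "double"},
		{"cyl", "int32"},
		{"price", "int64"},
	}
	for _, c := range columns {
		sqlType, ok := sqlTypeFor(c[1])
		if !ok {
			fmt.Printf("%s: no pairing recorded for %s\n", c[0], c[1])
			continue
		}
		fmt.Printf("%s: %s -> %s\n", c[0], c[1], sqlType)
	}
}
```

Both tables follow these pairings, so the DDL itself does not look like the source of the error.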

heni02 commented 4 months ago

After discussing with 楠哥 (Nan), this cannot be completed in 1.2, so it is being moved to 1.3-backlog for now.

heni02 commented 4 months ago

commit: dc23dd174a336a248a1d3544494ac1b3a31c40b9

mysql> create table pg_weather(price bigint,area bigint,bedrooms bigint,bathrooms bigint,stories bigint,mainroad varchar(25),guestroom varchar(25),basement varchar(50),hotwaterheating varchar(50),airconditioning varchar(50),parking bigint,prefarea varchar(50),furnishingstatus varchar(50));
Query OK, 0 rows affected (0.04 sec)

mysql> load data infile {'filepath'='/Users/heni/Downloads/price.parquet','format'='parquet'} into table pg_weather;
ERROR 20101 (HY000): internal error: panic cannot convert values of type INT32 to type BYTE_ARRAY:
github.com/parquet-go/parquet-go/encoding.(Values).assertKind
        /Users/heni/go/pkg/mod/github.com/parquet-go/parquet-go@v0.20.1/encoding/values.go:56
github.com/parquet-go/parquet-go/encoding.(Values).ByteArray
        /Users/heni/go/pkg/mod/github.com/parquet-go/parquet-go@v0.20.1/encoding/values.go:109
github.com/matrixorigin/matrixone/pkg/sql/colexec/external.(*ParquetHandler).getMapper.func12
        /Users/heni/test-envir/matrixone/pkg/

forsaken628 commented 4 months ago

The file price.parquet is itself invalid, although the current implementation will not trigger this error. The program below shows the problem:

package main

import (
    "bytes"
    "fmt"
    "io"
    "os"

    "github.com/parquet-go/parquet-go"
    "github.com/parquet-go/parquet-go/format"
    "github.com/segmentio/encoding/thrift"
)

func main() {
    bs, err := os.ReadFile("price.parquet")
    if err != nil {
        panic(err)
    }

    buf := bytes.NewReader(bs)
    f, err := parquet.OpenFile(buf, buf.Size())
    if err != nil {
        panic(err)
    }

    col := f.Root().Column("mainroad")
    // col := f.Root().Column("furnishingstatus")
    colIdx := col.Index()

    // Read the file footer. This is just a Thrift unmarshal, so a bug here is unlikely.
    // https://github.com/apache/parquet-format/blob/master/README.md#file-format
    meta := f.Metadata().RowGroups[0].Columns[colIdx].MetaData
    fmt.Println(meta.DictionaryPageOffset, meta.DataPageOffset)
    // https://github.com/apache/parquet-format/blob/master/README.md#column-chunks
    // The dictionary page must be placed at the first position of the column chunk.
    // meta.DictionaryPageOffset = 5372: offset of the dictionary page
    // meta.DataPageOffset = 5387: offset of the first data page

    comp := thrift.CompactProtocol{}
    var de thrift.Decoder
    de.Reset(comp.NewReader(buf))

    buf.Seek(meta.DictionaryPageOffset, io.SeekStart)

    header := new(format.PageHeader)
    err = de.Decode(header)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", header)
    // Successfully read dict page header
    // &{Type:DICTIONARY_PAGE UncompressedPageSize:13 CompressedPageSize:15 CRC:0 ... }

    // skip page body
    buf.Seek(int64(header.CompressedPageSize), io.SeekCurrent)

    cur, _ := buf.Seek(0, io.SeekCurrent)
    fmt.Println(cur)
    // 5400
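    // This is where the second page (also the first data page) actually starts,
    // but 5400 != 5387 (meta.DataPageOffset), so the file is invalid.
    // 5372+15=5387, where 15 is the dict page body size, so the recorded offset
    // appears to leave out the page header. I think this is why the bug appears.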

    // Decode the header of the page that follows the dictionary page.
    header = new(format.PageHeader)
    err = de.Decode(header)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", header)
    // Successfully read data page header
    // &{Type:DATA_PAGE UncompressedPageSize:239 CompressedPageSize:171 CRC:0 ... }
}