rocketlaunchr / dataframe-go

DataFrames for Go: For statistics, machine-learning, and data manipulation/exploration

- Address issue: https://github.com/rocketlaunchr/dataframe-go/issues… #63

Open pjebs opened 2 years ago

pjebs commented 2 years ago

…/62 (Chinese characters and python BOM prefix)

tanyaofei commented 2 years ago

Reading the file with UTF-8-BOM encoding works, and exporting to Parquet produces no errors, but the Parquet file can NOT be read back.

func TestUTF8CSV(t *testing.T) {
    fr, err := os.Open("export.csv")
    if err != nil {
        panic(err)
    }

    df, err := imports.LoadFromCSV(context.Background(), fr)
    if err != nil {
        panic(err)
    }

    out, err := os.Create("export.parquet")
    if err != nil {
        panic(err)
    }

    err = exports.ExportToParquet(context.Background(), out, df)
    if err != nil {
        panic(err)
    }

    out.Close()

    // Read the exported file back through the parquet-go-source local reader.
    source, err := local.NewLocalFileReader("export.parquet")
    if err != nil {
        panic(err)
    }
    df, err = imports.LoadFromParquet(context.Background(), source)
    if err != nil {
        panic(err)
    }
    fmt.Println(df)

}
=== RUN   TestUTF8CSV
--- FAIL: TestUTF8CSV (0.02s)
panic: [NextRowGroup] Column not found: Parquet_go_root.P_231188150229143183 [recovered]
    panic: [NextRowGroup] Column not found: Parquet_go_root.P_231188150229143183

goroutine 14 [running]:
testing.tRunner.func1.2({0x13b8d60, 0xc0006af1c0})
    /usr/local/opt/go/libexec/src/testing/testing.go:1389 +0x24e
testing.tRunner.func1()
    /usr/local/opt/go/libexec/src/testing/testing.go:1392 +0x39f
panic({0x13b8d60, 0xc0006af1c0})
    /usr/local/opt/go/libexec/src/runtime/panic.go:838 +0x207
github.com/rocketlaunchr/dataframe-go/aa.TestUTF8CSV(0x0?)
    .../dataframe-go/aa/utf8_csv_test.go:43 +0x1d7
testing.tRunner(0xc0005c9d40, 0x1444b28)
    /usr/local/opt/go/libexec/src/testing/testing.go:1439 +0x102
created by testing.(*T).Run
    /usr/local/opt/go/libexec/src/testing/testing.go:1486 +0x35f

tanyaofei commented 2 years ago

export.parquet.zip

pjebs commented 2 years ago

Can you read it back in python to check if the output file is valid?

tanyaofei commented 2 years ago

"Can you read it back in Python?"

I don't think so, because the IDEA plugin Big Data Tools shows "Nothing to show".

Here is the output of my Python script:

       编号    年龄    性别    地区  身高cm  体重kg  ... 吃零食情况  跑步情况 玩电脑游戏情况  逛街情况  散步情况  夜宵情况
0    None  None  None  None  None  None  ...  None  None    None  None  None  None
1    None  None  None  None  None  None  ...  None  None    None  None  None  None
2    None  None  None  None  None  None  ...  None  None    None  None  None  None
3    None  None  None  None  None  None  ...  None  None    None  None  None  None
4    None  None  None  None  None  None  ...  None  None    None  None  None  None
..    ...   ...   ...   ...   ...   ...  ...   ...   ...     ...   ...   ...   ...
446  None  None  None  None  None  None  ...  None  None    None  None  None  None
447  None  None  None  None  None  None  ...  None  None    None  None  None  None
448  None  None  None  None  None  None  ...  None  None    None  None  None  None
449  None  None  None  None  None  None  ...  None  None    None  None  None  None
450  None  None  None  None  None  None  ...  None  None    None  None  None  None

[451 rows x 21 columns]

pjebs commented 2 years ago

I wonder whether, when you used the pull-request branch, it was using the latest (incompatible) version of the Parquet parsing package?

tanyaofei commented 2 years ago

"I wonder whether, when you used the pull-request branch, it was using the latest (incompatible) version of the Parquet parsing package?"

I am sure I am using github.com/xitongsys/parquet-go v1.5.2 and github.com/xitongsys/parquet-go-source v0.0.0-20200509081216-8db33acb0acf

pjebs commented 2 years ago

When you tried s.Rename("X" + strings.Trim(s.Name(), "\xEF\xBB\xBF")), could you read the exported parquet file back in python?
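
For reference, the BOM-stripping step on its own can be sketched with the standard library only. This is a minimal sketch, not the library's implementation: `cleanName` and `utf8BOM` are hypothetical names, and the header slice stands in for the series names that the CSV import would produce.

```go
package main

import (
	"fmt"
	"strings"
)

// utf8BOM is the UTF-8 encoding of U+FEFF, the byte order mark that some
// tools write at the start of a CSV file.
const utf8BOM = "\xEF\xBB\xBF"

// cleanName (hypothetical helper) removes a single leading BOM from a
// column name and leaves every other character untouched.
func cleanName(name string) string {
	return strings.TrimPrefix(name, utf8BOM)
}

func main() {
	// Stand-in for the header row of a BOM-prefixed CSV file: only the
	// first column name carries the BOM.
	header := []string{utf8BOM + "编号", "年龄", "性别"}

	for i, name := range header {
		header[i] = cleanName(name)
	}

	// %q makes any remaining invisible characters easy to spot.
	fmt.Printf("%q\n", header)
}
```

`strings.Trim(name, "\xEF\xBB\xBF")` behaves similarly here, because the cutset decodes to the single rune U+FEFF, but it also strips that rune from the end of the name; `strings.TrimPrefix` only touches the start.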