rocketlaunchr / dataframe-go

DataFrames for Go: For statistics, machine-learning, and data manipulation/exploration
Other
1.16k stars, 93 forks

Error to read parquet with latest parquet-go #61

Open tanyaofei opened 2 years ago

tanyaofei commented 2 years ago

1. Create a file with Python pandas:

    import pandas

    dataframe = pandas.DataFrame({
        "A": ["a", "b", "c", "d"],
        "B": [2, 3, 4, 1],
        "C": [10, 20, None, None]
    })
    dataframe.to_parquet("1.parquet")

This file looks like: 
<img width="232" alt="image" src="https://user-images.githubusercontent.com/46375450/160764592-2c226509-c54b-48aa-853e-39ec7eb7412c.png">

2. Read this file
```go
package main

import (
    "context"
    "fmt"
    "github.com/rocketlaunchr/dataframe-go/imports"
    "github.com/xitongsys/parquet-go-source/local"
)

func main() {
    ctx := context.Background()
    fr, _ := local.NewLocalFileReader("1.parquet")
    df, err := imports.LoadFromParquet(ctx, fr)
    if err != nil {
        panic(err)
    }
    fmt.Println(df)
}
```

3. Got a unique name error:

    panic: names of series must be unique: 

    goroutine 1 [running]:
    github.com/rocketlaunchr/dataframe-go.NewDataFrame({0xc0001f8000, 0x3, 0xc000149a10?})
            .../rocketlaunchr/dataframe-go@v0.0.0-20211025052708-a1030444159b/dataframe.go:41 +0x33c
    github.com/rocketlaunchr/dataframe-go/imports.LoadFromParquet({0x1497868, 0xc000020080}, {0x1498150?, 0xc00000e798?}, {0xc0000021a0?, 0xc000149f70?, 0x1007599?})
            .../go/pkg/mod/github.com/rocketlaunchr/dataframe-go@v0.0.0-20211025052708-a1030444159b/imports/parquet.go:110 +0x8ae
    main.main()
            .../main.go:13 +0x78


4. Following the stack, I found some useful information:
+ All series in `imports.LoadFromParquet` have empty names.
<img width="807" alt="image" src="https://user-images.githubusercontent.com/46375450/160765546-8e578679-b4b1-47bf-8e2c-0333e7fd8e49.png">

+ `goFieldNameToActual`:
Every key in this map has the prefix "Scheme", but `goName` does not. That may be why no name can be found in this map.
<img width="345" alt="image" src="https://user-images.githubusercontent.com/46375450/160765786-a9bbedbe-5e52-4c0b-8065-b50d4be87b9d.png">
<img width="496" alt="image" src="https://user-images.githubusercontent.com/46375450/160766007-fb887901-6811-46a7-8f6e-248e1c4cacfd.png">
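
The suspected mismatch can be illustrated with a small, self-contained sketch. It does not use dataframe-go or parquet-go. The map contents, the exact form of the "Scheme" keys, and the helper names `lookupNames` and `checkUnique` are hypothetical stand-ins that mirror the description above. It only shows how a lookup that misses on every key yields empty names, which then collide in a uniqueness check.

```go
package main

import (
	"errors"
	"fmt"
)

// lookupNames resolves each Go field name through the map.
// A missing key yields the zero value "", as an unchecked map read does.
func lookupNames(goFieldNameToActual map[string]string, goNames []string) []string {
	names := make([]string, 0, len(goNames))
	for _, goName := range goNames {
		names = append(names, goFieldNameToActual[goName])
	}
	return names
}

// checkUnique returns an error as soon as a name repeats.
func checkUnique(names []string) error {
	seen := make(map[string]struct{}, len(names))
	for _, n := range names {
		if _, dup := seen[n]; dup {
			return errors.New("names of series must be unique: " + n)
		}
		seen[n] = struct{}{}
	}
	return nil
}

func main() {
	// Hypothetical keys: every key carries a "Scheme" prefix.
	m := map[string]string{"SchemeA": "A", "SchemeB": "B", "SchemeC": "C"}

	// Lookups without the prefix all miss, so every name is empty.
	names := lookupNames(m, []string{"A", "B", "C"})
	fmt.Printf("%q -> %v\n", names, checkUnique(names))

	// Lookups with the prefix succeed and the names are distinct.
	names = lookupNames(m, []string{"SchemeA", "SchemeB", "SchemeC"})
	fmt.Printf("%q -> %v\n", names, checkUnique(names))
}
```

Three empty names fail the uniqueness check with an empty name in the message, which matches the shape of the panic text above. Whether this is what actually happens inside `imports.LoadFromParquet` is not confirmed here.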

This is the first time I have used Go to read Parquet files. Is this error caused by breaking changes in parquet-go, or by something else?
pjebs commented 2 years ago

Can you send me the file

tanyaofei commented 2 years ago

> Can you send me the file

1.parquet.zip

pjebs commented 2 years ago

Can you create the DataFrame from this package, export it to Parquet, and then try to import it back?

tanyaofei commented 2 years ago

> Can you create the DataFrame from this package, export it to Parquet, and then try to import it back?

I tried that first. The result looks like a broken Parquet file whose only content is "PAR1":

func main() {
    df := dataframe.NewDataFrame(dataframe.NewSeriesString("A", nil, []string{"1", "2", "3"}))
    file, _ := os.Create("1.parquet")
    _ = exports.ExportToParquet(context.Background(), file, df)
}
image
pjebs commented 2 years ago

A Parquet file is not text based. Can you try importing the file back.

tanyaofei commented 2 years ago

> A Parquet file is not text based. Can you try importing the file back.

    df := dataframe.NewDataFrame(dataframe.NewSeriesString("A", nil, []string{"1", "2", "3"}))
    file, _ := os.Create("1.parquet")
    _ = exports.ExportToParquet(context.Background(), file, df)

    fr, _ := local.NewLocalFileReader("1.parquet")
    df, err := imports.LoadFromParquet(context.Background(), fr)
    if err != nil {
        panic(err)
    }
    fmt.Println(df)

OUTPUT:

    panic: seek 1.parquet: invalid argument

    goroutine 1 [running]:
    main.main()
            .../main.go:21 +0x465
    Exiting.

Error at `imports/parquet.go`, line 40: `pr, err := reader.NewParquetReader(src, nil, int64(runtime.NumCPU()))`

tanyaofei commented 2 years ago

> A Parquet file is not text based. Can you try importing the file back.

My parquet-go version is v1.6.2: `github.com/xitongsys/parquet-go v1.6.2`

pjebs commented 2 years ago

I tried opening your file and it worked:

package main

import (
    "context"
    "fmt"

    "github.com/rocketlaunchr/dataframe-go/imports"
    "github.com/xitongsys/parquet-go-source/local"
)

var ctx = context.Background()

func main() {
    fr, _ := local.NewLocalFileReader("1.parquet")
    defer fr.Close()

    df, err := imports.LoadFromParquet(ctx, fr)
    if err != nil {
        panic(err)
    }

    fmt.Println(df)
}

OUTPUT:

+-----+--------+-------+---------+
|     |   A    |   B   |    C    |
+-----+--------+-------+---------+
| 0:  |   a    |   2   |   10    |
| 1:  |   b    |   3   |   20    |
| 2:  |   c    |   4   |   NaN   |
| 3:  |   d    |   1   |   NaN   |
+-----+--------+-------+---------+
| 4X3 | STRING | INT64 | FLOAT64 |
+-----+--------+-------+---------+
tanyaofei commented 2 years ago

> I tried opening your file and it worked:

Can you tell me your parquet-go version?

pjebs commented 2 years ago
module main

go 1.18

require (
    github.com/rocketlaunchr/dataframe-go v0.0.0-00010101000000-000000000000
    github.com/xitongsys/parquet-go-source v0.0.0-20200509081216-8db33acb0acf
)

require (
    github.com/apache/thrift v0.0.0-20181112125854-24918abba929 // indirect
    github.com/goccy/go-json v0.7.6 // indirect
    github.com/golang/snappy v0.0.0-20180518054509-2e65f85255db // indirect
    github.com/google/go-cmp v0.4.0 // indirect
    github.com/guptarohit/asciigraph v0.5.1 // indirect
    github.com/juju/clock v0.0.0-20190205081909-9c5c9712527c // indirect
    github.com/juju/errors v0.0.0-20200330140219-3fe23663418f // indirect
    github.com/juju/loggo v0.0.0-20200526014432-9ce3a2e09b5e // indirect
    github.com/juju/utils/v2 v2.0.0-20200923005554-4646bfea2ef1 // indirect
    github.com/klauspost/compress v1.9.7 // indirect
    github.com/mattn/go-runewidth v0.0.7 // indirect
    github.com/olekukonko/tablewriter v0.0.4 // indirect
    github.com/rocketlaunchr/mysql-go v1.1.3 // indirect
    github.com/xitongsys/parquet-go v1.5.2 // indirect
    golang.org/x/crypto v0.0.0-20200820211705-5c72a883971a // indirect
    golang.org/x/exp v0.0.0-20200331195152-e8c3332aa8e5 // indirect
    golang.org/x/net v0.0.0-20200904194848-62affa334b73 // indirect
    golang.org/x/sync v0.0.0-20200317015054-43a5402ce75a // indirect
    gopkg.in/yaml.v2 v2.3.0 // indirect
)
tanyaofei commented 2 years ago

With `github.com/apache/thrift v0.0.0-20181112125854-24918abba929` and `github.com/xitongsys/parquet-go v1.5.2`, it works.

pjebs commented 2 years ago

In the release notes for [v1.6.0](https://github.com/xitongsys/parquet-go/releases/tag/v1.6.0):

> Big changes in the type. Not compatiable with before.

I may need to update this package to use v1.6+ instead of v1.5.

I have no idea why it is not using v1.5 for you, since that version is registered in the go.mod file.

tanyaofei commented 2 years ago

> No idea why it is not using v1.5 for you since it's registered in the go.mod file.

v1.5 works fine. Maybe I installed parquet-go before installing dataframe-go; I'm not sure.

tanyaofei commented 2 years ago

It seems the problem is solved, so I should close this issue.

pjebs commented 2 years ago

Maybe you directly imported `github.com/rocketlaunchr/dataframe-go/imports` without importing `github.com/rocketlaunchr/dataframe-go`. Since there is no go.mod file inside the `github.com/rocketlaunchr/dataframe-go/imports` directory, Go just downloaded and used the latest version of parquet-go.

tanyaofei commented 2 years ago

> Maybe you directly imported "github.com/rocketlaunchr/dataframe-go/imports" without importing "github.com/rocketlaunchr/dataframe-go". Since there is no go.mod file inside github.com/rocketlaunchr/dataframe-go/imports directory, it just downloaded and used the latest version of parquet-go

Here are my shell records:

➜  go get -u github.com/rocketlaunchr/dataframe-go
go: downloading github.com/rocketlaunchr/dataframe-go v0.0.0-20211025052708-a1030444159b
go: downloading golang.org/x/exp v0.0.0-20200331195152-e8c3332aa8e5
go: downloading github.com/google/go-cmp v0.4.0
go: downloading github.com/guptarohit/asciigraph v0.5.1
go: downloading github.com/olekukonko/tablewriter v0.0.4
go: downloading golang.org/x/sync v0.0.0-20200317015054-43a5402ce75a
go: downloading github.com/olekukonko/tablewriter v0.0.5
go: downloading github.com/google/go-cmp v0.5.7
go: downloading github.com/mattn/go-runewidth v0.0.7
go: downloading github.com/mattn/go-runewidth v0.0.13
go: downloading golang.org/x/exp v0.0.0-20220328175248-053ad81199eb
go: downloading github.com/guptarohit/asciigraph v0.5.3
go: downloading github.com/rivo/uniseg v0.2.0
go: added github.com/google/go-cmp v0.5.7
go: added github.com/guptarohit/asciigraph v0.5.3
go: added github.com/mattn/go-runewidth v0.0.13
go: added github.com/olekukonko/tablewriter v0.0.5
go: added github.com/rivo/uniseg v0.2.0
go: added github.com/rocketlaunchr/dataframe-go v0.0.0-20211025052708-a1030444159b
go: added golang.org/x/exp v0.0.0-20220328175248-053ad81199eb
go: added golang.org/x/sync v0.0.0-20210220032951-036812b2e83c
➜  go get -u github.com/xitongsys/parquet-go/parquet                                     
go: downloading github.com/apache/thrift v0.16.0
go: upgraded github.com/apache/thrift v0.0.0-20181112125854-24918abba929 => v0.16.0
go: upgraded github.com/xitongsys/parquet-go v1.5.2 => v1.6.2
➜  go get -u github.com/xitongsys/parquet-go-source                                       
go: downloading github.com/xitongsys/parquet-go-source v0.0.0-20220315005136-aec0fe3e777c
go: upgraded github.com/xitongsys/parquet-go-source v0.0.0-20200817004010-026bad9b25d0 => v0.0.0-20220315005136-aec0fe3e777c
pjebs commented 2 years ago

You shouldn't have run the last two `go get` commands. Since they don't have a go.mod file, Go just assumed the latest version, hence: `go: upgraded github.com/xitongsys/parquet-go v1.5.2 => v1.6.2`

pjebs commented 2 years ago

From Go's point of view, when you do that, it's an unrelated package.

tanyaofei commented 2 years ago

> You shouldn't have done the last 2 go gets since they don't have a go.mod file so it just assumed the latest version hence: go: upgraded github.com/xitongsys/parquet-go v1.5.2 => v1.6.2

Got it, thanks a lot.

chippyash commented 2 years ago

Hi - when is this lib going to be upgraded to use >= v1.6.2 of parquet-go, please? Having to pin v1.5.4 just broke all the tagging I was using, which assumed v1.6.2 :-(

pjebs commented 2 years ago

There is a backward-incompatible change in v1.6.2. Therefore I will need to explore it more deeply.

This package's go.mod is set to `github.com/xitongsys/parquet-go v1.5.2`, so it should work for you provided you don't try to independently `go get` the `github.com/rocketlaunchr/dataframe-go/imports` package.

Let the main package dictate the dependencies for the sub-packages.
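
Why an explicit upgrade wins over the version pinned by this package can be sketched with the rule Go's minimal version selection applies to each module: among all versions required anywhere in the build graph, the highest one is used. The sketch below is a simplification under stated assumptions. It handles only plain `vMAJOR.MINOR.PATCH` strings and ignores pre-release tags and pseudo-versions, and the names `parseVersion`, `less`, and `selectVersion` are illustrative, not part of the Go toolchain.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "vMAJOR.MINOR.PATCH" into three integers.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unsupported version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unsupported version %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// less reports whether version a sorts before version b.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// selectVersion returns the highest of the required versions.
func selectVersion(required []string) (string, error) {
	if len(required) == 0 {
		return "", fmt.Errorf("no versions required")
	}
	best := required[0]
	bestParsed, err := parseVersion(best)
	if err != nil {
		return "", err
	}
	for _, v := range required[1:] {
		p, err := parseVersion(v)
		if err != nil {
			return "", err
		}
		if less(bestParsed, p) {
			best, bestParsed = v, p
		}
	}
	return best, nil
}

func main() {
	// dataframe-go's go.mod requires v1.5.2; after `go get -u`, the
	// main module's go.mod requires v1.6.2.
	v, err := selectVersion([]string{"v1.5.2", "v1.6.2"})
	fmt.Println(v, err)
}
```

Under this rule, once the main module's go.mod requires parquet-go v1.6.2, the v1.5.2 requirement in dataframe-go's go.mod no longer decides the build.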