Closed: space55 closed this issue 3 years ago
Lots of people have asked for exporting to parquet in the past, which I implemented. You're the first to ask about importing, but I had put it on my todo list in May. I won't have time to implement it soon. However, you can submit a PR.
Hmmm. I noticed in my TODO list (https://github.com/rocketlaunchr/dataframe-go/issues/17), there had been 3 thumbs up for that request.
In case it helps, here is some code I wrote to read a parquet file into a DataFrame that you may be able to adapt in the meantime:
package main
import (
"context"
"log"
"runtime"
dataframe "github.com/rocketlaunchr/dataframe-go"
"github.com/xitongsys/parquet-go-source/local"
"github.com/xitongsys/parquet-go/reader"
)
func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
// create local parquet/reader instances
entriesFr, err := local.NewLocalFileReader(inputParquet)
if err != nil {
log.Println("Can't open file", err)
return nil // nothing to read from, so stop here
}
entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))
if err != nil {
log.Println("Unable to create parquet reader", err)
entriesFr.Close()
return nil // avoid using a nil reader below
}
// determine number of rows in input parquet file
numRows := int64(entriesPr.GetNumRows())
// read columns from parquet and use them to construct a DataFrame instance of the
// same form
var paths, titles, bodies, accesscounts, accessdates, createddates, deadlines, priorities, archived []interface{}
paths, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.path", numRows)
titles, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.title", numRows)
bodies, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.body", numRows)
accesscounts, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.accesscount", numRows)
accessdates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.lastaccess", numRows)
createddates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.datecreated", numRows)
deadlines, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.deadline", numRows)
priorities, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.priority", numRows)
archived, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.archived", numRows)
entries := dataframe.NewDataFrame(
dataframe.NewSeriesString("path", nil, paths...),
dataframe.NewSeriesString("title", nil, titles...),
dataframe.NewSeriesString("body", nil, bodies...),
dataframe.NewSeriesInt64("accesscount", nil, accesscounts...),
dataframe.NewSeriesInt64("lastaccess", nil, accessdates...),
dataframe.NewSeriesInt64("datecreated", nil, createddates...),
dataframe.NewSeriesInt64("deadline", nil, deadlines...),
dataframe.NewSeriesInt64("priority", nil, priorities...),
dataframe.NewSeriesInt64("archived", nil, archived...),
)
entriesPr.ReadStop()
entriesFr.Close()
// sort entries by date of creation
sortKey := []dataframe.SortKey{
{Key: "datecreated", Desc: true},
}
ctx := context.Background()
entries.Sort(ctx, sortKey)
return entries
}
Few comments:
Cheers.
Thanks @khughitt. I need to generalise it so that it works for any parquet data.
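One way the generalisation could start is sketched below: read an arbitrary list of columns in a loop, check the error each read returns (the snippet above discards it), and look at the Go type of the first non-nil value to decide which series constructor fits. This is only a sketch under assumptions. `columnReader`, `readColumns`, `goKind` and `fakeReader` are hypothetical names, and the interface assumes `ReadColumnByPath` has the four return values used above, with `[]int32` for the middle two. The fake reader stands in for a real parquet file so the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// columnReader is a hypothetical interface covering the single method the
// snippet above calls. The []int32 return types are an assumption.
type columnReader interface {
	ReadColumnByPath(path string, num int64) ([]interface{}, []int32, []int32, error)
}

// readColumns reads every named column under prefix and stops at the first error.
func readColumns(r columnReader, prefix string, names []string, numRows int64) (map[string][]interface{}, error) {
	cols := make(map[string][]interface{}, len(names))
	for _, name := range names {
		vals, _, _, err := r.ReadColumnByPath(prefix+"."+name, numRows)
		if err != nil {
			return nil, fmt.Errorf("reading column %q: %w", name, err)
		}
		cols[name] = vals
	}
	return cols, nil
}

// goKind reports the Go type of the first non-nil value, e.g. "string" or
// "float64", or "unknown" when every value is nil.
func goKind(vals []interface{}) string {
	for _, v := range vals {
		if v != nil {
			return fmt.Sprintf("%T", v)
		}
	}
	return "unknown"
}

// fakeReader is an in-memory stand-in so the sketch runs without a parquet file.
type fakeReader map[string][]interface{}

func (f fakeReader) ReadColumnByPath(path string, num int64) ([]interface{}, []int32, []int32, error) {
	vals, ok := f[path]
	if !ok {
		return nil, nil, nil, errors.New("no such column: " + path)
	}
	return vals, nil, nil, nil
}

func main() {
	r := fakeReader{
		"parquet_go_root.name":  {"a", "b"},
		"parquet_go_root.spill": {nil, 1.5},
	}
	names := []string{"name", "spill"}
	cols, err := readColumns(r, "parquet_go_root", names, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, name := range names {
		fmt.Println(name, goKind(cols[name]))
	}
}
```

With a real file, the reported type would tell you whether a column belongs in a string, int64 or float64 series before you build the DataFrame.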
@pjebs Did you manage to generalise it? Can't get it to work, getting this error.
@CeciliaCoelho can you show me your code.
I was actually waiting for a response to these Qs: https://github.com/xitongsys/parquet-go/issues/360
I'm getting a new error now. The file was a CSV that I converted to parquet using Python, and I want to open and use it in Go for efficiency. The CSV was like this:
I have this code:
package main
import (
"context"
"log"
"runtime"
dataframe "github.com/rocketlaunchr/dataframe-go"
"github.com/xitongsys/parquet-go-source/local"
"github.com/xitongsys/parquet-go/reader"
)
func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
// create local parquet/reader instances
entriesFr, err := local.NewLocalFileReader(inputParquet)
if err != nil {
log.Println("Can't open file")
}
entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))
if err != nil {
log.Println("Unable to create parquet reader", err)
}
// determine number of rows in input parquet file
numRows := int64(entriesPr.GetNumRows())
// read columns from parquet and use them to construct a DataFrame instance of the
// same form
var id, name, res, spill, turb, pump []interface{}
id, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.id", numRows)
name, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.name", numRows)
res, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.res", numRows)
spill, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.spill", numRows)
turb, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.turb", numRows)
pump, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.pump", numRows)
entries := dataframe.NewDataFrame(
dataframe.NewSeriesString("id", nil, id...),
dataframe.NewSeriesString("name", nil, name...),
dataframe.NewSeriesString("res", nil, res...),
dataframe.NewSeriesInt64("spill", nil, spill...),
dataframe.NewSeriesInt64("turb", nil, turb...),
dataframe.NewSeriesInt64("pump", nil, pump...),
)
entriesPr.ReadStop()
entriesFr.Close()
// sort entries by date of creation
sortKey := []dataframe.SortKey{
{Key: "datecreated", Desc: true},
}
ctx := context.Background()
entries.Sort(ctx, sortKey)
return entries
}
func main() {
loadEntriesParquet("cascades2.parquet")
}
Now the error I'm getting is this:
The "new" error is because you are trying to sort based on a Series called "datecreated" but you don't have such a field.
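A mismatch like this can be caught before sorting by comparing the sort keys against the series names. The sketch below uses plain string slices so it runs on its own; `missingKeys` is a hypothetical helper, not part of dataframe-go, and with a real DataFrame the names would have to come from the DataFrame itself (check the library docs for the accessor).

```go
package main

import "fmt"

// missingKeys returns the sort keys that do not match any series name.
func missingKeys(seriesNames, sortKeys []string) []string {
	known := make(map[string]bool, len(seriesNames))
	for _, n := range seriesNames {
		known[n] = true
	}
	var missing []string
	for _, k := range sortKeys {
		if !known[k] {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// Series names taken from the code posted above.
	names := []string{"id", "name", "res", "spill", "turb", "pump"}

	fmt.Println(missingKeys(names, []string{"datecreated"})) // prints [datecreated]
	fmt.Println(missingKeys(names, []string{"id"}))          // prints []
}
```

If the returned slice is non-empty, skip the sort or report the unknown keys instead of passing them on.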
Oh right, didn't notice that. It's running now. Thanks :) How do I print or access the dataframe? (bet it's a stupid question, sorry I'm a Golang newbie)
The function returns a *dataframe.DataFrame object. You can see examples in the README.
However, looking at the code, it isn't efficient at loading the data into the DataFrame. I need to understand the parquet package better before I can improve it.
Parquet importing is now supported (experimental): @CeciliaCoelho @khughitt @space55
Hello,
Are there any plans to support reading a Parquet file into a dataframe? I have a need for this and am evaluating this library to use in an application.
Thanks!