rocketlaunchr / dataframe-go

DataFrames for Go: For statistics, machine-learning, and data manipulation/exploration

Reading from Parquet #46

Closed space55 closed 3 years ago

space55 commented 3 years ago

Hello,

Are there any plans to support reading a Parquet file into a dataframe? I have a need for this and am evaluating this library to use in an application.

Thanks!

pjebs commented 3 years ago

In the past, lots of people have asked for exporting to Parquet, which I implemented. You're the first to ask about importing, but I had put it on my TODO list in May. I won't have time to implement it soon. However, you can submit a PR.

pjebs commented 3 years ago

Hmmm. I noticed in my TODO list (https://github.com/rocketlaunchr/dataframe-go/issues/17), there had been 3 thumbs up for that request.

khughitt commented 3 years ago

In case it helps, here is some code I wrote to read a parquet file into a DataFrame that you may be able to adapt in the meantime:

package main

import (
    "context"
    "log"
    "runtime"

    dataframe "github.com/rocketlaunchr/dataframe-go"
    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/reader"
)

func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
    // create local parquet/reader instances
    entriesFr, err := local.NewLocalFileReader(inputParquet)

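    if err != nil {
        log.Println("Can't open file", err)
        return nil // without a file reader there is nothing to load
    }

    entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))

    if err != nil {
        log.Println("Unable to create parquet reader", err)
        entriesFr.Close()
        return nil // avoid calling methods on an unusable reader
    }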

    // determine number of rows in input parquet file
    numRows := int64(entriesPr.GetNumRows())

    // read columns from parquet and use them to construct a DataFrame instance of the
    // same form
    var paths, titles, bodies, accesscounts, accessdates, createddates, deadlines, priorities, archived []interface{}

    paths, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.path", numRows)
    titles, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.title", numRows)
    bodies, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.body", numRows)
    accesscounts, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.accesscount", numRows)
    accessdates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.lastaccess", numRows)
    createddates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.datecreated", numRows)
    deadlines, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.deadline", numRows)
    priorities, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.priority", numRows)
    archived, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.archived", numRows)

    entries := dataframe.NewDataFrame(
        dataframe.NewSeriesString("path", nil, paths...),
        dataframe.NewSeriesString("title", nil, titles...),
        dataframe.NewSeriesString("body", nil, bodies...),
        dataframe.NewSeriesInt64("accesscount", nil, accesscounts...),
        dataframe.NewSeriesInt64("lastaccess", nil, accessdates...),
        dataframe.NewSeriesInt64("datecreated", nil, createddates...),
        dataframe.NewSeriesInt64("deadline", nil, deadlines...),
        dataframe.NewSeriesInt64("priority", nil, priorities...),
        dataframe.NewSeriesInt64("archived", nil, archived...),
    )

    entriesPr.ReadStop()
    entriesFr.Close()

    // sort entries by date of creation
    sortKey := []dataframe.SortKey{
        {Key: "datecreated", Desc: true},
    }

    ctx := context.Background()
    entries.Sort(ctx, sortKey)

    return entries
}

Few comments:

  1. I can't claim that this is the most efficient approach, and feedback is welcome, but it should at least do the job.
  2. The function loads a parquet dataframe containing "entries", with an expected format. I left a lot of the file-specific logic in there to provide examples of how to handle different variable types.
  3. I also left some logic at the bottom to sort the dataframe once it's been loaded, in case that is useful.

Cheers.

pjebs commented 3 years ago

Thanks @khughitt. I need to generalise it so that it works for any Parquet data.
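One piece of that generalisation is picking a series type per column without hard-coding it. The snippet above receives each column as a []interface{}, so a rough sketch is to inspect the values and infer a type from them. This is only an illustration: the names columnKind, classify and inferKind are hypothetical, the sample data is made up, and it assumes the column values arrive as plain string, int64 or float64 (which depends on the Parquet library and the file's schema).

```go
package main

import "fmt"

// columnKind is a hypothetical label for the series type a column could use.
type columnKind string

const (
	kindUnknown columnKind = "unknown" // empty column, or only nil values
	kindString  columnKind = "string"
	kindInt64   columnKind = "int64"
	kindFloat64 columnKind = "float64"
	kindOther   columnKind = "other" // mixed or unsupported value types
)

// classify maps a single non-nil value to a kind.
func classify(v interface{}) columnKind {
	switch v.(type) {
	case string:
		return kindString
	case int64:
		return kindInt64
	case float64:
		return kindFloat64
	default:
		return kindOther
	}
}

// inferKind scans a column and returns the one kind shared by all of its
// non-nil values, or kindOther if the values disagree or are unsupported.
func inferKind(vals []interface{}) columnKind {
	kind := kindUnknown
	for _, v := range vals {
		if v == nil {
			continue // nil is treated as a missing value
		}
		k := classify(v)
		if kind == kindUnknown {
			kind = k // first non-nil value sets the candidate kind
		}
		if k == kindOther || k != kind {
			return kindOther
		}
	}
	return kind
}

func main() {
	// Made-up sample columns, shaped like the slices read in the snippet above.
	columns := []struct {
		name string
		vals []interface{}
	}{
		{"name", []interface{}{"a", nil, "b"}},
		{"spill", []interface{}{int64(3), int64(7)}},
		{"res", []interface{}{1.5, 2.25}},
	}
	for _, c := range columns {
		fmt.Printf("%s: %s\n", c.name, inferKind(c.vals))
	}
}
```

The inferred kind could then drive a switch that calls the matching series constructor, with kindOther and kindUnknown columns handled explicitly rather than guessed.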

CeciliaCoelho commented 3 years ago

@pjebs Did you manage to generalise it? Can't get it to work, getting this error.

[image: screenshot of the error]

pjebs commented 3 years ago

@CeciliaCoelho can you show me your code.

I was actually waiting for a response to these Qs: https://github.com/xitongsys/parquet-go/issues/360

CeciliaCoelho commented 3 years ago

> @CeciliaCoelho can you show me your code.
>
> I was actually waiting for a response to these Qs: xitongsys/parquet-go#360

Getting a new error now. This was a CSV that I converted to Parquet using Python, but I want to open and use it in Go for efficiency. The CSV looked like this: [image]

I have this code:

package main

import (
    "context"
    "log"
    "runtime"

    dataframe "github.com/rocketlaunchr/dataframe-go"
    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/reader"
)

func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
    // create local parquet/reader instances
    entriesFr, err := local.NewLocalFileReader(inputParquet)

    if err != nil {
        log.Println("Can't open file")
    }

    entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))

    if err != nil {
        log.Println("Unable to create parquet reader", err)
    }

    // determine number of rows in input parquet file
    numRows := int64(entriesPr.GetNumRows())

    // read columns from parquet and use them to construct a DataFrame instance of the
    // same form
    var id, name, res, spill, turb, pump []interface{}

    id, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.id", numRows)
    name, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.name", numRows)
    res, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.res", numRows)
    spill, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.spill", numRows)
    turb, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.turb", numRows)
    pump, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.pump", numRows)

    entries := dataframe.NewDataFrame(
        dataframe.NewSeriesString("id", nil, id...),
        dataframe.NewSeriesString("name", nil, name...),
        dataframe.NewSeriesString("res", nil, res...),
        dataframe.NewSeriesInt64("spill", nil, spill...),
        dataframe.NewSeriesInt64("turb", nil, turb...),
        dataframe.NewSeriesInt64("pump", nil, pump...),
    )

    entriesPr.ReadStop()
    entriesFr.Close()

    // sort entries by date of creation
    sortKey := []dataframe.SortKey{
        {Key: "datecreated", Desc: true},
    }

    ctx := context.Background()
    entries.Sort(ctx, sortKey)

    return entries
}

func main() {
    loadEntriesParquet("cascades2.parquet")
}

Now the error I'm getting is this: [image]

pjebs commented 3 years ago

The "new" error is because you are trying to sort based on a Series called "datecreated" but you don't have such a field.
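A simple guard against this kind of mistake is to compare the sort keys with the column names before sorting. The following is a minimal, library-independent sketch: missingKeys is a hypothetical helper, and the column names are hard-coded from the snippet above, whereas real code would take them from the DataFrame itself.

```go
package main

import (
	"fmt"
	"strings"
)

// missingKeys returns the sort keys that do not match any column name.
func missingKeys(columns, keys []string) []string {
	known := make(map[string]struct{}, len(columns))
	for _, c := range columns {
		known[c] = struct{}{}
	}
	var missing []string
	for _, k := range keys {
		if _, ok := known[k]; !ok {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// Column names used when building the DataFrame in the snippet above.
	columns := []string{"id", "name", "res", "spill", "turb", "pump"}
	// Sort key left over from the original example.
	keys := []string{"datecreated"}

	if m := missingKeys(columns, keys); len(m) > 0 {
		fmt.Println("cannot sort, unknown columns:", strings.Join(m, ", "))
		return
	}
	fmt.Println("all sort keys are valid")
}
```

Run as is, this reports datecreated as an unknown column; changing the key to one of the six existing names makes the check pass.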

CeciliaCoelho commented 3 years ago

> The "new" error is because you are trying to sort based on a Series called "datecreated" but you don't have such a field.

Oh right, didn't notice that. It's running now. Thanks :) How do I print or access the dataframe? (bet it's a stupid question, sorry I'm a Golang newbie)

pjebs commented 3 years ago

The function returns a *dataframe.DataFrame object. You can see examples of how to use it in the README.

However, looking at the code, it doesn't load the data into the DataFrame efficiently. I need to understand the Parquet package better before I can improve it.

pjebs commented 3 years ago

Parquet importing is now supported (experimental): @CeciliaCoelho @khughitt @space55