hosseinmoein / DataFrame

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
https://hosseinmoein.github.io/DataFrame/
BSD 3-Clause "New" or "Revised" License

Load DataFrame from Arrow Table/csv file #65

Closed swilson314 closed 4 years ago

swilson314 commented 4 years ago

Here is my early code for reading a csv file into a DataFrame via Apache Arrow. Before I flesh out all the data types, I wanted to verify that this approach looks good. Also, is there something like a pretty print for DataFrames?

// ArrowCsv.cpp
// SBW 2020.04.07

#include <cassert>
#include <cstdint>
#include <iostream>
#include <memory>
#include <numeric>
#include <string>
#include <vector>

#include "arrow/api.h"
#include "arrow/filesystem/localfs.h"
#include "arrow/csv/api.h"
#include "arrow/result.h"

// SBW 2020.04.03 Attach Arrow table to DataFrame.
#define LIBRARY_EXPORTS
#include <DataFrame/DataFrame.h>
using namespace hmdf;
typedef StdDataFrame<unsigned long> MyDataFrame;

using namespace std;
using namespace arrow;

template<typename I, typename  H>
bool TableToDataFrame(const Table& Tbl, DataFrame<I, H>& Df)
{
    int64_t Rows = Tbl.num_rows();
    int Cols = Tbl.num_columns();
    for (int c = 0; c < Cols; c++)
    {
        auto f = Tbl.field(c);
        const string& Name = f->name();
        int TypeId = f->type()->id();
        switch (TypeId)
        {
        case Type::STRING:
        {
            std::vector<string>& vec = Df.create_column<string>(Name.c_str());
            vec.assign(Rows, "");
            auto pChArray = Tbl.column(c);
            int NChunks = pChArray->num_chunks();
            int i = 0;
            for (int n = 0; n < NChunks; n++)
            {
                auto pArray = pChArray->chunk(n);
                int64_t ArrayRows = pArray->length();
                auto pTypedArray = std::static_pointer_cast<arrow::StringArray>(pArray);
                // const string* pData = pTypedArray->raw_values();
                for (int j = 0; j < ArrayRows; j++)
                    vec[i++] = pTypedArray->GetString(j);
            }
            break;
        }
        case Type::FLOAT:
        {
            std::vector<float>& vec = Df.create_column<float>(Name.c_str());
            vec.assign(Rows, 0.0);
            auto pChArray = Tbl.column(c);
            int NChunks = pChArray->num_chunks();
            int i = 0;
            for (int n = 0; n < NChunks; n++)
            {
                auto pArray = pChArray->chunk(n);
                int64_t ArrayRows = pArray->length();
                auto pTypedArray = std::static_pointer_cast<arrow::FloatArray>(pArray);
                const float* pData = pTypedArray->raw_values();
                for (int j = 0; j < ArrayRows; j++)
                    vec[i++] = pData[j];
            }
            break;
        }
        case Type::DOUBLE:
        {
            std::vector<double>& vec = Df.create_column<double>(Name.c_str());
            vec.assign(Rows, 0.0);
            auto pChArray = Tbl.column(c);
            int NChunks = pChArray->num_chunks();
            int i = 0;
            for (int n = 0; n < NChunks; n++)
            {
                auto pArray = pChArray->chunk(n);
                int64_t ArrayRows = pArray->length();
                auto pTypedArray = std::static_pointer_cast<arrow::DoubleArray>(pArray);
                const double* pData = pTypedArray->raw_values();
                for (int j = 0; j < ArrayRows; j++)
                    vec[i++] = pData[j];
            }
            break;
        }
        default:
            assert(false); // unknown type
        }
    }

    return(true);
}

int main(int argc, char *argv[])
{
    auto fs = make_shared<fs::LocalFileSystem>();
    auto r_input = fs->OpenInputStream("c:/temp/Test.csv");

    auto pool = default_memory_pool();
    auto read_options = arrow::csv::ReadOptions::Defaults();
    auto parse_options = arrow::csv::ParseOptions::Defaults();
    auto convert_options = arrow::csv::ConvertOptions::Defaults();

    auto r_table_reader = csv::TableReader::Make(pool, r_input.ValueOrDie(),
        read_options, parse_options, convert_options);
    auto r_read = r_table_reader.ValueOrDie()->Read();
    auto pTable = r_read.ValueOrDie();

    PrettyPrintOptions options{0};
    arrow::PrettyPrint(*pTable, options, &std::cout);
    //arrow::PrettyPrint(*pTable->schema(), options, &std::cout);

    // SBW 2020.04.03 Attach Arrow table to DataFrame.
    MyDataFrame df;
    // df_read.read("c:/temp/Test.csv");
    TableToDataFrame(*pTable, df);
    df.write<std::ostream, int, unsigned long, double, std::string>(std::cout);

    return 0;
}
swilson314 commented 4 years ago

If this looks ok, I'll use the following function to simplify the addition of the various numeric types.

template<typename I, typename  H, typename CppType, typename ArrType>
void NumericColumnToDataFrame(const Table& Tbl, DataFrame<I, H>& Df, int c)
{
    int64_t Rows = Tbl.num_rows();
    auto f = Tbl.field(c);
    const string& Name = f->name();
    std::vector<CppType>& vec = Df.create_column<CppType>(Name.c_str());
    vec.assign(Rows, 0.0);
    auto pChArray = Tbl.column(c);
    int NChunks = pChArray->num_chunks();
    int i = 0;
    for (int n = 0; n < NChunks; n++)
    {
        auto pArray = pChArray->chunk(n);
        int64_t ArrayRows = pArray->length();
        auto pTypedArray = std::static_pointer_cast<ArrType>(pArray);
        const CppType* pData = pTypedArray->raw_values();
        for (int j = 0; j < ArrayRows; j++)
            vec[i++] = pData[j];
    }
}
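
The chunk loop above is, at its core, the concatenation of several contiguous buffers into one vector. A minimal standard-library sketch of that step follows; `flatten_chunks` is a hypothetical name, and a vector of vectors stands in for Arrow's chunked array, so no Arrow or DataFrame call is shown. Reserving and appending avoids pre-filling the destination and tracking a running index by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: concatenates chunks into one contiguous vector.
// std::vector<std::vector<T>> is only a stand-in for a chunked array.
template <typename T>
std::vector<T> flatten_chunks(const std::vector<std::vector<T>>& chunks)
{
    std::size_t total = 0;
    for (const auto& chunk : chunks)
        total += chunk.size();

    std::vector<T> out;
    out.reserve(total);  // one allocation, no placeholder values written
    for (const auto& chunk : chunks)
        out.insert(out.end(), chunk.begin(), chunk.end());

    assert(out.size() == total);
    return out;
}
```

Empty chunks are handled naturally, since inserting an empty range is a no-op.
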
hosseinmoein commented 4 years ago

It looks fantastic, and it is a very useful addition. I have some mostly cosmetic suggestions:

  1. Don’t forget to use const. You can never overuse const. It makes the code easier to understand and helps the compiler optimize your code. For example, instead of int Cols = Tbl.num_columns() use const int Cols = Tbl.num_columns().
  2. Get rid of float. It is useless. Just combine it with double for DataFrame.
  3. If possible, do not create const references to std::string such as const string& Name = f->name(). Instead use const char *, if that is easy to do. The same goes for std::vector<std::string>.
  4. I follow the STL naming convention. Arrow does the same. I suggest you follow that convention too: function and variable names are all lower case, separated with _.
  5. Do not mix #include <xyz> with #include "xyz". Just use the former.
  6. I assume the DataFrame passed to your function already has an index column loaded. You cannot load a column unless you already have an index. In that case the length of the newly created column would be equal to the length of the index.
  7. There is no need for vec.assign(Rows, ""). It is already that way.
swilson314 commented 4 years ago

Thanks for the feedback.

  2. I often use float for large datasets because it requires less storage.
  6. The DataFrame I pass is empty, but the code works without adding an index. Arrow does not require its tables to have an index column. I'm not sure what to do here.
  7. The assign is to set the vector length. Maybe this is related to 6?
  10. Is there something like a pretty print function I can use to print the table upon completion?


hosseinmoein commented 4 years ago

Re 6: DataFrame must have an index. Just use a sequence index, as is done here: https://github.com/hosseinmoein/DataFrame/blob/master/test/dataframe_tester.cc#L2631. Do that with the right number of items before you create any column.

Re 10: see https://github.com/hosseinmoein/DataFrame/blob/master/test/dataframe_tester.cc#L355

swilson314 commented 4 years ago

Ok, here is my updated version.

// ArrowCsv.cpp: example for moving Apache Arrow Table (read from csv) to DataFrame.
// SBW 2020.04.07

#include <cassert>
#include <cstdint>
#include <iostream>
#include <memory>
#include <numeric>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/filesystem/localfs.h>
#include <arrow/csv/api.h>
#include <arrow/result.h>

// SBW 2020.04.03 Attach Arrow table to DataFrame.
#define LIBRARY_EXPORTS
#include <DataFrame/DataFrame.h>
using namespace hmdf;
typedef StdDataFrame<unsigned long> MyDataFrame;

using namespace std;
using namespace arrow;

// SBW 2020.04.07 Refactor with helper NumericColumnToDataFrame.
template<typename I, typename  H, typename CppType, typename ArrType>
void NumericColumnToDataFrame(const Table& tlb, DataFrame<I, H>& df, int c)
{
    const int64_t rows = tlb.num_rows();
    auto f = tlb.field(c);
    const string& name = f->name();
    // "template" is required here because df has a dependent type.
    std::vector<CppType>& vec = df.template create_column<CppType>(name.c_str());
    vec.assign(rows, 0);
    auto ch_arr = tlb.column(c);
    const int nchunks = ch_arr->num_chunks();
    int i = 0;
    for (int n = 0; n < nchunks; n++)
    {
        auto arr = ch_arr->chunk(n);
        int64_t arr_rows = arr->length();
        auto typed_arr = std::static_pointer_cast<ArrType>(arr);
        const CppType* data = typed_arr->raw_values();
        for (int j = 0; j < arr_rows; j++)
            vec[i++] = data[j];
    }
}

template<typename I, typename  H>
bool TableToDataFrame(const Table& tlb, DataFrame<I, H>& df)
{
    const int64_t rows = tlb.num_rows();
    const int cols = tlb.num_columns();

    // DataFrame requires sequence index.
    df.load_data(DataFrame<I, H>::gen_sequence_index(1, rows));

    for (int c = 0; c < cols; c++)
    {
        auto f = tlb.field(c);
        const string& name = f->name();
        int type_id = f->type()->id();
        switch (type_id)
        {
        case Type::STRING:
        {
            std::vector<string>& vec = df.template create_column<string>(name.c_str());
            vec.assign(rows, "");
            auto ch_arr = tlb.column(c);
            int nchunks = ch_arr->num_chunks();
            int i = 0;
            for (int n = 0; n < nchunks; n++)
            {
                auto arr = ch_arr->chunk(n);
                int64_t arr_rows = arr->length();
                auto typed_arr = std::static_pointer_cast<arrow::StringArray>(arr);
                for (int j = 0; j < arr_rows; j++)
                    vec[i++] = typed_arr->GetString(j);
            }
            break;
        }
        case Type::BOOL:
        {
            std::vector<bool>& vec = df.template create_column<bool>(name.c_str());
            vec.assign(rows, false);
            auto ch_arr = tlb.column(c);
            int nchunks = ch_arr->num_chunks();
            int i = 0;
            for (int n = 0; n < nchunks; n++)
            {
                auto arr = ch_arr->chunk(n);
                int64_t arr_rows = arr->length();
                auto typed_arr = std::static_pointer_cast<arrow::BooleanArray>(arr);
                for (int j = 0; j < arr_rows; j++)
                    vec[i++] = typed_arr->GetView(j);
            }
            break;
        }
        case Type::FLOAT:
            NumericColumnToDataFrame<I, H, float, arrow::FloatArray>(tlb, df, c);
            break;
        case Type::DOUBLE:
            NumericColumnToDataFrame<I, H, double, arrow::DoubleArray>(tlb, df, c);
            break;
        case Type::UINT8:
            NumericColumnToDataFrame<I, H, uint8_t, arrow::UInt8Array>(tlb, df, c);
            break;
        case Type::INT8:
            NumericColumnToDataFrame<I, H, int8_t, arrow::Int8Array>(tlb, df, c);
            break;
        case Type::UINT16:
            NumericColumnToDataFrame<I, H, uint16_t, arrow::UInt16Array>(tlb, df, c);
            break;
        case Type::INT16:
            NumericColumnToDataFrame<I, H, int16_t, arrow::Int16Array>(tlb, df, c);
            break;
        case Type::UINT32:
            NumericColumnToDataFrame<I, H, uint32_t, arrow::UInt32Array>(tlb, df, c);
            break;
        case Type::INT32:
            NumericColumnToDataFrame<I, H, int32_t, arrow::Int32Array>(tlb, df, c);
            break;
        case Type::UINT64:
            NumericColumnToDataFrame<I, H, uint64_t, arrow::UInt64Array>(tlb, df, c);
            break;
        case Type::INT64:
            NumericColumnToDataFrame<I, H, int64_t, arrow::Int64Array>(tlb, df, c);
            break;
        default:
            assert(false); // unsupported type
        }
    }

    return true;
}

int main(int argc, char *argv[])
{
    auto fs = make_shared<fs::LocalFileSystem>();
    auto r_input = fs->OpenInputStream("c:/temp/Test.csv");

    auto pool = default_memory_pool();
    auto read_options = arrow::csv::ReadOptions::Defaults();
    auto parse_options = arrow::csv::ParseOptions::Defaults();
    auto convert_options = arrow::csv::ConvertOptions::Defaults();

    auto r_table_reader = csv::TableReader::Make(pool, r_input.ValueOrDie(),
        read_options, parse_options, convert_options);
    auto r_read = r_table_reader.ValueOrDie()->Read();
    auto pTable = r_read.ValueOrDie();

    PrettyPrintOptions options{0};
    arrow::PrettyPrint(*pTable, options, &std::cout);
    //arrow::PrettyPrint(*pTable->schema(), options, &std::cout);

    // SBW 2020.04.03 Attach Arrow table to DataFrame.
    MyDataFrame df;
    // df_read.read("c:/temp/Test.csv");
    TableToDataFrame(*pTable, df);
    df.write<std::ostream, int, unsigned long, double, std::string>(std::cout);

    return 0;
}
swilson314 commented 4 years ago

Btw, does hmdf support the notion of category columns, to save space on string columns that have only a handful of distinct values?

hosseinmoein commented 4 years ago

I do not have a category type per se. But in C++ you can define your own enum class of any type and then create a column of that enum type.

hosseinmoein commented 4 years ago

To load index, instead of

df.load_data(DataFrame<I, H>::gen_sequence_index(1, rows));

do

df.load_index(DataFrame<I, H>::gen_sequence_index(1, rows));
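
For readers unfamiliar with the term, a sequence index is just a run of consecutive values, one per row. The sketch below builds one with the standard library only; `make_sequence_index` is a hypothetical helper, not DataFrame's gen_sequence_index(), and it takes an explicit element count so there is no ambiguity about whether an end bound is included.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a sequence-index generator: returns
// {start, start + 1, ..., start + count - 1}.
template <typename I>
std::vector<I> make_sequence_index(I start, std::size_t count)
{
    std::vector<I> index(count);
    std::iota(index.begin(), index.end(), start);  // fill with consecutive values
    return index;
}
```

Whatever generator is used, the point made in the thread still applies: the index must hold exactly one entry per table row before any column is created.
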
swilson314 commented 4 years ago

Thanks, changed to load_index(). My plan is to check Table columns to see if index already exists and use this if it does. Next I'll generate a DataFrameToTable() function so data can be saved as csv/arrow/parquet/etc.

swilson314 commented 4 years ago

Regarding categories, below is what I mean, based on what you can do in Python, where a column of strings can be changed into ints with a lookup table. From the user's point of view, it still looks like a column of strings.

df['Dataset'] = df['Dataset'].astype('category')
hosseinmoein commented 4 years ago

In Pandas category is a half-baked concept. I wouldn't use it. It is not based on numpy arrays. In C++ and DataFrame you can use enums to achieve the functionality of categories. see: https://en.wikipedia.org/wiki/Categorical_variable

swilson314 commented 4 years ago

How would you use enums on unknown text values?
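
One way to handle text values that are unknown at compile time is dictionary encoding: build the lookup table at run time and store small integer codes per row. The following is a minimal sketch under that assumption; `CategoryColumn` and `encode_categories` are hypothetical names and are not part of DataFrame or Arrow.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical category column: each distinct string is stored once in
// `labels`; the column itself holds integer codes into that table.
struct CategoryColumn
{
    std::vector<std::string>   labels;  // code -> string
    std::vector<std::uint32_t> codes;   // one code per row

    const std::string& at(std::size_t row) const { return labels[codes[row]]; }
};

inline CategoryColumn encode_categories(const std::vector<std::string>& values)
{
    CategoryColumn col;
    std::unordered_map<std::string, std::uint32_t> lookup;  // string -> code
    col.codes.reserve(values.size());
    for (const auto& v : values)
    {
        // try_emplace inserts only when v is new; the next free code is
        // the current number of labels.
        auto [it, inserted] =
            lookup.try_emplace(v, static_cast<std::uint32_t>(col.labels.size()));
        if (inserted)
            col.labels.push_back(v);
        col.codes.push_back(it->second);
    }
    return col;
}
```

Codes are assigned in order of first appearance, and `at(row)` gives back the original string, so the column still reads like a column of strings while storing each distinct value once.
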

swilson314 commented 4 years ago

I'm trying to implement a DataFrameToTable() function. I can't figure out how to get access to the columns if I don't know the column names. All the code I've examined iterates over column_tb_, but this is a private member, which I can't access in a non-member function. I don't see any way to get all the column names ... I suspect I just don't see it.

hosseinmoein commented 4 years ago

I can add something for that

hosseinmoein commented 4 years ago

@swilson314 I implemented get_columns_info() to give you column names

hosseinmoein commented 4 years ago

Oh, I also changed the Windows timezone setup. Please see if the timezone test still runs fine.

swilson314 commented 4 years ago

This assertion fails: assert(idx_vec1[0] == 1514782800). The value in idx_vec1[0] is 1514764800.

hosseinmoein commented 4 years ago

OK, I am going to revert the timezone change. But please let me know if get_columns_info() works for you.

swilson314 commented 4 years ago

No, the size of the vector returned by get_columns_info() is zero. (The capacity is correct.)

hosseinmoein commented 4 years ago

I don't understand what you mean. Does dataframe_tester_2 run fine? Also, see the documentation.

swilson314 commented 4 years ago

When I call get_columns_info() on a df with three columns, it returns a vector with size()==0. I don't think the code is working correctly, shouldn't the size be 3?

I haven't run dataframe_tester_2.

I don't understand what you mean by "see the documentation".

Here is my code. all_info.size()==0, which is incorrect since there are three columns.

template<typename I, typename  H>
arrow::Status DataFrameToTable(const DataFrame<I, H>& df, std::shared_ptr<Table>* table)
{
    auto all_info = df.get_columns_info();
    std::vector<std::shared_ptr<Array>> arrays;
    std::vector<std::shared_ptr<Field>> fields;
    for (int c = 0; c < all_info.size(); c++)
    {
        auto info = all_info[c];
        auto name = get<0>(info).c_str();
        auto size = get<1>(info);
        auto type = get<2>(info);

        if (type == std::type_index(typeid(std::string)))
        {
            auto col = df.get_column<std::string>(name);
            arrow::StringBuilder builder;
            builder.Resize(col.size());
            builder.AppendValues(col);
            shared_ptr<arrow::Array> array;
            arrow::Status st = builder.Finish(&array);
            arrays.push_back(array);
            std::shared_ptr<arrow::Field> field = arrow::field(name, arrow::utf8());
            fields.push_back(field);
        }
        else if (type == std::type_index(typeid(double)))
        {
            auto col = df.get_column<double>(name);
            arrow::DoubleBuilder builder;
            builder.Resize(col.size());
            builder.AppendValues(col);
            shared_ptr<arrow::Array> array;
            arrow::Status st = builder.Finish(&array);
            arrays.push_back(array);
            std::shared_ptr<arrow::Field> field = arrow::field(name, arrow::float64());  // double maps to float64, not float32
            fields.push_back(field);
        }
    }
    std::shared_ptr<Schema> schema = arrow::schema(fields);
    *table = Table::Make(schema, arrays);

    return Status::OK();
}
hosseinmoein commented 4 years ago

You have to specify the types. That is the whole promise of C++ DataFrame. C++ is statically typed, so you must know the types at compile time. Look at the example here: https://github.com/hosseinmoein/DataFrame/blob/master/test/dataframe_tester_2.cc#L250

swilson314 commented 4 years ago

Ok, I think get_columns_info() is working as designed. I guess I don't understand why get_columns_info() can't just return info for all columns without having to specify the types ... this just seems like a filter. Why can't I just get them all and then check the type_index if I only want specific types?

I'm still coming up to speed with the implementation, but it appears you can have multiple columns with the same name but different types. Is that right?

hosseinmoein commented 4 years ago

In C++ you cannot have a true heterogeneous container. You could simulate one which I am doing with DataFrame. So, DataFrame is simulating a true hetero-container. It doesn’t have a set of predefined types to support. You can add any built-in or user-defined type as a column at run time. This simulation is not free. It requires some tricks that cause some restrictions. Also, after all tricks and restrictions we cannot change the nature of C++ which is statically typed. That means, in C++, all types must be known at compile time. I use static containers keyed in a hash table to simulate a heterogeneous container. The key is address of a typed HeteroVector. Therefore I must know the type to be able to retrieve any column.
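
The static-container technique described above can be sketched roughly as follows. This is a simplified illustration, not DataFrame's actual code: `HeteroStore` and its members are hypothetical names, it needs C++17, and it holds a single vector per type rather than named columns.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch. Every element type T gets its own static hash table,
// keyed by the address of the owning container.
class HeteroStore
{
    template <typename T>
    inline static std::unordered_map<const HeteroStore*, std::vector<T>> vectors_;

    // One entry per type in use, so the destructor can erase this
    // object's vectors without knowing their types.
    std::vector<void (*)(HeteroStore&)> clearers_;

public:
    HeteroStore() = default;
    HeteroStore(const HeteroStore&) = delete;  // keys are addresses, so copying
    HeteroStore& operator=(const HeteroStore&) = delete;  // is disabled here

    ~HeteroStore()
    {
        for (auto& clear : clearers_)
            clear(*this);
    }

    // The caller must name T: it selects which static table to search.
    template <typename T>
    std::vector<T>& get()
    {
        auto it = vectors_<T>.find(this);
        if (it == vectors_<T>.end())
        {
            clearers_.push_back([](HeteroStore& s) { vectors_<T>.erase(&s); });
            it = vectors_<T>.emplace(this, std::vector<T>{}).first;
        }
        return it->second;
    }
};
```

Because the table to search is chosen by the template argument, there is no way to reach a stored vector without naming its element type at compile time, which is the restriction discussed in this comment.
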

Re: differently typed columns with the same name: that is a good catch. I have to fix it.

swilson314 commented 4 years ago

Thanks for the explanation. I don’t think it contradicts my point: you should be able to return a vector of column names/typeids without knowing any types.

Aside: In my use cases I typically have tens of thousands of columns. I’m concerned that accessing column data requires two lookups, one on the name and then the second into the static list of HeteroVector::vectors_.


hosseinmoein commented 4 years ago

"I don’t think it contradicts my point: you should be able to return a vector of column names/typeids without knowing any types."

This is not possible in C++, given you want a hetero vector. If you can write an example code that does that, you would be famous :-)

Accessing column data does not require two look ups. Finding the column requires one look up. Once you have the column, you access it like an STL vector.

swilson314 commented 4 years ago

It would be easy. Just cache the typeids, with the names, as the DF is built.

The fact that a user of the DF can’t even know the number of columns with the current interface is just weird. With the current implementation, I would always worry that there are "hidden" columns. This is a particular concern if you think about trying to save the DF to disk.
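
The caching idea can be illustrated with a small standalone sketch. It assumes a design in which every column is created through one typed entry point, so the type id and a type-erased size getter can be recorded at that moment. `ColumnRegistry` and `ColumnInfo` are hypothetical names, and this is not how DataFrame stores its columns.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <vector>

// Hypothetical sketch: none of these names come from DataFrame.
struct ColumnInfo
{
    std::string     name;
    std::type_index type;
    std::size_t     size;
};

class ColumnRegistry
{
    struct Slot
    {
        std::shared_ptr<void>        data;  // owns a std::vector<T>
        std::type_index              type;
        std::function<std::size_t()> size;  // remembers how to ask for size()
    };
    std::map<std::string, Slot> columns_;

public:
    // Creation is the one moment T is known, so the metadata is cached here.
    template <typename T>
    std::vector<T>& create_column(const std::string& name)
    {
        auto vec = std::make_shared<std::vector<T>>();
        columns_.insert_or_assign(
            name,
            Slot{vec, std::type_index(typeid(T)), [vec] { return vec->size(); }});
        return *vec;
    }

    // Reading the data still needs T at compile time.
    template <typename T>
    std::vector<T>& get_column(const std::string& name)
    {
        Slot& slot = columns_.at(name);  // throws std::out_of_range if absent
        if (slot.type != std::type_index(typeid(T)))
            throw std::invalid_argument("column type mismatch: " + name);
        return *static_cast<std::vector<T>*>(slot.data.get());
    }

    // Enumerating names, type ids and sizes needs no template arguments.
    std::vector<ColumnInfo> columns_info() const
    {
        std::vector<ColumnInfo> out;
        for (const auto& [name, slot] : columns_)
            out.push_back(ColumnInfo{name, slot.type, slot.size()});
        return out;
    }
};
```

The trade-off in this sketch is one heap allocation and one std::function per column. Enumerating metadata needs no types, while reading the data itself still does.
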

I see two lookups when I walk through the get_column() code. The first is in column_tb_, the second in vectors_:

column_tb_.find (name)
vectors_<T>.find (this)


hosseinmoein commented 4 years ago

First, it is not possible to cache the typeids, because then I cannot retrieve the column to get its size. To illustrate this, implement a vector for me in C++ (let’s call it YourVector) that I can do this with:

YourVector  vec;

vec.push_back(2);
vec.push_back(5.3);
vec.push_back(std::string("xyz"));
vec.push_back(std::vector<int>{1, 2, 3, 4});

vec[0] --> should return integer 2
vec[1] --> should return double 5.3
vec[2] --> should return string "xyz"
vec[3] --> should return vector {1, 2, 3, 4}

This is what DataFrame is. It is a vector of column vectors of different types. But it has those peculiarities that you noticed. What I said above was if you implement such vector without the peculiarities, you would be famous.
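
For comparison, the closest standard-library approximation of such a vector is a `std::vector` of `std::variant` over a closed list of types. The sketch below assumes C++17; `Cell` and `type_name_of` are hypothetical names introduced only for illustration.

```cpp
#include <string>
#include <variant>
#include <vector>

// The set of storable types is closed: it must be listed at compile time.
using Cell = std::variant<int, double, std::string, std::vector<int>>;

// Hypothetical helper: reports which alternative a cell holds at run time.
inline std::string type_name_of(const Cell& cell)
{
    struct Namer
    {
        std::string operator()(int) const { return "int"; }
        std::string operator()(double) const { return "double"; }
        std::string operator()(const std::string&) const { return "string"; }
        std::string operator()(const std::vector<int>&) const { return "vector<int>"; }
    };
    return std::visit(Namer{}, cell);
}
```

With this, `vec[0]` yields a `Cell`, not an `int`. The caller still has to write `std::get<int>(vec[0])` or go through `std::visit`, so the static type must be named somewhere, and new types cannot be added at run time.
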

Second, I am not sure what you mean by hidden vectors. There are no hidden vectors.

Third, you are correct. Once you are past those two hash lookups, you have an STL vector.