rapidsai / node

GPU-accelerated data science and visualization in node
https://rapidsai.github.io/node/
Apache License 2.0

Cannot execute a `sum` on a `DataFrame` created with `readParquet` #429

Closed maxime-petitjean closed 2 years ago

maxime-petitjean commented 2 years ago

If I try to execute this code:

const { DataFrame } = require('@rapidsai/cudf');
const frame = DataFrame.readParquet({ sourceType: 'files', sources: ['data.parquet'] });
const result = frame.sum(); // throws!

I get the error `sum operation requires dataframe to be entirely of dtype FloatingPoint OR Integral.`, but the parquet file contains only Float64 columns.

If I explicitly cast the columns to Float64, it works:

const { DataFrame, Float64 } = require('@rapidsai/cudf');
const frame = DataFrame.readParquet({ sourceType: 'files', sources: ['data.parquet'] });
const casted = frame.cast({ col1: new Float64(), col2: new Float64() });
const result = casted.sum(); // OK
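
For a file with many columns, the cast map could be built from a list of column names instead of being written by hand. This is only a sketch: `castMapFor` and `FakeFloat64` are hypothetical names used so the snippet runs without a GPU, and they are not part of `@rapidsai/cudf`.

```javascript
// Hypothetical helper: builds the { columnName: dtype } object that
// cast() receives in the workaround above. makeType is called once per
// column so every column gets its own dtype instance.
function castMapFor(names, makeType) {
  return Object.fromEntries(names.map((name) => [name, makeType()]));
}

// Stand-in dtype so the sketch runs on its own (assumption, not the real Float64).
class FakeFloat64 {}

const castMap = castMapFor(['col1', 'col2'], () => new FakeFloat64());
console.log(Object.keys(castMap));
```

With the real library this would presumably look like `frame.cast(castMapFor(frame.names, () => new Float64()))`, assuming `frame.names` returns the column names.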

If I log frame types I get:

The instance type of each column's type seems to be lost in the `readParquet` function (type serialisation?).
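
To illustrate what I mean, here is a minimal sketch of how an `instanceof`-style check can fail once a type object loses its prototype. The classes and functions below are hypothetical stand-ins, not the actual `@rapidsai/cudf` code, and I am only assuming the real check works in a similar way.

```javascript
// Hypothetical stand-in; this is NOT the @rapidsai/cudf Float64 class.
class Float64 {
  get typeName() { return 'Float64'; }
}

// A guard that relies on the prototype chain, similar in spirit to a
// "must be FloatingPoint or Integral" check.
function isNumeric(type) {
  return type instanceof Float64;
}

// A type created in JavaScript keeps its prototype.
const created = new Float64();

// A type that crossed a serialisation boundary arrives as a plain object:
// the data survives, the class identity does not.
const roundTripped = JSON.parse(JSON.stringify({ typeName: 'Float64' }));

// Re-wrapping the plain descriptor restores the instance type.
function restoreType(plain) {
  if (plain.typeName === 'Float64') return new Float64();
  throw new TypeError(`Unsupported type: ${plain.typeName}`);
}

console.log(isNumeric(created));                   // true
console.log(isNumeric(roundTripped));              // false
console.log(isNumeric(restoreType(roundTripped))); // true
```

If something like this is happening, it would explain why an explicit cast with `new Float64()` makes `sum` work again.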

trxcllnt commented 2 years ago

@maxime-petitjean thanks for the bug report! That sounds like we're not fixing the types coming from C++ after loading the parquet file. I'll make a PR real quick with a fix.