sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

How do I write a BigDecimal value? #177

Open xrl opened 5 years ago

xrl commented 5 years ago

I am loading numeric data from diesel with:

table! {
    currencies (id) {
        [[ SNIP ]]
        conversion_rate -> Nullable<Numeric>,
        [[ SNIP ]]
    }
}

which I cast to a struct with:

use bigdecimal::BigDecimal;

#[derive(Queryable, Debug)]
pub struct Currency {
    [[ SNIP ]]
    pub conversion_rate: Option<BigDecimal>,
    [[ SNIP ]]
}

The current ColumnWriter does not include an easily compatible decimal type:

/// Column writer for a Parquet type.
pub enum ColumnWriter {
  BoolColumnWriter(ColumnWriterImpl<BoolType>),
  Int32ColumnWriter(ColumnWriterImpl<Int32Type>),
  Int64ColumnWriter(ColumnWriterImpl<Int64Type>),
  Int96ColumnWriter(ColumnWriterImpl<Int96Type>),
  FloatColumnWriter(ColumnWriterImpl<FloatType>),
  DoubleColumnWriter(ColumnWriterImpl<DoubleType>),
  ByteArrayColumnWriter(ColumnWriterImpl<ByteArrayType>),
  FixedLenByteArrayColumnWriter(ColumnWriterImpl<FixedLenByteArrayType>)
}

I think I found the DECIMAL logical type definition here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal, and it looks like it's up to the writer to map the value onto int32/int64/fixed_len_byte_array/byte_array. Is that right? Should we have a ColumnWriter that picks the right type?
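
For concreteness, here is a rough sketch of what that manual mapping might look like for the byte_array case. The helper names are mine, and it assumes a column declared as BYTE_ARRAY with a DECIMAL annotation whose scale matches the value, plus the write_batch API on ColumnWriterImpl; the exact calls may differ:

use bigdecimal::BigDecimal;
use parquet::column::writer::ColumnWriter;
use parquet::data_type::ByteArray;
use parquet::errors::Result;

/// Unscaled, big-endian two's-complement bytes, which is what the
/// DECIMAL logical type stores in a BYTE_ARRAY column.
fn decimal_to_bytes(value: &BigDecimal) -> Vec<u8> {
    let (unscaled, _scale) = value.as_bigint_and_exponent();
    unscaled.to_signed_bytes_be()
}

/// Hypothetical helper: write a single nullable decimal value into a
/// BYTE_ARRAY column (`col_writer` would come from the row group writer).
fn write_decimal(col_writer: &mut ColumnWriter, value: Option<&BigDecimal>) -> Result<()> {
    match col_writer {
        ColumnWriter::ByteArrayColumnWriter(ref mut typed) => match value {
            Some(v) => {
                let bytes = ByteArray::from(decimal_to_bytes(v));
                // Definition level 1: the diesel column is Nullable<Numeric>.
                typed.write_batch(&[bytes], Some(&[1]), None)?;
            }
            None => {
                // Null: only a definition level of 0, no value.
                typed.write_batch(&[], Some(&[0]), None)?;
            }
        },
        _ => unimplemented!("expected a BYTE_ARRAY column for a DECIMAL field"),
    }
    Ok(())
}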

For now I'm just going to throw away the precision and write a double, but I want to come back and do this right 👍

But I see decimal support (read-only) in https://github.com/sunchao/parquet-rs/pull/103, so maybe that work could be extended to be compatible with ColumnWriter?

sadikovi commented 5 years ago

Yes, you are right. You would have to write the physical types and assign the DECIMAL logical type to them. You are also right about having something that abstracts writes for those fields; that could cover decimal, string, timestamp, date fields, etc.
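
For reference, a minimal sketch of assigning the DECIMAL logical type at the schema level with the string parser; the field name, precision, and scale are made up for illustration, and the unscaled value would then go through the corresponding column writer:

use parquet::schema::parser::parse_message_type;

fn main() {
    // Hypothetical schema: the unscaled value goes into a 16-byte
    // FIXED_LEN_BYTE_ARRAY annotated as DECIMAL(precision, scale).
    let message_type = "
        message currency {
            OPTIONAL FIXED_LEN_BYTE_ARRAY (16) conversion_rate (DECIMAL(38, 10));
        }
    ";
    let schema = parse_message_type(message_type).expect("schema should parse");
    assert_eq!(schema.get_fields().len(), 1);
}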

Not sure about adding this to column writer, but we definitely need something. Did you have any particular design in mind?

xrl commented 5 years ago

What do you think of putting BigDecimal support behind a feature flag? Then that could add a variant to the column writer?

sadikovi commented 5 years ago

I think you mean Decimal. What Decimal support are you talking about? Decimal is not a Parquet physical type, and we already have column writers for each of the physical types. You can write decimal values in three different ways. I will start working on a record writer; this should solve most of your problems.
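
For context, the format spec allows DECIMAL to be backed by int32, int64, fixed_len_byte_array, or byte_array depending on precision. A small illustrative helper (not part of parquet-rs) showing how a writer might pick among them:

use parquet::basic::Type as PhysicalType;

/// Illustrative helper: pick a physical type for a DECIMAL column from its
/// precision, following the thresholds in the Parquet format spec.
fn physical_type_for_decimal(precision: u32) -> PhysicalType {
    if precision <= 9 {
        PhysicalType::INT32 // unscaled value fits in a 4-byte signed int
    } else if precision <= 18 {
        PhysicalType::INT64 // unscaled value fits in an 8-byte signed int
    } else {
        // Larger precisions store the unscaled value as bytes; BYTE_ARRAY
        // also works if a variable-length encoding is preferred.
        PhysicalType::FIXED_LEN_BYTE_ARRAY
    }
}

fn main() {
    assert_eq!(physical_type_for_decimal(38), PhysicalType::FIXED_LEN_BYTE_ARRAY);
}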

xrl commented 5 years ago

I was talking about BigDecimal as the popular (or is it?) Rust library for handling arbitrary-precision numbers. The diesel library activates its support like this:

diesel = { version = "1.0.0", features = ["numeric"] }

This activates the bigdecimal crate dependency and turns on some modules in the diesel library. Something similar would allow parquet-rs users to get native BigDecimal support without forcing the dependency on all users.
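
Nothing like this exists in parquet-rs today, but a sketch of what the gated side could look like, with made-up module and function names, just mirroring the diesel pattern:

// Only compiled when the (hypothetical) `bigdecimal` cargo feature is enabled,
// so users who don't opt in never pull in the extra dependency.
#[cfg(feature = "bigdecimal")]
pub mod bigdecimal_support {
    use bigdecimal::BigDecimal;
    use parquet::data_type::ByteArray;

    /// Convert a BigDecimal's unscaled value into the big-endian
    /// two's-complement bytes a DECIMAL-annotated BYTE_ARRAY column expects.
    pub fn to_byte_array(value: &BigDecimal) -> ByteArray {
        let (unscaled, _scale) = value.as_bigint_and_exponent();
        ByteArray::from(unscaled.to_signed_bytes_be())
    }
}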

sadikovi commented 5 years ago

Fair enough. Sorry, I feel like I lost the context.

Does this answer your question(s)?

xrl commented 5 years ago

Yes, this one: "If you are talking about some generic BigDecimal crate support in parquet-rs"

Having to translate to scale/precision myself seems like doing the conversion too early. Is it common to work with scale/precision directly? Are there many popular options for arbitrary-precision values?

Would you be open to transparent BigDecimal serialization/deserialization?