lquerel / gcp-bigquery-client

GCP BigQuery Client (Rust)
Apache License 2.0

Unable to write double/f64 on float columns using the Storage Write API #106

Closed · fungiboletus closed this 1 week ago

fungiboletus commented 4 months ago

Hi,

I am testing the new Storage Write API and I've encountered some issues with FLOAT/FLOAT64 columns. As I understand it, BigQuery only has 64-bit float columns, but I only manage to write 32-bit floats from Rust to those columns.

On version 0.22.0, writing double/f64 values fails silently, with NULL being inserted instead.

My table schema looks like this:

TableSchema::new(vec![
  TableFieldSchema::new("str", FieldType::String),
  TableFieldSchema::new("float", FieldType::Float), // is apparently 64 bits
  TableFieldSchema::new("float64", FieldType::Float64), // supposedly the same
  TableFieldSchema::new("numeric", FieldType::Numeric), 
])

My field descriptors look like this:

let field_descriptors = vec![
  FieldDescriptor {
    name: "str".to_string(),
    number: 1,
    typ: ColumnType::String,
  },
  FieldDescriptor {
    name: "float".to_string(),
    number: 2,
    typ: ColumnType::Float64,
  },
  FieldDescriptor {
    name: "float64".to_string(),
    number: 3,
    typ: ColumnType::Float64,
  },
  FieldDescriptor {
    name: "numeric".to_string(),
    number: 4,
    typ: ColumnType::Bytes,
  },
];

And my Prost struct looks like this:

#[derive(Clone, PartialEq, prost::Message)]
struct Sample {
    #[prost(string, tag = "1")]
    str: String,
    #[prost(float, tag = "2")]
    float: f32,
    #[prost(double, tag = "3")]
    float64: f64,
    #[prost(bytes, tag = "4")]
    numeric: Vec<u8>,
}

The String, f32, and numeric values do work well, though.

Have you experienced the same?

lquerel commented 3 weeks ago

I didn’t implement this feature directly; it was a contribution. @imor, do you have any idea why the f64 isn’t working?

imor commented 3 weeks ago

@fungiboletus let me take a look.

imor commented 3 weeks ago

Hey @fungiboletus, I've started a fix in the draft PR https://github.com/lquerel/gcp-bigquery-client/pull/112. One thing I found a bit weird is that f32 values appear to have some precision loss in BigQuery. E.g. when I inserted an f32 with value 1.2, it appeared as 1.2000000476837158. I've observed the exact same behaviour with the official Python library as well, so I'm not sure if it's a bug in our implementation or something wrong with BigQuery. Can you test a bit with the branch in the PR and share your results? I'm also looking into whether we need to add support for other types and will include any that are missing.

fore5fire commented 2 weeks ago

@imor this isn't loss of precision in the traditional sense, but rather that 1.2 can't be exactly represented by IEEE 754 binary floats in the first place. It's not BigQuery-specific either; just casting an f32 to an f64 in pure Rust does the same: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=469b96b4a3165289bef2bfd75a91a55d
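
For reference, that playground link boils down to a cast like this (a minimal sketch of the same behaviour):

fn main() {
    // An f32 stores the nearest representable value to 1.2; widening it
    // to f64 preserves that value, which then prints with extra digits.
    let f: f32 = 1.2;
    println!("{}", f as f64); // 1.2000000476837158
}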

And given that BigQuery only has a FLOAT64 type and no FLOAT32, I think this is unsurprising and unavoidable behavior. As a general rule, if you need exact values, then floating point is the wrong format to begin with.

imor commented 2 weeks ago

@fore5fire thanks for the comment. Makes sense. I suspected this, but was then confused by the fact that when I inserted into BigQuery using SQL, 1.2 was shown in its exact representation. I think that could be because SQL converts the literal 1.2 directly into a FLOAT64, while we were going through an f32. I haven't spent too much time on it, to be honest. I'll test a bit more and mark the PR as ready.

imor commented 2 weeks ago

I spent some time on this today. The immediate cause of the reported issue was that the Prost struct serialized the float64 field as the protobuf double type, while in the FieldDescriptorProto it was set as float (ref). This discrepancy resulted in BigQuery ignoring the field and inserting null instead. When I made the field required, the API started failing with the following error:

RowError {
    index: 0,
    code: FieldsError,
    message: "Field value of f64 cannot be empty.",
},

As in the current draft PR, it can be corrected by mapping ColumnType::Float64 to Type::Double. But a careful look at the code exposed a misunderstanding on my part: I thought ColumnType represented BigQuery column types, but in reality it should represent protobuf types.
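
A minimal sketch of the corrected mapping, assuming prost-types' field_descriptor_proto::Type enum (the ColumnType here is an illustrative stand-in covering only the variants from the example above, not the actual PR code):

use prost_types::field_descriptor_proto::Type;

// Stand-in for the crate's ColumnType, reduced to the variants used above.
enum ColumnType {
    String,
    Float64,
    Bytes,
}

// Map a column type to the protobuf wire type recorded in the
// FieldDescriptorProto sent to the Storage Write API.
fn proto_type(typ: &ColumnType) -> Type {
    match typ {
        ColumnType::String => Type::String,
        // The fix: Float64 previously mapped to Type::Float, so the
        // descriptor disagreed with the `double` bytes prost produced.
        ColumnType::Float64 => Type::Double,
        ColumnType::Bytes => Type::Bytes,
    }
}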

This analysis also exposed another weakness of the current design: the same information is specified twice, once as prost annotations on the struct and a second time as a FieldDescriptor. The Python library solves this (I think) via reflection, where you do something like this:

# define protobuf schema once in the class
class Actor(proto.Message):
    actor_id = proto.Field(proto.INT64, number=1)
    float = proto.Field(proto.FLOAT, number=2)
    double = proto.Field(proto.DOUBLE, number=3)

proto_descriptor = DescriptorProto()
# then copy it to `proto_descriptor`
Actor.pb().DESCRIPTOR.CopyToProto(proto_descriptor)

Doing something similar in Rust would mean updating prost's derive(Message) macro, which I'm not sure the prost project would accept as a PR, or perhaps writing our own derive macro. Since that is quite a lot of work, for now I'm going to update the ColumnType enum to hold the basic protobuf field types (except Enum, Message, and Group), and users will have to be careful to keep the types consistent between the prost annotations and the TableSchema.

Edit: the above is unfortunately a breaking change, but that is the correct way to fix the issue.

The precision loss issue is also confirmed to be due to the conversion from an f32 to an f64. Interestingly, an f64 can represent 1.2 exactly; it's only during the conversion that this happens. To avoid it, users can use f64 directly, and no fix is needed in the code for this.
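
For the original example, that means declaring both float columns as f64 with prost's double type, e.g. (a sketch adapting the struct from the report above):

// Both FLOAT64 columns use f64 / prost `double`, matching
// ColumnType::Float64 in the field descriptors.
#[derive(Clone, PartialEq, prost::Message)]
struct Sample {
    #[prost(string, tag = "1")]
    str: String,
    #[prost(double, tag = "2")] // was `float` / f32
    float: f64,
    #[prost(double, tag = "3")]
    float64: f64,
    #[prost(bytes, tag = "4")]
    numeric: Vec<u8>,
}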

fungiboletus commented 2 weeks ago

Thanks a lot for your work on this issue. I remember being slightly confused about the various mappings, so I simply copy-pasted the examples and tried all permutations to make f64 work, hoping for the best 😊

I think that specifying the information twice is an acceptable tradeoff.

Do you still want me to test the PR or should I wait a bit for the breaking change to happen?

imor commented 2 weeks ago

@fungiboletus no testing needed for now; I've marked the PR as ready to merge. You can use the PR's branch if you want the code with the fix. Note that it contains breaking changes, so minor adjustments to your code are expected.

fore5fire commented 2 weeks ago

> I'm going to update the ColumnType enum to be basic protobuf field types (except Enum, Message and Group)

@imor Is there a specific reason, besides the additional work required in adding the nested descriptors, that you're not adding Enum and Message types? I'm working on a PR which adds support for these, and it would be helpful to know of any specific blockers you had in mind.

imor commented 2 weeks ago

> Is there a specific reason besides the additional work required in adding the nested descriptors that you're not adding Enum and Message types?

I thought there were no corresponding BigQuery types for these, but I might be mistaken. I guess Enums could be mapped to integers (or strings), maybe? I'm not certain. If you have ideas about how to add support for this, I'd love a PR for this work.

fore5fire commented 2 weeks ago

Good to hear there's no big blocker I wasn't aware of. Per the docs, enums map to strings and messages/structs map to records. I'll incorporate your type updates and open a PR once I've tested it a bit in my project.
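
A rough sketch of what that could look like on the prost side (hypothetical types, not from the PR; the mapping of enum to STRING and message to RECORD is per the docs mentioned above):

// An enum field would map to a BigQuery STRING...
#[derive(Clone, Copy, Debug, PartialEq, Eq, prost::Enumeration)]
#[repr(i32)]
enum Status {
    Unknown = 0,
    Active = 1,
}

// ...and a nested message would map to a RECORD.
#[derive(Clone, PartialEq, prost::Message)]
struct Address {
    #[prost(string, tag = "1")]
    city: String,
}

#[derive(Clone, PartialEq, prost::Message)]
struct Person {
    #[prost(enumeration = "Status", tag = "1")]
    status: i32,
    #[prost(message, optional, tag = "2")]
    address: Option<Address>,
}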

imor commented 2 weeks ago

> Interestingly, an f64 can represent 1.2 exactly, it's only during conversion that this happens

A correction regarding this statement I made: an f32 can't represent 1.2 exactly (and neither can an f64); it only appeared to because a limited number of digits were printed. When we print more digits, it becomes clear that there's no precision loss in the conversion itself:

fn main() {
    let f: f32 = 1.2;
    println!("{:.32}", f);
    println!("{:.32}", f as f64);
}

The above prints 32 digits after the decimal point, resulting in the following output:

1.20000004768371582031250000000000
1.20000004768371582031250000000000