I didn’t implement this feature directly; it was a contribution. @imor, do you have any idea why the f64 isn’t working?
@fungiboletus let me take a look.
Hey @fungiboletus, I've started a fix in the draft PR https://github.com/lquerel/gcp-bigquery-client/pull/112. One thing I found a bit weird is that f32 values appear to lose some precision in BigQuery. E.g. when I inserted an f32 with value 1.2, it appeared as 1.2000000476837158. I've observed the exact same behaviour with the official Python library as well, so I'm not sure if it's a bug in our implementation or something wrong with BigQuery. Can you test a bit with the branch in the PR and share your results? I'm also checking whether we need to add support for other types and will include any that are missing.
@imor this isn't loss of precision in the traditional sense, but rather just that 1.2 can't be exactly represented by IEEE 754 64-bit floats. It's not BigQuery specific either; just casting an f32 to an f64 in pure Rust does the same: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=469b96b4a3165289bef2bfd75a91a55d
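For reference, a minimal inline version of that playground snippet:

```rust
fn main() {
    let x: f32 = 1.2; // stored as the nearest f32, not exactly 1.2
    println!("{}", x);        // prints 1.2 (shortest form that round-trips the f32)
    println!("{}", x as f64); // prints 1.2000000476837158 (the f32's true value, now visible)
}
```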
And given that BigQuery only has a FLOAT64 type and no FLOAT32, I think this is very unsurprising and unavoidable behavior. As a general rule, if you need exact values then floating point is the wrong format to begin with.
@fore5fire thanks for the comment. Makes sense. I suspected this but was then confused by the fact that when I inserted into BigQuery using SQL, 1.2 was shown in its exact representation. I think that could be due to SQL converting the literal value 1.2 directly into a FLOAT64, while we were using an f32. I haven't spent too much time on it, to be honest. I'll test a bit more and mark the PR as ready.
I spent some time on this today. The immediate cause of the reported issue was that the Prost struct serialized the float64 field as the double protobuf type, while in the FieldDescriptorProto it was set as float (ref). This discrepancy resulted in BigQuery ignoring the field and inserting null instead. When I made the field required, the API started failing with the following error:
RowError {
    index: 0,
    code: FieldsError,
    message: "Field value of f64 cannot be empty.",
},
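In other words, the two declarations disagreed, roughly like this (struct and field names are hypothetical, for illustration only):

```rust
use prost::Message;

// The prost annotation serializes `value` with the protobuf `double` wire type...
#[derive(Clone, PartialEq, Message)]
struct Row {
    #[prost(double, tag = "1")]
    value: f64,
}

// ...while the FieldDescriptorProto sent with the schema declared the same
// field as `float`, so BigQuery ignored the mismatched field and wrote null.
```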
As in the current draft PR, it can be corrected by mapping ColumnType::Float64 to Type::Double. But a careful look at the code exposed a misunderstanding on my part: I thought ColumnType represented BigQuery column types, but in reality it should represent protobuf types.
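A sketch of that corrected mapping, assuming the crate's ColumnType and prost_types' field_descriptor_proto::Type (only the arm the draft PR fixes is shown; proto_type is a hypothetical helper name):

```rust
use gcp_bigquery_client::storage::ColumnType;
use prost_types::field_descriptor_proto::Type;

// Translate the crate's ColumnType into the protobuf field type that goes
// into the FieldDescriptorProto sent to BigQuery.
fn proto_type(column_type: &ColumnType) -> Type {
    match column_type {
        ColumnType::Float64 => Type::Double, // previously mapped to Type::Float
        _ => unimplemented!("remaining variants elided in this sketch"),
    }
}
```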
This analysis also exposed another weakness of the current design: specifying the same information twice, once as prost annotations on the struct and a second time as a FieldDescriptor. The Python library solves this (I think) by reflection, where you do something like this:
# define protobuf schema once in the class
class Actor(proto.Message):
    actor_id = proto.Field(proto.INT64, number=1)
    float = proto.Field(proto.FLOAT, number=2)
    double = proto.Field(proto.DOUBLE, number=3)

proto_descriptor = DescriptorProto()
# then copy it to `proto_descriptor`
Actor.pb().DESCRIPTOR.CopyToProto(proto_descriptor)
Doing something similar in Rust means updating prost's derive(Message) macro, which I'm not sure the prost project would accept as a PR, or perhaps writing our own derive macro. Since that is quite a lot of work, for now I'm going to update the ColumnType enum to be the basic protobuf field types (except Enum, Message, and Group), and users will have to be careful to keep the types the same between the prost annotations and the TableSchema.
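Under that plan the enum would roughly mirror the protobuf scalar types; a sketch of the intended shape (not the exact final code):

```rust
// protobuf scalar field types, minus Enum, Message, and Group
pub enum ColumnType {
    Double,
    Float,
    Int64,
    Uint64,
    Int32,
    Fixed64,
    Fixed32,
    Bool,
    String,
    Bytes,
    Uint32,
    Sfixed32,
    Sfixed64,
    Sint32,
    Sint64,
}
```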
Edit: the above is unfortunately a breaking change, but that is the correct way to fix the issue.
The precision loss issue is also confirmed to be due to the conversion from an f32 to an f64. Interestingly, an f64 can represent 1.2 exactly; it's only during the conversion that this happens. To avoid it, users can use an f64 directly, and no fix is needed in the code for this.
Thanks a lot for your work on this issue. I remember being slightly confused about the various mappings, so I simply copy-pasted the examples and tried all permutations to make f64 work, hoping for the best 😊
I think that specifying the information twice is an acceptable tradeoff.
Do you still want me to test the PR or should I wait a bit for the breaking change to happen?
@fungiboletus no testing needed for now, I've marked the PR ready to merge. You can use the PR's branch if you want the code with the fix. Note that it includes breaking changes, so minor adjustments to your code are expected.
> I'm going to update the ColumnType enum to be the basic protobuf field types (except Enum, Message, and Group)
@imor Is there a specific reason besides the additional work required in adding the nested descriptors that you're not adding Enum and Message types? I'm working on a PR which adds support for these, and it would be helpful to know of any specific blockers you had in mind
> Is there a specific reason besides the additional work required in adding the nested descriptors that you're not adding Enum and Message types?
I thought there were no corresponding BigQuery types for these. I might be mistaken. I guess Enums can be mapped to integers (or strings) maybe? I'm not certain. If you have ideas about how to add support for this, I'd love a PR for this work.
Good to hear there's not some big blocker I wasn't aware of. Per the docs, enums map to strings and messages/structs map to records. I'll incorporate your type updates and open a PR once I test it a bit in my project.
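On the prost side, the nested case might look something like this (hypothetical types; the descriptor plumbing for nested messages is the part the PR would need to add):

```rust
use prost::Message;

// A nested Prost message that would map to a BigQuery RECORD column.
#[derive(Clone, PartialEq, Message)]
struct Address {
    #[prost(string, tag = "1")]
    city: String,
}

#[derive(Clone, PartialEq, Message)]
struct Person {
    #[prost(string, tag = "1")]
    name: String,
    // Message fields are optional in prost, hence Option<Address>.
    #[prost(message, optional, tag = "2")]
    address: Option<Address>,
}
```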
> Interestingly, an f64 can represent 1.2 exactly; it's only during the conversion that this happens
A correction regarding this statement I made: an f32 can't represent 1.2 exactly; it only appeared that it did because only a limited number of digits were printed. When we print more digits, it becomes clear that there's no precision loss in the f32-to-f64 conversion:
fn main() {
    let f: f32 = 1.2;
    println!("{:.32}", f);
    println!("{:.32}", f as f64);
}
The above prints 32 digits after the decimal point, resulting in the following output:
1.20000004768371582031250000000000
1.20000004768371582031250000000000
Hi,
I am testing the new Storage Write API and I am encountering some issues with FLOAT/FLOAT64 columns. As I understand it, BigQuery only has 64-bit float columns, but I only manage to write 32-bit floats from Rust to those columns. On version 0.22.0, double/f64 fails silently, with NULL values being inserted instead.

My table schema looks like this:
My field descriptors like this:
My Prost struct like this:
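A minimal sketch of such a setup (table, field names, and tags are illustrative; the FieldDescriptor and ColumnType usage assumes the 0.22 storage module):

```rust
use gcp_bigquery_client::storage::{ColumnType, FieldDescriptor};
use prost::Message;

// Table assumed created with: CREATE TABLE my_dataset.my_table (value FLOAT64)

// Prost struct for the row payload; `value` is serialized as protobuf `double`.
#[derive(Clone, PartialEq, Message)]
struct Row {
    #[prost(double, tag = "1")]
    value: f64,
}

fn main() {
    // Field descriptors handed to the Storage Write API alongside the rows.
    let field_descriptors = vec![FieldDescriptor {
        name: "value".to_string(),
        number: 1,
        typ: ColumnType::Float64,
    }];
    let row = Row { value: 1.2 };
    // ... append `row` through the write stream using `field_descriptors` ...
    let _ = (field_descriptors, row);
}
```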
The String, f32, and numeric values work well, though.
Have you experienced the same?